Observation & Experiments Watch, listen, and learn…
Jan 17, 2016
Observing Users
Qualitative & quantitative
End users
Experimental or naturalistic
One of the best ways to gather feedback about your interface
Watch, listen and learn as a person interacts with your system
Observation
Direct
– In same room
– Can be intrusive
– Users aware of your presence
– May use 1-way mirror to reduce intrusiveness
Indirect
– Video (cameras) or app (software logging) recording
– Reduces intrusiveness, but doesn’t eliminate it
– Gives archival record, but can spend a lot of time reviewing it
Location
Observations may be
– In lab - maybe a specially built usability lab
   Easier to control
   Can have user complete set of tasks
– In field
   Watch their everyday actions
   More realistic
   Harder to control other factors
Observation Room
This observation room is equipped with three monitors to view the participant, the participant's monitor, and a composite picture-in-picture.
One-way mirror plus angled glass captures light and isolates sound between rooms.
Comfortable and spacious for three people, but room enough for six seated observers.
Digital mixer for unlimited mixing of input images and recording to VHS, SVHS, or MiniDV recorders.
Other examples: http://www.noldus.com/site/doc200406061
Task Selection
What tasks are people performing?
– Representative and realistic?
– Tasks dealing with specific parts of the interface you want to test?
– Problematic tasks?
Don’t forget to pilot your entire evaluation!!
– A story
Engaging Users in Evaluation
What’s going on in the user’s head?
Use verbal protocol where users describe their thoughts
Qualitative techniques
– Think-aloud - can be very helpful
– Post-hoc verbal protocol - review video
– Critical incident logging - positive & negative
– Structured interviews - good questions
   “What did you like best/least?”
   “How would you change..?”
Think Aloud
User describes verbally what s/he is thinking and doing
– What they believe is happening
– Why they take an action
– What they are trying to do
Widely used, popular protocol
Potential problems:
– Can be awkward for participant
– Thinking aloud can modify way user performs task
Cooperative approach
Another technique: Co-discovery learning (Constructive interaction)
– Join pairs of participants to work together
– Use think aloud
– Perhaps have one person be semi-expert (coach) and one be novice
– More natural (like conversation) so removes some awkwardness of individual think aloud
Variant: let coach be from design team (cooperative evaluation)
Alternative
What if thinking aloud during session will be too disruptive?
Can use post-event protocol
– User performs session, then watches video afterwards and describes what s/he was thinking
– Sometimes difficult to recall
– Opens the door to interpretation after the fact
What if a user gets stuck?
Decide ahead of time what you will do.
– Offer assistance or not? What kind of assistance?
You can ask (in cooperative evaluation)
– “What are you trying to do..?”
– “What made you think..?”
– “How would you like to perform..?”
– “What would make this easier to accomplish..?”
– Maybe offer hints
– This is why cooperative approaches are used
Inputs / Outcomes
Need operational prototype
– Could use Wizard of Oz simulation
What you get out
– “Process” or “how-to” information
– Errors, problems with the interface
– Compare user’s (verbalized) mental model to designer’s intended model
Capturing a Session
Paper & pencil
– Can be slow
– May miss things
– Is definitely cheap and easy
[Figure: sample paper log sheet with Time entries (10:00, 10:03, 10:08, 10:22) and Task entries (Task 1, Task 2, Task 3, …)]
Capturing a Session
Recording (screen, audio and/or video)
– Good for think-aloud
– Multiple cameras may be needed
– Good, rich record of session
– Can be intrusive
– Can be painful to transcribe and analyze
Usability software:
– Morae by TechSmith
– Ovo Studios
– Screencorder and other screen recording applications
Capturing a Session
Software logging
– Modify software to log user actions
– Can give time-stamped key press or mouse event
– Two problems:
   May be too low-level, want higher level events
   Massive amount of data, need analysis tools
Example logs
2303761098721869683|hrichter|1098722080134|MV|START|566
2303761098721869683|hrichter|1098722122205|MV|QUESTION|false|false|false|false|false|false|
2303761098721869683|hrichter|1098724978982|MV|TAB|AGENDA
2303761098721869683|hrichter|1098724981146|MV|TAB|PRESENTATION
2303761098721869683|hrichter|1098724985161|MV|SLIDECHANGE|5
2303761098721869683|hrichter|1098724986904|MV|SEEK|PRESENTATION-A|566|604189|0
2303761098721869683|hrichter|1098724996257|MV|SEEK|PRESENTATION-A|566|604189|604189
2303761098721869683|hrichter|1098724998791|MV|SEEK|PRESENTATION-A|566|604189|604189
2303761098721869683|hrichter|1098725002506|MV|TAB|AGENDA
2303761098721869683|hrichter|1098725003848|MV|SEEK|AGENDA|566|149613|604189
2303761098721869683|hrichter|1098725005981|MV|TAB|PRESENTATION
2303761098721869683|hrichter|1098725007133|MV|SLIDECHANGE|3
2303761098721869683|hrichter|1098725009326|MV|SEEK|PRESENTATION|566|315796|149613
2303761098721869683|hrichter|1098725011569|MV|PLAY|566|315796
2303761098721869683|hrichter|1098725039850|MV|TAB|AV
2303761098721869683|hrichter|1098725054241|MV|TAB|PRESENTATION
2303761098721869683|hrichter|1098725056053|MV|SLIDECHANGE|2
2303761098721869683|hrichter|1098725057365|MV|SEEK|PRESENTATION|566|271191|315796
2303761098721869683|hrichter|1098725064986|MV|TAB|AV
2303761098721869683|hrichter|1098725083373|MV|TAB|PRESENTATION
2303761098721869683|hrichter|1098725084534|MV|TAB|AGENDA
2303761098721869683|hrichter|1098725085255|MV|TAB|PRESENTATION
2303761098721869683|hrichter|1098725088690|MV|TAB|AV
2303761098721869683|hrichter|1098725130500|MV|TAB|AGENDA
2303761098721869683|hrichter|1098725139643|MV|TAB|AV
2303761098721869683|hrichter|1098726430039|MV|STOP|566|271191
2303761098721869683|hrichter|1098726432482|MV|END
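One way to turn a log like this into something analyzable is to split each record on the pipe character and count event types. The sketch below is illustrative only: the field names (session, user, time_ms, app, event) are guesses from the shape of the records, and the third field is assumed to be a millisecond timestamp. The sample holds five records copied from the log.

```python
from collections import Counter

# Five records copied from the example log. Field meanings are assumed.
SAMPLE = """\
2303761098721869683|hrichter|1098722080134|MV|START|566
2303761098721869683|hrichter|1098724978982|MV|TAB|AGENDA
2303761098721869683|hrichter|1098724981146|MV|TAB|PRESENTATION
2303761098721869683|hrichter|1098724985161|MV|SLIDECHANGE|5
2303761098721869683|hrichter|1098726432482|MV|END
"""

def parse_record(line):
    """Split one pipe-delimited record into a dict (field names are guesses)."""
    session, user, timestamp, app, event, *args = line.strip().split("|")
    return {"session": session, "user": user, "time_ms": int(timestamp),
            "app": app, "event": event, "args": args}

def summarize(text):
    """Return (event counts, elapsed ms between first and last record)."""
    records = [parse_record(line) for line in text.splitlines() if line.strip()]
    counts = Counter(r["event"] for r in records)
    duration_ms = records[-1]["time_ms"] - records[0]["time_ms"]
    return counts, duration_ms

counts, duration_ms = summarize(SAMPLE)
print(counts)
print(f"session length: {duration_ms / 60000:.1f} minutes")
```

Counting events addresses the "too low-level" problem only partly; grouping runs of events into higher-level actions still needs task-specific rules.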
Analysis
Many approaches
Task based
– How do users approach the problem
– What problems do users have
– Need not be exhaustive, look for interesting cases
Performance based
– Frequency and timing of actions, errors, task completion, etc.
Can be very time consuming!!
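As a small instance of performance-based analysis, the timing of actions can be derived from logged events. The sketch below sums how long each tab stayed selected, using four consecutive TAB events from the example log; treating the third field as a millisecond timestamp, and the time between TAB events as time spent on that tab, are both assumptions.

```python
from collections import defaultdict

# (timestamp_ms, tab) pairs taken from four consecutive TAB events in the log.
TAB_EVENTS = [
    (1098725039850, "AV"),
    (1098725054241, "PRESENTATION"),
    (1098725064986, "AV"),
    (1098725083373, "PRESENTATION"),
]
# Timestamp of the next TAB event in the log, which closes the last interval.
END_MS = 1098725084534

def dwell_times(tab_events, end_ms):
    """Total milliseconds each tab stayed selected."""
    totals = defaultdict(int)
    stops = [t for t, _ in tab_events[1:]] + [end_ms]
    for (start, tab), stop in zip(tab_events, stops):
        totals[tab] += stop - start
    return dict(totals)

for tab, ms in dwell_times(TAB_EVENTS, END_MS).items():
    print(f"{tab}: {ms / 1000:.1f} s")
```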
Experiments
Testing hypotheses…
Experiments
Test hypotheses in your design
Generally quantitative, experimental, with end users.
See 14.2.2
Types of Variables
Independent
– What you’re studying, what you intentionally vary (e.g., interface feature, interaction device, selection technique, design)
Dependent
– Performance measures you record or examine (e.g., time, number of errors)
“Controlling” Variables
Prevent a variable from affecting the results in any systematic way
Methods of controlling for a variable:
– Don’t allow it to vary
   e.g., all males
– Allow it to vary randomly
   e.g., randomly assign participants to different groups
– Counterbalance - systematically vary it
   e.g., equal number of males, females in each group
The appropriate option depends on circumstances
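The last two methods can be combined: shuffle participants within each stratum, then deal them out to groups in turn, so assignment is random but each group ends up with the same mix. This is a minimal sketch; the function name, the participant IDs, and the group labels are all hypothetical.

```python
import random
from collections import defaultdict

def assign_groups(participants, groups=("A", "B"), seed=None):
    """participants: list of (id, stratum). Returns {id: group}.

    Shuffles within each stratum, then alternates groups, so every group
    gets an equal share of each stratum (when the counts divide evenly).
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for pid, stratum in participants:
        by_stratum[stratum].append(pid)
    assignment = {}
    for members in by_stratum.values():
        rng.shuffle(members)
        for i, pid in enumerate(members):
            assignment[pid] = groups[i % len(groups)]
    return assignment

# Hypothetical participant pool: four women (F) and four men (M).
participants = [("P1", "F"), ("P2", "F"), ("P3", "M"), ("P4", "M"),
                ("P5", "F"), ("P6", "F"), ("P7", "M"), ("P8", "M")]
assignment = assign_groups(participants, seed=1)
for pid, stratum in participants:
    print(pid, stratum, assignment[pid])
```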
Hypotheses
What you predict will happen
More specifically, the way you predict the dependent variable (e.g., accuracy) will depend on the independent variable(s)
“Null” hypothesis (H0)
– Stating that there will be no effect
– e.g., “There will be no difference in performance between the two groups”
– Data used to try to disprove this null hypothesis
Example
Do people complete operations faster with a black-and-white display or a color one?
– Independent - display type (color or b/w)
– Dependent - time to complete task (minutes)
– Controlled variables - same number of males and females in each group, no colorblind users
– Hypothesis: Time to complete the task will be shorter for users with color display
– H0: Time_color = Time_b/w
– Note: Within/between design issues
Experimental Designs
Within Subjects Design
– Every participant provides a score for all levels or conditions

      Color      B/W
P1    12 secs.   17 secs.
P2    19 secs.   15 secs.
P3    13 secs.   21 secs.
...
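Because every participant appears in both columns, a within-subjects analysis can work on per-participant differences. The sketch below uses the three rows from the table above and computes a paired t statistic with the standard library; three participants is far too few for a real conclusion, so treat the numbers as a worked illustration only.

```python
from math import sqrt
from statistics import mean, stdev

# (color, b/w) times in seconds, from the within-subjects table.
scores = {"P1": (12, 17), "P2": (19, 15), "P3": (13, 21)}

# Positive difference means the participant was slower with b/w.
diffs = [bw - color for color, bw in scores.values()]
mean_diff = mean(diffs)

# Paired t statistic: mean difference over its standard error.
t_stat = mean_diff / (stdev(diffs) / sqrt(len(diffs)))

print("differences:", diffs)
print(f"mean difference: {mean_diff} s, t = {t_stat:.3f}, df = {len(diffs) - 1}")
```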
Experimental Designs
Between Subjects
– Each participant provides results for only one condition

Color          B/W
P1 12 secs.    P2 17 secs.
P3 19 secs.    P5 15 secs.
P4 13 secs.    P6 21 secs.
...
Within Subjects Designs
More efficient:
– Each subject gives you more data - they complete more “blocks” or “sessions”
More statistical “power”:
– Each person is their own control
Therefore, can require fewer participants
May mean more complicated design to avoid “order effects”
– Participant may learn from first condition
– Fatigue may make second performance worse
– e.g. seeing color then b/w may be different from seeing b/w then color
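A common way to handle order effects is to counterbalance the order of conditions across participants. The sketch below cycles through every possible ordering; the function name and participant IDs are hypothetical.

```python
from itertools import permutations

def counterbalanced_orders(participants, conditions):
    """Give each participant one ordering, cycling through all orderings."""
    orders = list(permutations(conditions))
    return {p: orders[i % len(orders)] for i, p in enumerate(participants)}

schedule = counterbalanced_orders(["P1", "P2", "P3", "P4"], ["color", "b/w"])
for participant, order in schedule.items():
    print(participant, "->", " then ".join(order))
```

With two conditions there are only two orders, so half the participants see each. The number of orderings grows factorially with the number of conditions, so larger designs often use a Latin square instead of full counterbalancing.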
Between Subjects Designs
Fewer order effects
Simpler design & analysis
Easier to recruit participants (only one session, less time)
Less efficient
Defining Performance
Based on the task
Specific, objective measures/metrics
Examples:
– Speed (reaction time, time to complete)
– Accuracy (errors, hits/misses)
– Production (number of files processed)
– Score (number of points earned)
– …others…?
Preference, satisfaction, etc. (e.g., questionnaire responses) are also valid measurements
What about subjects?
How many?
– Book advice: at least 10
– Other advice: 6 subjects per experimental condition
– Real advice: depends on statistics
Relating subjects and experimental conditions
– Within/between subjects design
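To make "depends on statistics" concrete, here is a rough sketch of a sample-size estimate for a between-subjects comparison of two means, using the usual normal-approximation formula n = 2((z_(1-alpha/2) + z_power) * sd / effect)^2 per group. The effect size and standard deviation passed in below are hypothetical; in practice they would come from a pilot study or prior work.

```python
from math import ceil
from statistics import NormalDist

def participants_per_group(effect, sd, alpha=0.05, power=0.80):
    """Approximate participants needed per group to detect a mean
    difference of `effect`, given standard deviation `sd` (two-sided test)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)

# Hypothetical numbers: detect a 5-second difference when SD is 5 seconds.
print(participants_per_group(effect=5, sd=5))
# A larger expected difference needs fewer participants.
print(participants_per_group(effect=10, sd=5))
```

The normal approximation tends to underestimate slightly for small groups, so the result is better read as a lower bound than as an exact requirement.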
Now What…?
Performed initial data inspection
– Removed outliers, have general idea what occurred
Descriptive Statistics
– Totals, Averages, Ranges, etc.
Subgroup Statistics
Statistical Analysis
– T-test and others to determine significance
More in 2 weeks…
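As a preview, the sketch below runs both steps on the six times from the between-subjects table: descriptive statistics per condition, then a two-sample t statistic. It uses Welch's version of the t-test (which does not assume equal variances) as one reasonable choice; it is not necessarily the variant the course will use. The p-value lookup is left out because the standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, variance

# Times in seconds from the between-subjects table (three people per group).
color = [12, 19, 13]
bw = [17, 15, 21]

def describe(xs):
    """Basic descriptive statistics for one condition."""
    return {"n": len(xs), "total": sum(xs), "mean": mean(xs),
            "min": min(xs), "max": max(xs)}

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

print("color:", describe(color))
print("b/w:  ", describe(bw))
t, df = welch_t(color, bw)
print(f"t = {t:.3f}, df = {df:.2f}")
```

The t value is then compared against a t table (or a statistics package) at the resulting degrees of freedom to decide whether the null hypothesis can be rejected.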
Feeding Back Into Design
What were the conclusions you reached?
How can you improve on the design?
What are quantitative benefits of the redesign?
– e.g. 2 minutes saved per transaction, which means 24% increase in production, or $45,000,000 per year in increased profit
What are qualitative, less tangible benefit(s)?
– e.g. workers will be less bored, less tired, and therefore more interested --> better customer service
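The step from "minutes saved" to "percent increase in production" in the quantitative example depends on how long a transaction took to begin with, which the slide does not state. The sketch below shows the arithmetic under the assumption that production is simply transactions per unit time; the 10-minute baseline is hypothetical.

```python
def production_gain(baseline_min, saved_min):
    """Fractional increase in transactions per hour when each one gets shorter."""
    return baseline_min / (baseline_min - saved_min) - 1

def implied_baseline(saved_min, gain):
    """Baseline transaction time at which saving `saved_min` yields `gain`."""
    return saved_min * (1 + gain) / gain

# Hypothetical baseline: a 10-minute transaction shortened by 2 minutes.
print(f"{production_gain(10, 2):.0%} more transactions per hour")

# Working backwards from the slide's figures (2 minutes saved, 24% gain).
print(f"implied baseline: {implied_baseline(2, 0.24):.1f} minutes")
```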
Example: Web Page Structure
Breadth or depth of linking better?
– Condition 1: 8 x 8 x 8
– Condition 2: 16 x 32
– Condition 3: 32 x 16
19 experienced users, 8 search tasks for each condition. Tasks chosen randomly from possible 128.
Results:
– Condition 2 fastest (mean 36 s, SD 16)
– Condition 1 slowest (mean 58 s, SD 23)
Implies breadth preferable to depth, although too many links could hurt performance
Larson & Czerwinski, 1998; see page 447 in ID
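Reading each condition as the number of links per level (an interpretation of the "8 x 8 x 8" notation), a quick calculation shows what the three structures have in common and where they differ.

```python
from math import prod

# Links per level for each condition, read from the slide's notation.
conditions = {
    "Condition 1": (8, 8, 8),
    "Condition 2": (16, 32),
    "Condition 3": (32, 16),
}

for name, fanout in conditions.items():
    print(f"{name}: {len(fanout)} levels, {prod(fanout)} bottom-level pages")
```

Under this reading, all three structures lead to the same number of bottom-level pages, so only depth and breadth vary between conditions.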
Questions:
What are independent variables?
What are dependent variables?
What could be hypothesis?
Between or within subjects?
What was controlled?
What other data could you gather on this topic?
What other experiments could you do on this topic?
Assignment reminder: Due Monday
Group evaluation plan: draft
Expect at least following:
– Usability criteria
– Expected methods
   And which criteria each is evaluating
– A few details for each method
   Tasks you will perform, data you will gather
   Questions you will ask, etc.
Example: add video to IM voice chat?
Compare voice chat with and without video
Plan an experiment:
– Compare message time or difficulty in communicating or frequency…
Consider:
– Tasks
– What data you want to gather
– How you would gather
– What analysis you would do after