Observation & Experiments

Watch, listen, and learn…

Observing Users

Qualitative & quantitative
End users
Experimental or naturalistic

One of the best ways to gather feedback about your interface

Watch, listen and learn as a person interacts with your system

Observation

Direct
– In same room
– Can be intrusive
– Users aware of your presence
– May use one-way mirror to reduce intrusiveness

Indirect
– Video (cameras) or app (software logging) recording
– Reduces intrusiveness, but doesn’t eliminate it
– Gives archival record, but can spend a lot of time reviewing it

Location

Observations may be
– In lab - maybe a specially built usability lab
    – Easier to control
    – Can have user complete set of tasks
– In field
    – Watch their everyday actions
    – More realistic
    – Harder to control other factors

Observation Room

This observation room is equipped with three monitors to view the participant, the participant's monitor, and a composite picture-in-picture.

One-way mirror plus angled glass captures light and isolates sound between rooms.

Comfortable and spacious for three people, but room enough for six seated observers.

Digital mixer for unlimited mixing of input images and recording to VHS, SVHS, or MiniDV recorders.

Other examples: http://www.noldus.com/site/doc200406061

Task Selection

What tasks are people performing?
– Representative and realistic?
– Tasks dealing with specific parts of the interface you want to test?
– Problematic tasks?

Don’t forget to pilot your entire evaluation!!
– A story

Engaging Users in Evaluation

What’s going on in the user’s head?
Use verbal protocol where users describe their thoughts

Qualitative techniques
– Think-aloud - can be very helpful
– Post-hoc verbal protocol - review video
– Critical incident logging - positive & negative
– Structured interviews - good questions
    – “What did you like best/least?”
    – “How would you change..?”

Think Aloud

User describes verbally what s/he is thinking and doing
– What they believe is happening
– Why they take an action
– What they are trying to do

Widely used, popular protocol

Potential problems:
– Can be awkward for participant
– Thinking aloud can modify the way the user performs the task

Cooperative approach

Another technique: Co-discovery learning (constructive interaction)
– Join pairs of participants to work together
– Use think aloud
– Perhaps have one person be semi-expert (coach) and one be novice
– More natural (like conversation) so removes some awkwardness of individual think aloud

Variant: let coach be from design team (cooperative evaluation)

Alternative

What if thinking aloud during session will be too disruptive?

Can use post-event protocol
– User performs session, then watches video afterwards and describes what s/he was thinking
– Sometimes difficult to recall
– Opens the door to interpretation

What if a user gets stuck?

Decide ahead of time what you will do.
– Offer assistance or not? What kind of assistance?

You can ask (in cooperative evaluation)
– “What are you trying to do..?”
– “What made you think..?”
– “How would you like to perform..?”
– “What would make this easier to accomplish..?”
– Maybe offer hints
– This is why cooperative approaches are used

Inputs / Outcomes

Need operational prototype
– Could use Wizard of Oz simulation

What you get out
– “Process” or “how-to” information
– Errors, problems with the interface
– Compare user’s (verbalized) mental model to designer’s intended model

Capturing a Session

Paper & pencil
– Can be slow
– May miss things
– Is definitely cheap and easy

[Figure: sample paper log sheet with a Time row (10:00, 10:03, 10:08, 10:22) and a Task row (Task 1, Task 2, Task 3, …)]

Capturing a Session

Recording (screen, audio and/or video)
– Good for think-aloud
– Multiple cameras may be needed
– Good, rich record of session
– Can be intrusive
– Can be painful to transcribe and analyze

Usability software:
– Morae by TechSmith
– Ovo Studios
– Screencorder and other screen recording applications

Capturing a Session

Software logging
– Modify software to log user actions
– Can give time-stamped key press or mouse event
– Two problems:
    – May be too low-level, want higher level events
    – Massive amount of data, need analysis tools

Example logs

2303761098721869683|hrichter|1098722080134|MV|START|566
2303761098721869683|hrichter|1098722122205|MV|QUESTION|false|false|false|false|false|false|
2303761098721869683|hrichter|1098724978982|MV|TAB|AGENDA
2303761098721869683|hrichter|1098724981146|MV|TAB|PRESENTATION
2303761098721869683|hrichter|1098724985161|MV|SLIDECHANGE|5
2303761098721869683|hrichter|1098724986904|MV|SEEK|PRESENTATION-A|566|604189|0
2303761098721869683|hrichter|1098724996257|MV|SEEK|PRESENTATION-A|566|604189|604189
2303761098721869683|hrichter|1098724998791|MV|SEEK|PRESENTATION-A|566|604189|604189
2303761098721869683|hrichter|1098725002506|MV|TAB|AGENDA
2303761098721869683|hrichter|1098725003848|MV|SEEK|AGENDA|566|149613|604189
2303761098721869683|hrichter|1098725005981|MV|TAB|PRESENTATION
2303761098721869683|hrichter|1098725007133|MV|SLIDECHANGE|3
2303761098721869683|hrichter|1098725009326|MV|SEEK|PRESENTATION|566|315796|149613
2303761098721869683|hrichter|1098725011569|MV|PLAY|566|315796
2303761098721869683|hrichter|1098725039850|MV|TAB|AV
2303761098721869683|hrichter|1098725054241|MV|TAB|PRESENTATION
2303761098721869683|hrichter|1098725056053|MV|SLIDECHANGE|2
2303761098721869683|hrichter|1098725057365|MV|SEEK|PRESENTATION|566|271191|315796
2303761098721869683|hrichter|1098725064986|MV|TAB|AV
2303761098721869683|hrichter|1098725083373|MV|TAB|PRESENTATION
2303761098721869683|hrichter|1098725084534|MV|TAB|AGENDA
2303761098721869683|hrichter|1098725085255|MV|TAB|PRESENTATION
2303761098721869683|hrichter|1098725088690|MV|TAB|AV
2303761098721869683|hrichter|1098725130500|MV|TAB|AGENDA
2303761098721869683|hrichter|1098725139643|MV|TAB|AV
2303761098721869683|hrichter|1098726430039|MV|STOP|566|271191
2303761098721869683|hrichter|1098726432482|MV|END
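One way to get from low-level log records to higher-level events is a small script. The sketch below is only an illustration: it assumes each record is pipe-delimited as session | user | timestamp in milliseconds | application | event | arguments, which is a guess from the example log and not a documented format. It counts how often each event occurs and sums the time spent on each tab between consecutive TAB events. The names `parse`, `summarize`, and `LOG` are made up for this example.

```python
from collections import Counter, defaultdict

# A few records taken from the example log, one per line.
LOG = """\
2303761098721869683|hrichter|1098724978982|MV|TAB|AGENDA
2303761098721869683|hrichter|1098724981146|MV|TAB|PRESENTATION
2303761098721869683|hrichter|1098724985161|MV|SLIDECHANGE|5
2303761098721869683|hrichter|1098725002506|MV|TAB|AGENDA
2303761098721869683|hrichter|1098725005981|MV|TAB|PRESENTATION
2303761098721869683|hrichter|1098725039850|MV|TAB|AV
"""

def parse(line):
    # Assumed field layout (not documented in the slides):
    # session | user | timestamp (ms) | application | event | args...
    session, user, ts, app, event, *args = line.split("|")
    return {"user": user, "ts": int(ts), "event": event, "args": args}

def summarize(log_text):
    records = [parse(line) for line in log_text.splitlines() if line.strip()]

    # Frequency of each event type.
    counts = Counter(r["event"] for r in records)

    # Dwell time per tab: time from one TAB event until the next TAB event.
    # The last tab has no following TAB event, so it gets no dwell time here.
    tabs = [r for r in records if r["event"] == "TAB"]
    dwell = defaultdict(int)
    for current, following in zip(tabs, tabs[1:]):
        dwell[current["args"][0]] += following["ts"] - current["ts"]

    return counts, dict(dwell)

counts, dwell_ms = summarize(LOG)
for tab, ms in dwell_ms.items():
    print(f"{tab}: {ms / 1000:.1f} s")
```

Summaries like these feed directly into performance-based analysis (frequency and timing of actions) without reading the raw log by hand.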

Analysis

Many approaches

Task based
– How do users approach the problem
– What problems do users have
– Need not be exhaustive, look for interesting cases

Performance based
– Frequency and timing of actions, errors, task completion, etc.

Can be very time consuming!!

Experiments

Testing hypotheses…

Experiments

Test hypotheses in your design

Generally quantitative, experimental, with end users.

See 14.2.2

Types of Variables

Independent
– What you’re studying, what you intentionally vary (e.g., interface feature, interaction device, selection technique, design)

Dependent
– Performance measures you record or examine (e.g., time, number of errors)

“Controlling” Variables

Prevent a variable from affecting the results in any systematic way

Methods of controlling for a variable:
– Don’t allow it to vary
    – e.g., all males
– Allow it to vary randomly
    – e.g., randomly assign participants to different groups
– Counterbalance - systematically vary it
    – e.g., equal number of males, females in each group

The appropriate option depends on circumstances
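Random assignment and counterbalancing can be combined. The sketch below is one possible approach, not a prescribed procedure: participants are shuffled within each stratum (here, sex) and then dealt out to the groups in turn, so each group ends up with the same number of males and females. The function name `assign_groups`, the participant IDs, and the group labels are hypothetical.

```python
import random
from collections import defaultdict

def assign_groups(participants, groups=("color", "b/w"), seed=None):
    """Randomly assign participants to groups, balanced within each stratum.

    participants: list of (participant_id, stratum) pairs, e.g. ("P1", "F").
    """
    rng = random.Random(seed)

    # Collect participant IDs by stratum (the variable being counterbalanced).
    by_stratum = defaultdict(list)
    for pid, stratum in participants:
        by_stratum[stratum].append(pid)

    assignment = {}
    for ids in by_stratum.values():
        rng.shuffle(ids)  # random order within the stratum
        for i, pid in enumerate(ids):
            assignment[pid] = groups[i % len(groups)]  # deal out in turn
    return assignment

# Hypothetical participant pool: four females, four males.
participants = [("P1", "F"), ("P2", "F"), ("P3", "F"), ("P4", "F"),
                ("P5", "M"), ("P6", "M"), ("P7", "M"), ("P8", "M")]
assignment = assign_groups(participants, seed=1)
print(assignment)
```

Balance is only exact when each stratum's size is a multiple of the number of groups; otherwise group sizes differ by at most one per stratum.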

Hypotheses

What you predict will happen
More specifically, the way you predict the dependent variable (e.g., accuracy) will depend on the independent variable(s)

“Null” hypothesis (H0)
– Stating that there will be no effect
– e.g., “There will be no difference in performance between the two groups”
– Data used to try to disprove this null hypothesis

Example

Do people complete operations faster with a black-and-white display or a color one?

– Independent - display type (color or b/w)
– Dependent - time to complete task (minutes)
– Controlled variables - same number of males and females in each group, no colorblind users
– Hypothesis: Time to complete the task will be shorter for users with color display
– H0: Time_color = Time_b/w
– Note: Within/between design issues

Experimental Designs

Within Subjects Design
– Every participant provides a score for all levels or conditions

      Color       B/W
P1    12 secs.    17 secs.
P2    19 secs.    15 secs.
P3    13 secs.    21 secs.
...

Experimental Designs

Between Subjects
– Each participant provides results for only one condition

Color           B/W
P1  12 secs.    P2  17 secs.
P3  19 secs.    P5  15 secs.
P4  13 secs.    P6  21 secs.
...

Within Subjects Designs

More efficient:
– Each subject gives you more data - they complete more “blocks” or “sessions”

More statistical “power”:
– Each person is their own control

Therefore, can require fewer participants

May mean more complicated design to avoid “order effects”
– Participant may learn from first condition
– Fatigue may make second performance worse
– e.g. seeing color then b/w may be different from seeing b/w then color
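A common way to handle order effects is to counterbalance the order of conditions across participants. The sketch below cycles through every possible order so that, for the color vs. b/w example, half the participants see color first and half see b/w first. It is a minimal illustration; the function name `counterbalanced_orders` and the participant IDs are hypothetical, and using all permutations only stays practical for a small number of conditions.

```python
from itertools import permutations

def counterbalanced_orders(participants, conditions):
    """Give each participant one ordering of the conditions, cycling
    through all possible orderings so each is used equally often
    (exactly equal when the counts divide evenly)."""
    orders = list(permutations(conditions))
    return {p: orders[i % len(orders)] for i, p in enumerate(participants)}

schedule = counterbalanced_orders(["P1", "P2", "P3", "P4"], ["color", "b/w"])
for participant, order in schedule.items():
    print(participant, "->", " then ".join(order))
```

With the order recorded per participant, it can also be checked later whether the first or second condition was systematically faster.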

Between Subjects Designs

Fewer order effects
Simpler design & analysis
Easier to recruit participants (only one session, less time)
Less efficient

Defining Performance

Based on the task
Specific, objective measures/metrics
Examples:
– Speed (reaction time, time to complete)
– Accuracy (errors, hits/misses)
– Production (number of files processed)
– Score (number of points earned)
– …others…?

Preference, satisfaction, etc. (e.g., questionnaire responses) are also valid measurements

What about subjects?

How many?
– Book advice: at least 10
– Other advice: 6 subjects per experimental condition
– Real advice: depends on statistics

Relating subjects and experimental conditions
– Within/between subjects design

Now What…?

Performed initial data inspection
– Removed outliers, have general idea what occurred

Descriptive Statistics
– Totals, Averages, Ranges, etc.

Subgroup Statistics

Statistical Analysis
– T-test and others to determine significance

More in 2 weeks…
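As a preview, the sketch below computes descriptive statistics and a t statistic for the color vs. b/w timings shown in the experimental design tables. It uses only the three illustrative rows from those tables, which is far too few for a real conclusion. Two versions are shown because the design matters: a paired t statistic for a within-subjects design and Welch's t statistic (a substitute that does not assume equal variances) for a between-subjects design. The function names are hypothetical, and turning t into a significance level still requires a t distribution table or a statistics package.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Seconds per task; illustrative values from the design tables only.
color = [12, 19, 13]
bw = [17, 15, 21]

def describe(xs):
    """Basic descriptive statistics for one condition."""
    return {"n": len(xs), "mean": mean(xs), "sd": stdev(xs),
            "range": (min(xs), max(xs))}

def paired_t(a, b):
    """t statistic for a within-subjects design:
    a[i] and b[i] come from the same participant."""
    diffs = [y - x for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

def welch_t(a, b):
    """t statistic for a between-subjects design:
    a and b come from different participants."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(b) - mean(a)) / standard_error

print("color:", describe(color))
print("b/w:  ", describe(bw))
t_within = paired_t(color, bw)
t_between = welch_t(color, bw)
print(f"paired t = {t_within:.3f}, Welch t = {t_between:.3f}")
```

The same six numbers give different t values depending on the design, which is one reason the within/between choice has to be settled before the analysis.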

Feeding Back Into Design

What were the conclusions you reached?
How can you improve on the design?
What are quantitative benefits of the redesign?
– e.g. 2 minutes saved per transaction, which means 24% increase in production, or $45,000,000 per year in increased profit

What are qualitative, less tangible benefit(s)?
– e.g. workers will be less bored, less tired, and therefore more interested --> better customer service

Example: Web Page Structure

Breadth or depth of linking better?
– Condition 1: 8 x 8 x 8
– Condition 2: 16 x 32
– Condition 3: 32 x 16

19 experienced users, 8 search tasks for each condition. Tasks chosen randomly from possible 128.

Results:
– Condition 2 fastest (mean 36 s, SD 16)
– Condition 1 slowest (mean 58 s, SD 23)

Implies breadth preferable to depth, although too many links could hurt performance

Larson & Czerwinski, 1998; see page 447 in ID
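It helps to see what the three conditions hold constant. Reading each condition as the number of links per page at each level (an assumption about the slide's notation), the short calculation below shows that all three structures reach the same number of bottom-level pages and differ only in depth and in how the breadth is distributed.

```python
from math import prod

# Links per page at each level, read from the condition labels (assumed meaning).
conditions = {
    "Condition 1": (8, 8, 8),
    "Condition 2": (16, 32),
    "Condition 3": (32, 16),
}

# For each condition: (number of levels, number of bottom-level pages).
summary = {name: (len(links), prod(links)) for name, links in conditions.items()}

for name, (levels, pages) in summary.items():
    print(f"{name}: {levels} levels, {pages} bottom-level pages")
```

Under this reading the amount of content is the same in every condition, so differences in search time can be attributed to the link structure.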

Questions:

What are independent variables?
What are dependent variables?
What could be hypothesis?
Between or within subjects?
What was controlled?
What other data could you gather on this topic?
What other experiments could you do on this topic?

Assignment reminder: Due Monday

Group evaluation plan: draft

Expect at least the following:
– Usability criteria
– Expected methods
    – And which criteria each is evaluating
– A few details for each method
    – Tasks you will perform, data you will gather
    – Questions you will ask, etc.

Example: add video to IM voice chat?

Compare voice chat with and without video

Plan an experiment:
– Compare message time or difficulty in communicating or frequency…

Consider:
– Tasks
– What data you want to gather
– How you would gather
– What analysis you would do after
