Human-Computer Interaction Lecture 8: Usability evaluation methods
Jan 12, 2016
Different kinds of system evaluation/research
• Analytic/Empirical
  – ‘Analytic’ means reasoning and working by analysis
  – ‘Empirical’ means making observations or measurements
• Formative/Summative
  – Formative research (earlier in a project) evaluates & refines ideas
  – Summative research (later in a project) tests & evaluates systems
• Qualitative/Quantitative
  – Qualitative data involves words (or pictures), and can provide broad / detailed information about a small number of users and their context.
  – Quantitative data involves numbers, and can be used to compare data from larger numbers of users, or measure some specific aspect of their behaviour.
Cognitive Walkthrough (Formative, Analytic, Qualitative)
• Formative: can be used earlier in a project
• Analytic: no measurement or observation
• Qualitative: descriptive, not numerical
From cognitive theory of exploratory learning
• User sets a goal to be accomplished, in terms of the expected system capabilities.
• User searches interface for currently available actions.
• User selects the action that seems likely to make progress toward the goal.
• User performs the selected action and evaluates the feedback given by the system, looking for evidence that progress has been made.
  – The user learns what to do in future by observing what the system does.
Evaluation procedure
• Manually simulate an (imaginary) user carrying out the stages of the model.
  – relies on knowing enough about this person to anticipate their prior knowledge / mental model.
• Evaluators move through the task, telling a story about why the user would choose each action.
• Evaluate the story according to:
  – user’s current goal.
  – accessibility of the correct control.
  – quality of match between label and goal.
  – feedback after the action.
Cognitive walkthrough example
Prerequisites
• Type of user:
  – Wallace and Gromit fan
• Their knowledge:
  – Stereos & media players
  – Windows
• Representative tasks:
  – Play a CD
  – Play an MP3 file
  – Eject a CD
Correct action sequences
• Play a CD: insert CD, wait for drive to read it, press play.
• Play an MP3 file: open menu for further functions, open music library browser, select album and track.
• Eject a CD: open main window, press eject button.
Playing a CD
• Insert CD, wait for drive to read it:
  – Story: user starts player, inserts CD in drive. Tries to press “Play”, but it isn’t available yet. Notices that the CD has not loaded yet. Waits, and CD then loads.
• Goal: start player.
• Accessibility: controls not accessible until CD read.
• Match: OK - controls greyed, status area blank, although no indication of need to wait.
• Feedback: good - after loading, controls are enabled, name of CD and track appears.
Playing a CD
• Press play:
  – Story: user inspects the available controls, notes similarity to familiar audio controls. Presses play.
• Goal: start playing first track on CD.
• Accessibility: play button is very prominent.
• Match: good - looks like a play button in context.
• Feedback: good - button gets recessed, track indicator moves, track counter starts, sound starts.
  – (but may be a problem if the volume is turned right down).
Playing an MP3 file
• Open menu for further functions:
  – Story: user will try various buttons before finding the menu control.
• Goal: find MP3 functions.
• Accessibility: not directly accessible.
• Match: bad - nonstandard menu button.
• Feedback: OK once menu discovered.
  – Tooltips do help, although only if you know you need a menu.
Playing an MP3 file
• Open browser:
  – Goal: choose MP3 track to be played.
  – Match: bad - should the user choose menu item “Select from Master Library” or “Browse Master Library”?
Playing an MP3 file
• Select album and track:
  – Accessibility: OK - tracks are clearly in a hierarchy.
  – Match: good - conventional Windows tree browser.
Ejecting a CD
• Open main window:
  – Story: the user has seen the main window, knows it has an eject button, but doesn’t know how to get there.
• Goal: open main window.
• Accessibility: the button is on the current display.
• Match: ?
• Press eject button …
GOMS (Formative, Analytic, Quantitative)
• Formative: can be used with partial implementation
• Analytic: no measurement or observation
• Quantitative: provides numerical data
GOMS: Goals, Operators, Methods, Selection
• Goals: what is the user trying to do?
• Operators: what actions must they take?
  – Home hands on keyboard or mouse (H)
  – Key press & release (tapping keyboard or mouse button) (K)
  – Point using mouse/lightpen etc. (P)
  – Mental preparation (M)
• Methods: what have they learned in the past?
• Selection: how will they choose what to do?
Aim is to predict speed of interaction
• Which is faster? Make a word bold using:
  a) keys only
  b) the font dialog
Keys-only method
<shift> + → + → + → + → + → + → + → (selecting the seven characters of “The cat”), then <ctrl> + b
Keys-only method
• Mental preparation: M
• Home on keyboard: H
• Mental preparation: M
• Hold down shift: K
• Press →: K (× 7)
• Release shift: K
• Mental preparation: M
• Hold down control: K
• Press b: K
• Release control: K
Keys-only method
• 1 occurrence of H: 0.40 s
• 3 occurrences of M: 1.35 s × 3 = 4.05 s
• 12 occurrences of K: 0.28 s × 12 = 3.36 s
• Total: 7.81 seconds
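The total above can be checked mechanically. This is an illustrative Python sketch (function and variable names are invented, not part of the lecture), using the slide’s Keystroke-Level Model operator times:

```python
# Keystroke-Level Model (KLM) arithmetic for the keys-only method,
# using the operator times from the slide:
# H (home hands) = 0.40 s, M (mental preparation) = 1.35 s,
# K (key press/release) = 0.28 s.

OPERATOR_TIMES = {"H": 0.40, "M": 1.35, "K": 0.28}  # seconds per operator

# The slide's operator sequence: M, H, M, then shift down, 7 arrow
# presses, shift up (9 K), then M, then ctrl down, 'b', ctrl up (3 K).
keys_only = ["M", "H", "M"] + ["K"] * 9 + ["M"] + ["K"] * 3

def klm_total(sequence, times=OPERATOR_TIMES):
    """Sum the predicted time for a sequence of KLM operators."""
    return sum(times[op] for op in sequence)

print(f"Keys-only method: {klm_total(keys_only):.2f} s")  # 7.81 s
```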
Font dialog method
[Screenshot sequence with mouse actions annotated: click, drag; release, move; click, move; release; move, click; move, click]
Motion times from Fitts’ law
• From start of “The” to end of “cat” (t ~ k log(A/W)):
  – distance 110 pixels, target width 26 pixels, t = 0.88 s
• From end of “cat” to the Format item on the menu bar:
  – distance 97 pixels, target width 25 pixels, t = 0.85 s
• Down to the Font item on the Format menu:
  – distance 23 pixels, target width 26 pixels, t = 0.34 s
• To the “bold” entry in the font dialog:
  – distance 268 pixels, target width 16 pixels, t = 1.53 s
• From “bold” to the OK button in the font dialog:
  – distance 305 pixels, target width 20 pixels, t = 1.49 s
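The times above can be reproduced (to within rounding) by the Shannon formulation of Fitts’ law, t = k · log2(A/W + 1), with k ≈ 0.37 s/bit. Note that the specific formulation and the value of k are inferred from the slide’s numbers, not stated on the slide; this Python sketch is illustrative only:

```python
import math

# Fitts'-law pointing time, Shannon formulation: t = k * log2(A/W + 1).
# k ~ 0.37 s/bit is fitted to reproduce the slide's figures; the slide
# itself only writes t ~ k log(A/W).
K = 0.37  # seconds per bit (assumed)

def fitts_time(distance_px, width_px, k=K):
    """Predicted pointing time for a target at the given distance/width."""
    return k * math.log2(distance_px / width_px + 1)

# (distance A, width W) for the five mouse motions on the slide
movements = [(110, 26), (97, 25), (23, 26), (268, 16), (305, 20)]
for a, w in movements:
    print(f"A={a:3d}px, W={w:2d}px -> t = {fitts_time(a, w):.2f} s")
```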
Font dialog method
• Mental preparation: M
• Reach for mouse: H
• Point to “The”: P
• Click: K
• Drag past “cat”: P
• Release: K
• Mental preparation: M
• Point to menu bar: P
• Click: K
• Drag to “Font”: P
• Release: K
• Mental preparation: M
• Move to “bold”: P
• Click: K
• Release: K
• Mental preparation: M
• Move to “OK”: P
• Click: K
Font dialog method
• 1 occurrence of H: 0.40 s
• 4 occurrences of M: 1.35 s × 4 = 5.40 s
• 7 occurrences of K: 0.28 s × 7 = 1.96 s
• 6 mouse motions P: 1.1 + 0.88 + 0.85 + 0.34 + 1.53 + 1.49 = 6.19 s
• Total for dialog method: 13.95 seconds
• Total for keyboard method: 7.81 seconds
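The comparison can be sketched in a few lines of Python (illustrative only; variable names invented). The first pointing motion uses 1.1 s, the figure that appears in the slide’s sum, consistent with the standard KLM average for P:

```python
# Comparing the two methods with KLM operator times from the slides:
# H = 0.40 s, M = 1.35 s, K = 0.28 s, plus the six measured P motions.
H, M, K = 0.40, 1.35, 0.28
pointing = [1.1, 0.88, 0.85, 0.34, 1.53, 1.49]  # six P motions, seconds

dialog_total = 1 * H + 4 * M + 7 * K + sum(pointing)
keyboard_total = 1 * H + 3 * M + 12 * K

print(f"font dialog: {dialog_total:.2f} s")   # 13.95 s
print(f"keys only:   {keyboard_total:.2f} s") # 7.81 s
```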
Interviews and Ethnographic Studies (Formative, Empirical, Qualitative)
• Formative: can be used from start of project
• Empirical: involves observation
• Qualitative: descriptive, not numerical
Structured interviews
• Additional to requirements definition meetings.
• Encourage participation from a range of users.
• Structured in order to:
  – collect data into a common framework
  – ensure all important aspects are covered
• Newman & Lamming’s proposed structure:
  – activities, methods and connections
  – measures, exceptions and domain knowledge
• Semi-structured interviews:
  – ask further questions to probe topics of interest
Observational task analysis
• Less intrusive than interviews
• Potentially more objective
• Inspired huge debate between cognitive and sociological views of HCI: see Lucy Suchman
• Harder work:
  – transcription from video protocol
    • relative duration of sub-tasks
    • transitions between sub-tasks
    • interruptions of tasks
  – alternatively, transcription from audio recording
Ethnographic field studies
• Field observation to understand users and context
• Division of labour and its coordination
• Plans and procedures
  – When do they succeed and fail?
• Where paperwork meets computer work
• Local knowledge and everyday skills
• Spatial and temporal organisation
• Organisational memory
  – How do people learn to do their work?
  – Do formal methods match reality?
• See Beyer & Holtzblatt, Contextual Design
Controlled Experiments (Summative, Empirical, Quantitative)
• Summative: suitable for end of project
• Empirical: involves measurements
• Quantitative: provides numerical data
Controlled experiments
• Based on a number of observations:
  – How long did Fred take to order a CD from Amazon?
  – How many errors did he make?
• But every observation is different.
• So we compare averages:
  – over a number of trials
  – over a range of people (experimental participants)
• Results often have a normal distribution
  (statistics: histograms & distributions)
[Figure: histogram of number of observations vs. time, with an “outlier” marked, annotated with mean and variance, and showing the effect of normalisation / log normalisation]
Experimental treatments
• A treatment is some modification that we expect to have an effect on usability:
  – How long does Fred take to order a CD using this great new interface, compared to the crummy old one?
  – Expected answer: usually faster, but not always
[Figure: overlapping histograms of number of trials vs. time taken to order a CD, with the “new” interface distribution shifted toward faster times than “old”]
Hypothesis testing
• Null hypothesis:
  – What is the probability that this amount of difference in means could be random variation between samples?
  – Hopefully very low (p < 0.01, or 1%)
  – Use a statistical significance test, such as the t-test.
[Figure: probability scale running from “only random variation observed”, through “observed effect probably does result from treatment”, to “very significant effect of treatment”]
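The t-test mentioned above can be sketched with standard-library Python. This is illustrative only, with invented timing data; a real analysis would use a statistics package to convert the statistic into a p-value:

```python
import math
from statistics import mean, variance

# Welch's two-sample t statistic for comparing task times on two
# interfaces. Under the null hypothesis both samples come from the
# same distribution and |t| is small; a large |t| suggests the
# difference in means is unlikely to be random variation.

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)
    return (mean(sample_a) - mean(sample_b)) / math.sqrt(va / na + vb / nb)

old_times = [21.3, 24.1, 19.8, 25.6, 22.0, 23.4]  # seconds (invented data)
new_times = [16.2, 18.9, 15.4, 17.7, 16.8, 19.1]  # seconds (invented data)

t = welch_t(old_times, new_times)
print(f"t = {t:.2f}")  # large |t|: difference unlikely to be chance
```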
Sources of variation
• People differ, so quantitative approaches to HCI must be statistical.
• We must distinguish sources of variation:
  – The effect of the treatment - what we want to measure.
  – Individual differences between subjects (e.g. IQ).
  – Distractions during the trial (e.g. sneezing).
  – Motivation of the subject (e.g. Mondays).
  – Accidental intervention by experimenter (e.g. hints).
  – Other random factors.
• Good experimental design and analysis isolates these.
Effect size – means and error bars
• Difference of two means may be statistically significant (if the sample has low variance), without being very interesting.
  – But mean differences must always be reported with a confidence interval, or plotted with ‘error bars’.
[Figure: two bar charts of (mean) time to order a CD for old vs. new interfaces, with error bars. Experiment A: ‘significant’ but boring. Experiment B: interesting, but treat with caution.]
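A confidence interval for a mean can be computed in a few lines. This is an illustrative sketch with invented data; it uses the normal approximation (1.96 standard errors), whereas for small samples a t-distribution critical value would be more appropriate:

```python
import math
from statistics import mean, stdev

# 95% confidence interval for a mean time, the quantity that should
# accompany any reported difference of means (or be plotted as error
# bars). z = 1.96 is the normal-approximation critical value.

def mean_with_ci(sample, z=1.96):
    """Return (mean, half-width of the 95% confidence interval)."""
    m = mean(sample)
    half_width = z * stdev(sample) / math.sqrt(len(sample))
    return m, half_width

times = [22.1, 19.4, 25.0, 21.7, 23.3, 20.6, 24.2, 22.8]  # s (invented)
m, hw = mean_with_ci(times)
print(f"mean = {m:.2f} s, 95% CI = [{m - hw:.2f}, {m + hw:.2f}]")
```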
Problems with controlled experiments
• Huge variation between people (~200%)
• Mistakes mean huge variation in accuracy (~1000%)
• Improvements are often small (~20%)
• … or even negative (because new & unfamiliar)
• Most people give up on a new product during the learning period anyway, so quantitative measures of ‘expert’ speed and accuracy performance may not be of great commercial interest
  – We don’t care if it’s slow, so long as users like it
  – (and users’ perception of speed is inaccurate anyway)
Surveys and Questionnaires (Self-report measures)
Surveys and questionnaires
• Standardised psychometric instruments can be used
  – to evaluate mental states such as fatigue, stress, confusion
  – to assess individual differences (IQ, introversion …)
• Alternatively, questionnaires can be used to collect subjective or self-report evaluation from users
  – as in market research / opinion polls
  – ‘I like this system’ (and my friend who made it)
  – ‘I found it intuitive’ (and I like my friend)
• This kind of data can be of limited value
  – can be biased, and self-report is often inaccurate anyway
  – it’s hard to design questionnaires to avoid these problems
Questionnaire design
• Open questions …
  – capture richer qualitative information
  – but require a coding frame to structure & compare data
• Closed questions …
  – Yes/No or Likert scale (opinion from 1 to 5)
  – quantitative data is easier to compare, but offers limited insight
• Collecting survey data via interviews gives more insight, but questionnaires are faster
  – can collect data from a larger sample
  – remember to test questionnaires with a pilot study, as it’s easier to get them wrong than with interviews
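Closed Likert-scale questions yield ordinal data, so summaries such as the median and the response distribution are safer than the arithmetic mean. A minimal sketch with invented responses (not part of the lecture):

```python
from statistics import median
from collections import Counter

# Likert responses, 1 = strongly disagree .. 5 = strongly agree.
# Ordinal data: report the median and the distribution of responses
# rather than treating the codes as interval-scale numbers.
responses = [4, 5, 3, 4, 2, 5, 4, 4, 1, 5, 3, 4]  # invented data

print("median:", median(responses))
print("distribution:", sorted(Counter(responses).items()))
```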
Product Field Testing (Summative, Empirical, Qualitative)
• Summative: suitable for end of project
• Empirical: involves observation
• Qualitative: descriptive, not numerical
Product field testing
• Brings the advantages of task analysis/ethnography to the assessment and testing phases of the product cycle.
• Case study: Intuit Inc.’s Quicken product
  – originally based on interviews and observation
  – follow-me-home programme after product release:
    • random selection of shrink-wrap buyers;
    • observation while reading manuals, installing, using.
  – Quicken’s success was attributed to the programme:
    • survived predatory competition from Microsoft Money
    • later valued at $15 billion.
Non-Evaluation
Bad evaluation techniques
• Purely affective reports: 20 subjects answered the question “Do you like this nice new user interface more than that ugly old one?”
  – Apparently empirical/quantitative
  – But probably biased - if friends, or trying to please
• No testing at all: “It was deemed that more colours should be used in order to increase usability.”
  – Apparently formative/analytic
  – But subjective - since the author is the subject
• Introspective reports made by a single subject (often the programmer or project manager): “I find it far more intuitive to do it this way, and the users will too.”
  – Apparently analytic/qualitative
  – Both biased and subjective
Evaluation in Part II projects
Summary of analytic options (analysing your design)
• Cognitive Walkthrough
  – Normally used in formative contexts - if you do have a working system, then why aren’t you observing a real user (far more informative than simulating/imagining one)?
  – But Cognitive Walkthrough can be a valuable time-saving precaution before user studies start, to fix blatant usability bugs
• GOMS
  – unlikely you’ll have alternative detailed UI designs in advance
  – if you have a working system, a controlled observation is superior
• Cognitive Dimensions
  – better suited to less structured tasks than CW & GOMS, which rely on a predefined user goal & task structure
Summary of empirical options (collecting data)
• Interviews/ethnography
  – could be useful in the formative/preparation phase
• Think-aloud / Wizard of Oz
  – valuable for both paper prototypes and working systems
  – can uncover usability bugs if analysed rigorously
• Controlled experiments
  – appear more ‘scientific’, but only:
    • if you can measure the important attributes in a meaningful way
    • if you test significance and report confidence intervals of observed means
• Questionnaires
  – be clear what you are measuring - is self-report accurate?
• Field Testing
  – controlled release (and data collection?) may be possible
• See human participants guidance for empirical methods
Evaluation options for non-interactive systems
• Should your evaluation be analytic or empirical?
  – How consistent / well-structured is your analytic framework?
  – What are you measuring & why? Are the measurements compatible with your claims (validity)?
• Should your evaluation be formative or summative in nature?
  – If formative - couldn’t you finish your project?
  – If summative - are the criteria internal or external?
• Is your data quantitative or qualitative?
  – Descriptive aspects of the system, or engineering performance data?
  – If qualitative, how will you establish objectivity (i.e. that this is not simply your own opinion)?
Evaluating students’ knowledge of HCI
2013/14 votes on course objectives
• Learn interesting stuff about humans
• Prepare for professional life
• See cool toys
• Find an alternative perspective on CS
• Take an opportunity to be more creative
• Get easy marks in the final exam
Options: the course contents
• Lecture 1: Scope of HCI
• Lecture 2: Visual representation
• Lecture 3: Text and gesture interaction
• Lecture 4: Inference-based approaches
• Lecture 5: Augmented and mixed reality
• Lecture 6: Usability of programming languages
• Lecture 7: User-centred design research
• Lecture 8: Usability evaluation methods