Human-Computer Interaction Lecture 8: Usability evaluation methods
Jan 12, 2016
Different kinds of system evaluation/research
• Analytic/Empirical
  – ‘Analytic’ means reasoning and working by analysis
  – ‘Empirical’ means making observations or measurements
• Formative/Summative
  – Formative research (earlier in a project) evaluates & refines ideas
  – Summative research (later in a project) tests & evaluates systems
• Qualitative/Quantitative
  – Qualitative data involves words (or pictures), and can provide broad / detailed information about a small number of users and their context.
  – Quantitative data involves numbers, and can be used to compare data from larger numbers of users, or measure some specific aspect of their behaviour.
Cognitive Walkthrough (Formative, Analytic, Qualitative)
• Formative: can be used earlier in a project
• Analytic: no measurement or observation
• Qualitative: descriptive, not numerical
From cognitive theory of exploratory learning
• User sets a goal to be accomplished, in terms of the expected system capabilities.
• User searches interface for currently available actions.
• User selects the action that seems likely to make progress toward the goal.
• User performs the selected action and evaluates the feedback given by the system, looking for evidence that progress has been made.
  – The user learns what to do in future by observing what the system does.
Evaluation procedure
• Manually simulate an (imaginary) user carrying out the stages of the model.
  – relies on knowing enough about this person to anticipate their prior knowledge / mental model.
• Evaluators move through the task, telling a story about why the user would choose each action.
• Evaluate the story according to:
  – user’s current goal.
  – accessibility of the correct control.
  – quality of match between label and goal.
  – feedback after the action.
Cognitive walkthrough example
Prerequisites
• Type of user:
  – Wallace and Gromit fan
• Their knowledge:
  – Stereos & media players
  – Windows
• Representative tasks:
  – Play a CD
  – Play an MP3 file
  – Eject a CD
Correct action sequences
• Play a CD: insert CD, wait for drive to read it, press play.
• Play an MP3 file: open menu for further functions, open music library browser, select album and track.
• Eject a CD: open main window, press eject button.
Playing a CD
• Insert CD, wait for drive to read it:
  – Story: user starts player, inserts CD in drive. Tries to press “Play”, but it isn’t available yet. Notices that the CD has not loaded yet. Waits, and CD then loads.
• Goal: start player.
• Accessibility: controls not accessible until CD read.
• Match: OK - controls greyed, status area blank, although no indication of need to wait.
• Feedback: good - after loading, controls are enabled, name of CD and track appears.
Playing a CD
• Press play:
  – Story: user inspects the available controls, notes similarity to familiar audio controls. Presses play.
• Goal: start playing first track on CD.
• Accessibility: play button is very prominent.
• Match: good - looks like a play button in context.
• Feedback: good - button gets recessed, track indicator moves, track counter starts, sound starts.
  – (but may be a problem if the volume is turned right down).
Playing an MP3 file
• Open menu for further functions:
  – Story: user will try various buttons before finding the menu control.
• Goal: find MP3 functions.
• Accessibility: not directly accessible.
• Match: bad - nonstandard menu button.
• Feedback: OK once menu discovered.
  – Tooltips do help, although only if you know you need a menu.
Playing an MP3 file
• Open browser:
  – Goal: choose MP3 track to be played.
  – Match: bad - should the user choose menu item “Select from Master Library” or “Browse Master Library”?
Playing an MP3 file
• Select album and track:
  – Accessibility: OK - tracks are clearly in a hierarchy.
  – Match: good - conventional Windows tree browser.
Ejecting a CD
• Open main window:
  – Story: the user has seen the main window, knows it has an eject button, but doesn’t know how to get there.
• Goal: open main window.
• Accessibility: the button is on the current display.
• Match: ?
• Press eject button …
GOMS (Formative, Analytic, Quantitative)
• Formative: can be used with partial implementation
• Analytic: no measurement or observation
• Quantitative: provides numerical data
GOMS: Goals, Operators, Methods, Selection
• Goals: what is the user trying to do?
• Operators: what actions must they take?
  – Home hands on keyboard or mouse (H)
  – Key press & release (tapping keyboard or mouse button) (K)
  – Point using mouse/lightpen etc. (P)
  – Mental preparation (M)
• Methods: what have they learned in the past?
• Selection: how will they choose what to do?
Aim is to predict speed of interaction
• Which is faster? Make a word bold using:
  a) keys only
  b) the font dialog
Keys-only method
<shift> + → + → + → + → + → + → + → (selecting the seven characters of “The cat”), then <ctrl> + b
Keys-only method
• Mental preparation: M
• Home on keyboard: H
• Mental preparation: M
• Hold down shift: K
• Press →: K (× 7)
• Release shift: K
• Mental preparation: M
• Hold down control: K
• Press b: K
• Release control: K
Keys-only method
• 1 occurrence of H: 0.40 s
• 3 occurrences of M: 1.35 s × 3 = 4.05 s
• 12 occurrences of K: 0.28 s × 12 = 3.36 s
• Total: 7.81 seconds
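The total above can be checked mechanically. This is an illustrative Python sketch (function and variable names are invented, not part of the lecture), using the slide’s Keystroke-Level Model operator times:

```python
# Keystroke-Level Model (KLM) arithmetic for the keys-only method,
# using the operator times from the slide:
# H (home hands) = 0.40 s, M (mental preparation) = 1.35 s,
# K (key press/release) = 0.28 s.

OPERATOR_TIMES = {"H": 0.40, "M": 1.35, "K": 0.28}  # seconds per operator

# The slide's operator sequence: M, H, M, then shift down, 7 arrow
# presses, shift up (9 K), then M, then ctrl down, 'b', ctrl up (3 K).
keys_only = ["M", "H", "M"] + ["K"] * 9 + ["M"] + ["K"] * 3

def klm_total(sequence, times=OPERATOR_TIMES):
    """Sum the predicted time for a sequence of KLM operators."""
    return sum(times[op] for op in sequence)

print(f"Keys-only method: {klm_total(keys_only):.2f} s")  # 7.81 s
```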
Font dialog method
[Screenshot sequence with mouse actions annotated: click, drag; release, move; click, move; release; move, click; move, click]
Motion times from Fitts’ law
• From start of “The” to end of “cat” (t ~ k log(A/W)):
  – distance 110 pixels, target width 26 pixels, t = 0.88 s
• From end of “cat” to the Format item on the menu bar:
  – distance 97 pixels, target width 25 pixels, t = 0.85 s
• Down to the Font item on the Format menu:
  – distance 23 pixels, target width 26 pixels, t = 0.34 s
• To the “bold” entry in the font dialog:
  – distance 268 pixels, target width 16 pixels, t = 1.53 s
• From “bold” to the OK button in the font dialog:
  – distance 305 pixels, target width 20 pixels, t = 1.49 s
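The times above can be reproduced (to within rounding) by the Shannon formulation of Fitts’ law, t = k · log2(A/W + 1), with k ≈ 0.37 s/bit. Note that the specific formulation and the value of k are inferred from the slide’s numbers, not stated on the slide; this Python sketch is illustrative only:

```python
import math

# Fitts'-law pointing time, Shannon formulation: t = k * log2(A/W + 1).
# k ~ 0.37 s/bit is fitted to reproduce the slide's figures; the slide
# itself only writes t ~ k log(A/W).
K = 0.37  # seconds per bit (assumed)

def fitts_time(distance_px, width_px, k=K):
    """Predicted pointing time for a target at the given distance/width."""
    return k * math.log2(distance_px / width_px + 1)

# (distance A, width W) for the five mouse motions on the slide
movements = [(110, 26), (97, 25), (23, 26), (268, 16), (305, 20)]
for a, w in movements:
    print(f"A={a:3d}px, W={w:2d}px -> t = {fitts_time(a, w):.2f} s")
```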
Font dialog method
• Mental preparation: M
• Reach for mouse: H
• Point to “The”: P
• Click: K
• Drag past “cat”: P
• Release: K
• Mental preparation: M
• Point to menu bar: P
• Click: K
• Drag to “Font”: P
• Release: K
• Mental preparation: M
• Move to “bold”: P
• Click: K
• Release: K
• Mental preparation: M
• Move to “OK”: P
• Click: K
Font dialog method
• 1 occurrence of H: 0.40 s
• 4 occurrences of M: 1.35 s × 4 = 5.40 s
• 7 occurrences of K: 0.28 s × 7 = 1.96 s
• 6 mouse motions P: 1.1 + 0.88 + 0.85 + 0.34 + 1.53 + 1.49 = 6.19 s
• Total for dialog method: 13.95 seconds
• Total for keyboard method: 7.81 seconds
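The comparison can be sketched in a few lines of Python (illustrative only; variable names invented). The first pointing motion uses 1.1 s, the figure that appears in the slide’s sum, consistent with the standard KLM average for P:

```python
# Comparing the two methods with KLM operator times from the slides:
# H = 0.40 s, M = 1.35 s, K = 0.28 s, plus the six measured P motions.
H, M, K = 0.40, 1.35, 0.28
pointing = [1.1, 0.88, 0.85, 0.34, 1.53, 1.49]  # six P motions, seconds

dialog_total = 1 * H + 4 * M + 7 * K + sum(pointing)
keyboard_total = 1 * H + 3 * M + 12 * K

print(f"font dialog: {dialog_total:.2f} s")   # 13.95 s
print(f"keys only:   {keyboard_total:.2f} s") # 7.81 s
```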
Interviews and Ethnographic Studies (Formative, Empirical, Qualitative)
• Formative: can be used from start of project
• Empirical: involves observation
• Qualitative: descriptive, not numerical
Structured interviews
• Additional to requirements definition meetings.
• Encourage participation from a range of users.
• Structured in order to:
  – collect data into a common framework
  – ensure all important aspects are covered
• Newman & Lamming’s proposed structure:
  – activities, methods and connections
  – measures, exceptions and domain knowledge
• Semi-structured interviews:
  – ask further questions to probe topics of interest
Observational task analysis
• Less intrusive than interviews
• Potentially more objective
• Inspired huge debate between cognitive and sociological views of HCI: see Lucy Suchman
• Harder work:
  – transcription from video protocol
    • relative duration of sub-tasks
    • transitions between sub-tasks
    • interruptions of tasks
  – alternatively, transcription from audio recording
Ethnographic field studies
• Field observation to understand users and context
• Division of labour and its coordination
• Plans and procedures
  – When do they succeed and fail?
• Where paperwork meets computer work
• Local knowledge and everyday skills
• Spatial and temporal organisation
• Organisational memory
  – How do people learn to do their work?
  – Do formal methods match reality?
• See Beyer & Holtzblatt, Contextual Design
Controlled Experiments (Summative, Empirical, Quantitative)
• Summative: suitable for end of project
• Empirical: involves measurements
• Quantitative: provides numerical data
Controlled experiments
• Based on a number of observations:
  – How long did Fred take to order a CD from Amazon?
  – How many errors did he make?
• But every observation is different.
• So we compare averages:
  – over a number of trials
  – over a range of people (experimental participants)
• Results often have a normal distribution
  (statistics: histograms & distributions)
[Figure: histogram of number of observations vs. time, with an “outlier” marked, annotated with mean and variance, and showing the effect of normalisation / log normalisation]
Experimental treatments
• A treatment is some modification that we expect to have an effect on usability:
  – How long does Fred take to order a CD using this great new interface, compared to the crummy old one?
  – Expected answer: usually faster, but not always
[Figure: overlapping histograms of number of trials vs. time taken to order a CD, with the “new” interface distribution shifted toward faster times than “old”]
Hypothesis testing
• Null hypothesis:
  – What is the probability that this amount of difference in means could be random variation between samples?
  – Hopefully very low (p < 0.01, or 1%)
  – Use a statistical significance test, such as the t-test.
[Figure: probability scale running from “only random variation observed”, through “observed effect probably does result from treatment”, to “very significant effect of treatment”]
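The t-test mentioned above can be sketched with standard-library Python. This is illustrative only, with invented timing data; a real analysis would use a statistics package to convert the statistic into a p-value:

```python
import math
from statistics import mean, variance

# Welch's two-sample t statistic for comparing task times on two
# interfaces. Under the null hypothesis both samples come from the
# same distribution and |t| is small; a large |t| suggests the
# difference in means is unlikely to be random variation.

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)
    return (mean(sample_a) - mean(sample_b)) / math.sqrt(va / na + vb / nb)

old_times = [21.3, 24.1, 19.8, 25.6, 22.0, 23.4]  # seconds (invented data)
new_times = [16.2, 18.9, 15.4, 17.7, 16.8, 19.1]  # seconds (invented data)

t = welch_t(old_times, new_times)
print(f"t = {t:.2f}")  # large |t|: difference unlikely to be chance
```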
Sources of variation
• People differ, so quantitative approaches to HCI must be statistical.
• We must distinguish sources of variation:
  – The effect of the treatment - what we want to measure.
  – Individual differences between subjects (e.g. IQ).
  – Distractions during the trial (e.g. sneezing).
  – Motivation of the subject (e.g. Mondays).
  – Accidental intervention by experimenter (e.g. hints).
  – Other random factors.
• Good experimental design and analysis isolates these.
Effect size – means and error bars
• Difference of two means may be statistically significant (if the sample has low variance), without being very interesting.
  – But mean differences must always be reported with a confidence interval, or plotted with ‘error bars’.
[Figure: two bar charts of (mean) time to order a CD for old vs. new interfaces, with error bars. Experiment A: ‘significant’ but boring. Experiment B: interesting, but treat with caution.]
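A confidence interval for a mean can be computed in a few lines. This is an illustrative sketch with invented data; it uses the normal approximation (1.96 standard errors), whereas for small samples a t-distribution critical value would be more appropriate:

```python
import math
from statistics import mean, stdev

# 95% confidence interval for a mean time, the quantity that should
# accompany any reported difference of means (or be plotted as error
# bars). z = 1.96 is the normal-approximation critical value.

def mean_with_ci(sample, z=1.96):
    """Return (mean, half-width of the 95% confidence interval)."""
    m = mean(sample)
    half_width = z * stdev(sample) / math.sqrt(len(sample))
    return m, half_width

times = [22.1, 19.4, 25.0, 21.7, 23.3, 20.6, 24.2, 22.8]  # s (invented)
m, hw = mean_with_ci(times)
print(f"mean = {m:.2f} s, 95% CI = [{m - hw:.2f}, {m + hw:.2f}]")
```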
Problems with controlled experiments
• Huge variation between people (~200%)
• Mistakes mean huge variation in accuracy (~1000%)
• Improvements are often small (~20%)
• … or even negative (because new & unfamiliar)
• Most people give up on a new product during the learning period anyway, so quantitative measures of ‘expert’ speed and accuracy performance may not be of great commercial interest
  – We don’t care if it’s slow, so long as users like it
  – (and users’ perception of speed is inaccurate anyway)
Surveys and Questionnaires (Self-report measures)
Surveys and questionnaires
• Standardised psychometric instruments can be used
  – to evaluate mental states such as fatigue, stress, confusion
  – to assess individual differences (IQ, introversion …)
• Alternatively, questionnaires can be used to collect subjective or self-report evaluation from users
  – as in market research / opinion polls
  – ‘I like this system’ (and my friend who made it)
  – ‘I found it intuitive’ (and I like my friend)
• This kind of data can be of limited value
  – can be biased, and self-report is often inaccurate anyway
  – it’s hard to design questionnaires to avoid these problems
Questionnaire design
• Open questions …
  – capture richer qualitative information
  – but require a coding frame to structure & compare data
• Closed questions …
  – Yes/No or Likert scale (opinion from 1 to 5)
  – quantitative data is easier to compare, but offers limited insight
• Collecting survey data via interviews gives more insight, but questionnaires are faster
  – can collect data from a larger sample
  – remember to test questionnaires with a pilot study, as it’s easier to get them wrong than with interviews
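Closed Likert-scale questions yield ordinal data, so summaries such as the median and the response distribution are safer than the arithmetic mean. A minimal sketch with invented responses (not part of the lecture):

```python
from statistics import median
from collections import Counter

# Likert responses, 1 = strongly disagree .. 5 = strongly agree.
# Ordinal data: report the median and the distribution of responses
# rather than treating the codes as interval-scale numbers.
responses = [4, 5, 3, 4, 2, 5, 4, 4, 1, 5, 3, 4]  # invented data

print("median:", median(responses))
print("distribution:", sorted(Counter(responses).items()))
```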
Product Field Testing (Summative, Empirical, Qualitative)
• Summative: suitable for end of project
• Empirical: involves observation
• Qualitative: descriptive, not numerical
Product field testing
• Brings the advantages of task analysis/ethnography to the assessment and testing phases of the product cycle.
• Case study: Intuit Inc.’s Quicken product
  – originally based on interviews and observation
  – follow-me-home programme after product release:
    • random selection of shrink-wrap buyers;
    • observation while reading manuals, installing, using.
  – Quicken’s success was attributed to the programme:
    • survived predatory competition from Microsoft Money
    • later valued at $15 billion.
Non-Evaluation
Bad evaluation techniques
• Purely affective reports: 20 subjects answered the question “Do you like this nice new user interface more than that ugly old one?”
  – Apparently empirical/quantitative
  – But probably biased - if friends, or trying to please
• No testing at all: “It was deemed that more colours should be used in order to increase usability.”
  – Apparently formative/analytic
  – But subjective - since the author is the subject
• Introspective reports made by a single subject (often the programmer or project manager): “I find it far more intuitive to do it this way, and the users will too.”
  – Apparently analytic/qualitative
  – Both biased and subjective
Evaluation in Part II projects
Summary of analytic options (analysing your design)
• Cognitive Walkthrough
  – Normally used in formative contexts - if you do have a working system, then why aren’t you observing a real user (far more informative than simulating/imagining one)?
  – But Cognitive Walkthrough can be a valuable time-saving precaution before user studies start, to fix blatant usability bugs
• GOMS
  – unlikely you’ll have alternative detailed UI designs in advance
  – if you have a working system, a controlled observation is superior
• Cognitive Dimensions
  – better suited to less structured tasks than CW & GOMS, which rely on a predefined user goal & task structure
Summary of empirical options (collecting data)
• Interviews/ethnography
  – could be useful in the formative/preparation phase
• Think-aloud / Wizard of Oz
  – valuable for both paper prototypes and working systems
  – can uncover usability bugs if analysed rigorously
• Controlled experiments
  – appear more ‘scientific’, but only:
    • if you can measure the important attributes in a meaningful way
    • if you test significance and report confidence intervals of observed means
• Questionnaires
  – be clear what you are measuring - is self-report accurate?
• Field Testing
  – controlled release (and data collection?) may be possible
• See human participants guidance for empirical methods
Evaluation options for non-interactive systems
• Should your evaluation be analytic or empirical?
  – How consistent / well-structured is your analytic framework?
  – What are you measuring & why? Are the measurements compatible with your claims (validity)?
• Should your evaluation be formative or summative in nature?
  – If formative - couldn’t you finish your project?
  – If summative - are the criteria internal or external?
• Is your data quantitative or qualitative?
  – Descriptive aspects of the system, or engineering performance data?
  – If qualitative, how will you establish objectivity (i.e. that this is not simply your own opinion)?
Evaluating students’ knowledge of HCI
2013/14 votes on course objectives
• Learn interesting stuff about humans
• Prepare for professional life
• See cool toys
• Find an alternative perspective on CS
• Take an opportunity to be more creative
• Get easy marks in the final exam
Options: the course contents
• Lecture 1: Scope of HCI
• Lecture 2: Visual representation
• Lecture 3: Text and gesture interaction
• Lecture 4: Inference-based approaches
• Lecture 5: Augmented and mixed reality
• Lecture 6: Usability of programming languages
• Lecture 7: User-centred design research
• Lecture 8: Usability evaluation methods