Evaluation: Controlled Experiments (lecture notes, vda.univie.ac.at/Teaching/HCI/15s/LectureNotes/10_LabStudies.pdf)


Evaluation: Controlled Experiments

Outline

• Evaluation beyond usability tests
• Controlled Experiments
• Other Evaluation Methods
• CHI 2014/2015 Cool stuff: A glimpse into recent HCI research

Evaluation Beyond Usability Tests


Usability Evaluation (last week)

• Expert tests / walkthroughs
• Usability tests with users

• Main goal: formative
– identify usability problems
– improve the tool

Summative Evaluation (focus today)

• How good is it? Is it useful?
• Is it better than other tools?

Formative and Summative: Usually Combined

Evaluation over time: formative → summative

Evaluation goals (summative)


• Generalizability
– Results can be applied to other people

• Precision
– We measured what we wanted to measure (controlling factors that we did not intend to study)

• Realism
– Study context is realistic

... usually trade-off between them!


© McGrath / Carpendale

The selection of a research method depends on the research question and the object under study!

Controlled Experiments


Controlled experiment

• Also known as:
– Laboratory Experiment
– Lab study
– User Study
– A/B Testing (used in marketing)
– …

Focus

• Precision
• Generalizability (?)

• Overall goal
– Reveal cause-effect relationships
– e.g., smoking causes cancer

Scenario


A B

Which is better?

© Carpendale

Test it with users!

Hypothesis

• A precise problem statement
• Example:
– H1 = Participants will buy more beer when using variant B than variant A
– Null hypothesis H0 = no difference in beer purchases


Independent Variables

• Factors to be studied
• Typical independent variables (in HCI):
– Different types of design
– Task type: e.g., searching vs. browsing
– Participant demographics: e.g., male/female
– Different technologies: touch pad vs. keyboard

• Control of the independent variable
– Levels: the number of values (conditions) in each factor
– Limited by the length of the study and the number of participants
• How different?
– Entire interfaces vs. very specific parts


Control Environment

• Make sure nothing else could cause your effect

• Control confounding variables
• Randomization!


Different Designs: Between-Subjects

• Divide the participants into groups; each group does one condition
• Randomize group assignment
• Potential problem? (individual differences between the groups)

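Random between-subjects assignment can be sketched in a few lines of Python; the helper name and participant IDs below are illustrative, not from the lecture. The sketch shuffles the participant list and then deals participants round-robin over the conditions so group sizes stay balanced.

```python
import random

def assign_between_subjects(participants, conditions=("A", "B"), seed=None):
    """Randomly assign each participant to exactly one condition,
    keeping group sizes as equal as possible."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)  # random order removes systematic assignment bias
    # Deal the shuffled participants round-robin over the conditions.
    return {p: conditions[i % len(conditions)] for i, p in enumerate(shuffled)}

assignment = assign_between_subjects([f"P{i}" for i in range(1, 9)], seed=42)
```

With 8 participants and 2 conditions, each participant appears exactly once and each group ends up with 4 members; which participant lands in which group is random.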

Different Designs: Within-Subjects

• Everybody does all the conditions
• Can account for individual differences and reduce noise (that's why it may be more powerful and require fewer participants)
• Severely limits the number of conditions, and even the types of tasks tested (may be able to work around this by having multiple sessions)
• Can lead to ordering effects → randomize the order

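One simple way to randomize condition order in a within-subjects design is to rotate through all permutations of the conditions across participants (a counterbalancing sketch; the function name and data are illustrative):

```python
import itertools
import random

def counterbalanced_orders(participants, conditions=("A", "B"), seed=None):
    """Every participant completes all conditions; the order rotates through
    the possible permutations so ordering effects average out."""
    rng = random.Random(seed)
    orders = list(itertools.permutations(conditions))
    rng.shuffle(orders)  # random starting permutation
    return {p: orders[i % len(orders)] for i, p in enumerate(participants)}

plan = counterbalanced_orders(["P1", "P2", "P3", "P4"], seed=7)
# With two conditions, half the participants start with A, half with B.
```

Note that full permutation counterbalancing only scales to a handful of conditions (k! orderings); for more conditions a Latin square design is the usual workaround.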

Dependent Variable

• The things that you measure
• Performance indicators:
– task completion time, error rates, mouse movement, …
– (number of beers bought)
• Subjective participant feedback:
– satisfaction ratings, closed-ended questions, interviews, …
– questionnaires (HCI lecture last week)
• Observations:
– behaviors, signs of frustration, …

Tasks

• Specifying good tasks for controlled experiments is tricky
– especially if you are measuring performance criteria
• Task criteria:
– comparability across different interfaces
– clear end point
• Example:
– usability test: "buy a book for a 4-year-old"
– controlled experiment: "find and buy the book 'The Gruffalo'"

Results: Application of Statistics

• Descriptive Statistics
– Describe the data you gathered (e.g., visually)
• Inferential Statistics
– Make predictions/inferences from your study to the larger population


Descriptive statistics

• Central tendency
– mean {1, 2, 4, 5} = 3
– median {15, 19, 22, 29, 33, 45, 50} = 29
– mode {12, 15, 22, 22, 22, 34, 34} = 22

• Measures of spread
– range
– variance
– standard deviation

Note: for inferential statistics, N becomes (N−1) in the standard deviation → estimate for the sampled population
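The central-tendency examples above can be checked with Python's standard `statistics` module; the spread sample below is an extra illustrative dataset, not from the slides.

```python
import statistics

# Central tendency (datasets from the slide)
mean_val = statistics.mean([1, 2, 4, 5])                      # 3
median_val = statistics.median([15, 19, 22, 29, 33, 45, 50])  # 29
mode_val = statistics.mode([12, 15, 22, 22, 22, 34, 34])      # 22

# Measures of spread, on an illustrative sample
sample = [2, 4, 4, 4, 5, 5, 7, 9]
range_val = max(sample) - min(sample)   # range: 7
pop_var = statistics.pvariance(sample)  # population variance (divide by N): 4
sample_sd = statistics.stdev(sample)    # sample SD (divide by N-1):
                                        # the estimate for the sampled population
```

The `pvariance`/`variance` and `pstdev`/`stdev` pairs encode exactly the N vs. (N−1) distinction the note above refers to.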

Visualization of descriptive statistics

e.g., a boxplot showing:
• Mean
• 25/75% quartiles
• Min / Max
• (alternative: with outliers)

Inferential statistics

• Goal: Generalize findings to the larger population

http://www.latrobe.edu.au/psy/research/cognitive-and-developmental-psychology/esci

Excursus: Tragedy of the error bars


CI = Confidence intervals

SE = Standard Error (SD of the sampling distribution of the sample mean)

SD = Standard Deviation

Excursus: 95% Confidence intervals

• USE THEM!
• Interpretation: We can be 95% confident that the true mean lies within our confidence interval!
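As a sketch of how such an interval is computed: mean ± critical value × standard error. The version below uses the normal approximation (z ≈ 1.96); for small samples you would use the t distribution instead, and the ratings data is made up for illustration.

```python
import statistics
from statistics import NormalDist

def mean_ci_95(sample):
    """95% confidence interval for the mean (normal approximation)."""
    n = len(sample)
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / n ** 0.5  # SE = sample SD / sqrt(N)
    z = NormalDist().inv_cdf(0.975)           # ~1.96 for a 95% interval
    return mean - z * se, mean + z * se

# Illustrative satisfaction ratings from one condition:
ratings = [4.1, 5.0, 4.4, 4.8, 5.3, 4.6, 4.9, 5.1]
low, high = mean_ci_95(ratings)
```

This also makes the error-bar distinction above concrete: the CI is built from the SE (SD / √N), not from the SD itself.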

Null Hypothesis Testing

• Statistically significant results
– p < .05
– The probability that we incorrectly reject the null hypothesis
• Many different tests
– t-test, ANOVA, …

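The slides name the t-test and ANOVA; as a dependency-free sketch of the same null-hypothesis logic, a permutation test estimates how often a mean difference at least as extreme as the observed one arises when the A/B labels are shuffled, i.e., when H0 (no difference) is true. The beer-purchase counts below are invented for illustration.

```python
import random
import statistics

def permutation_test(group_a, group_b, n_iter=10_000, seed=0):
    """Two-sided permutation test: estimate the probability of a mean
    difference at least as large as the observed one under H0."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)  # relabel participants as if H0 were true
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_iter  # estimated p-value

# Hypothetical beer-purchase counts for variants A and B:
beers_a = [2, 3, 2, 4, 3, 2, 3, 2]
beers_b = [4, 5, 4, 6, 5, 4, 5, 5]
p = permutation_test(beers_a, beers_b)
# p < .05 -> reject H0 (no difference in beer purchases)
```

A t-test answers the same question analytically by assuming normally distributed data; the permutation test trades that assumption for computation.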

Validity

• Is there a causal relationship?
• Errors:
– Type I: false positives
– Type II: false negatives
• Internal Validity
– Are there alternate causes?
• External Validity
– Can we generalize the study?
– E.g., generalizable to the larger population of undergrad students?

[Figure: type I vs. type II errors illustrated as guilty / not guilty verdicts]

Internal Validity: Storks deliver babies!?


• R. Matthews, "Storks Deliver Babies (p = 0.008)", Teaching Statistics, vol. 22, issue 2, pages 36–38, 2000

• There is a correlation coefficient of r=0.62 (reasonably high)

• A statistical test can be employed that shows that this correlation is in fact significant (p = 0.008)

• What are the flaws?
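The reported r = 0.62 comes from the country-level data in the paper; with made-up numbers, Pearson's r can be computed directly, and a high value still proves nothing about causation (a confound such as land area can drive both counts).

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented per-country counts: breeding stork pairs vs. births (thousands)
storks = [100, 300, 500, 9000, 5000, 140]
births = [60, 90, 120, 800, 600, 80]
r = pearson_r(storks, births)
# r is high here, yet storks do not deliver babies: larger countries
# have more stork habitat AND more people.
```

That is the flaw the slide asks for: correlation (and its significance) establishes association, not a cause-effect relationship, which is exactly why controlled experiments manipulate the independent variable rather than merely observe it.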

Pragmatically… A step-by-step how-to

Experimental Procedure:Typical example

• Identify the research hypothesis
• Specify the design of the study
• Think about statistics *before* you run the study
• Run a pilot study
• Recruit participants
• Run the actual data collection sessions
• Analyze the data
• Report the results


Run a pilot study

• … to test the study design
• … to test the system
• … to test the study instruments

Recruit participants

• Reflecting the larger population?
– in the best case, yes
– often a pragmatic decision, though
• How many?
– Depends on effect size and study design (power of the experiment)
– Usually 15+ (per group)
– Note: much higher than for a usability test (~5)

Run the actual data collection process

• System and instruments ready?
• Greet participants
• Introduce the purpose of the study and the procedure
– or deliberately don't
– Don't bias: "compare my interface vs. this other interface"
• Get consent of the participants
– ethics!
• Assign participants to a specific experiment condition
– according to the pre-defined randomization method
• Introduction to the system(s) and/or training tasks
• Participants complete the actual tasks
– take measures of the dependent variables
• Participants answer a questionnaire (if any)
• Debriefing session
• Payment (if any)
– monetary, coupons, chocolate

Report the results

• Introduction / motivation
• Study design
• Results
• Discussion
• Conclusions
• References / Appendix

• See, for instance, Saul Greenberg's recommendation:
– http://pages.cpsc.ucalgary.ca/~saul/hci_topics/assignments/controlled_expt/ass1_reports.html

Other Evaluation Methods


Field Studies


• Realism

• Reveal: “a richer understanding by using a more holistic approach” (Carpendale, 08)

Qualitative Methods

• Observation Techniques
– fly-on-the-wall techniques
– interruptions by the observer
• Interview Techniques
– contextual?

Qualitative Methods as “Add-on”

Often a controlled experiment, plus:
• Experimenter observations
• Collecting participants' opinions
• Think-aloud protocol (be careful!)

Helpful for...
• Usability improvement (cf. HCI three weeks ago)
• New insights, explanation of unforeseen results, new questions
• Can help to confirm results

Qualitative Methods as Primary

• Pre-design studies
– Rich understanding of a complex domain
– Problems, challenges, domain language
• During- and post-design studies
– Case studies / field studies

Helpful for...
• holistic understanding

Qualitative Methods as Primary

• In Situ Observations
• Participatory Observations
• Laboratory Observational Studies
• Contextual Interviews
• Focus Groups

Qualitative Challenges

• Sample sizes
– Doing intensive studies with many participants?
– Time? Amount of data produced?
• Subjectivity
– Social relationship with participants?
• Analyzing the data
– Grounded theory
– Open and axial coding

New Ways of Evaluation

• Mechanical Turk (increasingly popular)
• Measuring brain activities
• …

Cool stuff from CHI 2015

• Affordances++
• Fancy Hardware
• Sustainability
• And skin again
• Dance floor
• Socializing with robots
• Cool visualization stuff

Cool stuff from CHI 2014

• Older people
• Pervasive Design
• Understanding human factors
• Visualization

Even more videos from CHI 2014

• Healthcare Studies at the Healthcare Human Factors (HHF) laboratory in Toronto
• https://www.youtube.com/watch?v=WxQLzdLjwp4
• Cool Hardware Stuff
• Sustainability