Evaluation: Controlled Experiments (lecture notes, vda.univie.ac.at/Teaching/HCI/15s/LectureNotes/10_LabStudies.pdf)


Evaluation: Controlled Experiments

Outline

• Evaluation beyond usability tests
• Controlled Experiments
• Other Evaluation Methods
• CHI 2014/2015 Cool stuff: A glimpse into recent HCI research

Evaluation Beyond Usability Tests


Usability Evaluation (last week)

• Expert tests / walkthroughs
• Usability tests with users

• Main goal: formative
– identify usability problems
– improve the tool

Summative Evaluation (focus today)

• How good is it? Is it useful?
• Is it better than other tools?

Formative and Summative: Usually Combined

Evaluation over time: formative → summative

Evaluation goals (summative)


• Generalizability
– Results can be applied to other people

• Precision
– We measured what we wanted to measure (controlling factors that we did not intend to study)

• Realism
– Study context is realistic

... usually trade-off between them!


© McGrath / Carpendale

The selection of a research method depends on the research question and the object under study!

Controlled Experiments


Controlled experiment

• Also known as:
– Laboratory Experiment
– Lab study
– User Study
– A/B Testing (used in marketing)
– …

Focus

• Precision
• Generalizability (?)

• Overall goal
– Reveal cause-effect relationships
– e.g., smoking causes cancer

Scenario


A B

Which is better?

© Carpendale

Test it with users!

Hypothesis

• A precise problem statement
• Example:
– H1 = Participants will buy more beer when using variant B than variant A
– Null hypothesis H0 = no difference in beer purchases


Independent Variables

• Factors to be studied
• Typical independent variables (in HCI):
– Different types of design
– Task type: e.g., searching vs. browsing
– Participant demographics: e.g., male/female
– Different technologies: touch pad vs. keyboard

• Control of the independent variable
– Levels: the number of values (conditions) in each factor
– Limited by the length of the study and the number of participants
• How different?
– Entire interfaces vs. very specific parts


Control Environment

• Make sure nothing else could cause your effect

• Control confounding variables
• Randomization!


Different Designs: Between-Subjects

• Divide the participants into groups; each group does one condition
• Randomize group assignment
• Potential problem? (individual differences between the groups)

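Random between-subjects assignment can be sketched in a few lines of Python; the helper name and participant IDs below are illustrative, not from the lecture. The sketch shuffles the participant list and then deals participants round-robin over the conditions so group sizes stay balanced.

```python
import random

def assign_between_subjects(participants, conditions=("A", "B"), seed=None):
    """Randomly assign each participant to exactly one condition,
    keeping group sizes as equal as possible."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)  # random order removes systematic assignment bias
    # Deal the shuffled participants round-robin over the conditions.
    return {p: conditions[i % len(conditions)] for i, p in enumerate(shuffled)}

assignment = assign_between_subjects([f"P{i}" for i in range(1, 9)], seed=42)
```

With 8 participants and 2 conditions, each participant appears exactly once and each group ends up with 4 members; which participant lands in which group is random.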

Different Designs: Within-Subjects

• Everybody does all the conditions
• Can account for individual differences and reduce noise (that's why it may be more powerful and require fewer participants)
• Severely limits the number of conditions, and even the types of tasks tested (may be able to work around this by having multiple sessions)
• Can lead to ordering effects → randomize the order

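One simple way to randomize condition order in a within-subjects design is to rotate through all permutations of the conditions across participants (a counterbalancing sketch; the function name and data are illustrative):

```python
import itertools
import random

def counterbalanced_orders(participants, conditions=("A", "B"), seed=None):
    """Every participant completes all conditions; the order rotates through
    the possible permutations so ordering effects average out."""
    rng = random.Random(seed)
    orders = list(itertools.permutations(conditions))
    rng.shuffle(orders)  # random starting permutation
    return {p: orders[i % len(orders)] for i, p in enumerate(participants)}

plan = counterbalanced_orders(["P1", "P2", "P3", "P4"], seed=7)
# With two conditions, half the participants start with A, half with B.
```

Note that full permutation counterbalancing only scales to a handful of conditions (k! orderings); for more conditions a Latin square design is the usual workaround.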

Dependent Variable

• The things that you measure
• Performance indicators:
– task completion time, error rates, mouse movement, …
– (number of beers bought)
• Subjective participant feedback:
– satisfaction ratings, closed-ended questions, interviews, …
– questionnaires (HCI lecture last week)
• Observations:
– behaviors, signs of frustration, …

Tasks

• Specifying good tasks for controlled experiments is tricky
– especially if you are measuring performance criteria
• Task criteria:
– comparability across different interfaces
– clear end point
• Example:
– usability test: "buy a book for a 4-year-old"
– controlled experiment: "find and buy the book 'The Gruffalo'"

Results: Application of Statistics

• Descriptive Statistics
– Describe the data you gathered (e.g., visually)
• Inferential Statistics
– Make predictions/inferences from your study to the larger population


Descriptive statistics

• Central tendency
– mean {1, 2, 4, 5} = 3
– median {15, 19, 22, 29, 33, 45, 50} = 29
– mode {12, 15, 22, 22, 22, 34, 34} = 22

• Measures of spread
– range
– variance
– standard deviation

Note: for inferential statistics, N becomes (N−1) in the standard deviation → estimate for the sampled population
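The central-tendency examples above can be checked with Python's standard `statistics` module; the spread sample below is an extra illustrative dataset, not from the slides.

```python
import statistics

# Central tendency (datasets from the slide)
mean_val = statistics.mean([1, 2, 4, 5])                      # 3
median_val = statistics.median([15, 19, 22, 29, 33, 45, 50])  # 29
mode_val = statistics.mode([12, 15, 22, 22, 22, 34, 34])      # 22

# Measures of spread, on an illustrative sample
sample = [2, 4, 4, 4, 5, 5, 7, 9]
range_val = max(sample) - min(sample)   # range: 7
pop_var = statistics.pvariance(sample)  # population variance (divide by N): 4
sample_sd = statistics.stdev(sample)    # sample SD (divide by N-1):
                                        # the estimate for the sampled population
```

The `pvariance`/`variance` and `pstdev`/`stdev` pairs encode exactly the N vs. (N−1) distinction the note above refers to.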

Visualization of descriptive statistics

e.g., a boxplot showing:
• Mean
• 25/75% quartiles
• Min / Max
• (alternative: with outliers)

Inferential statistics

• Goal: Generalize findings to the larger population

http://www.latrobe.edu.au/psy/research/cognitive-and-developmental-psychology/esci

Excursus: Tragedy of the error bars


CI = Confidence intervals

SE = Standard Error (SD of the sampling distribution of the sample mean)

SD = Standard Deviation

Excursus: 95% Confidence intervals

• USE THEM!
• Interpretation: We can be 95% confident that the true mean lies within our confidence interval!
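As a sketch of how such an interval is computed: mean ± critical value × standard error. The version below uses the normal approximation (z ≈ 1.96); for small samples you would use the t distribution instead, and the ratings data is made up for illustration.

```python
import statistics
from statistics import NormalDist

def mean_ci_95(sample):
    """95% confidence interval for the mean (normal approximation)."""
    n = len(sample)
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / n ** 0.5  # SE = sample SD / sqrt(N)
    z = NormalDist().inv_cdf(0.975)           # ~1.96 for a 95% interval
    return mean - z * se, mean + z * se

# Illustrative satisfaction ratings from one condition:
ratings = [4.1, 5.0, 4.4, 4.8, 5.3, 4.6, 4.9, 5.1]
low, high = mean_ci_95(ratings)
```

This also makes the error-bar distinction above concrete: the CI is built from the SE (SD / √N), not from the SD itself.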

Null Hypothesis Testing

• Statistically significant results
– p < .05
– The probability that we incorrectly reject the null hypothesis
• Many different tests
– t-test, ANOVA, …

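The slides name the t-test and ANOVA; as a dependency-free sketch of the same null-hypothesis logic, a permutation test estimates how often a mean difference at least as extreme as the observed one arises when the A/B labels are shuffled, i.e., when H0 (no difference) is true. The beer-purchase counts below are invented for illustration.

```python
import random
import statistics

def permutation_test(group_a, group_b, n_iter=10_000, seed=0):
    """Two-sided permutation test: estimate the probability of a mean
    difference at least as large as the observed one under H0."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)  # relabel participants as if H0 were true
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_iter  # estimated p-value

# Hypothetical beer-purchase counts for variants A and B:
beers_a = [2, 3, 2, 4, 3, 2, 3, 2]
beers_b = [4, 5, 4, 6, 5, 4, 5, 5]
p = permutation_test(beers_a, beers_b)
# p < .05 -> reject H0 (no difference in beer purchases)
```

A t-test answers the same question analytically by assuming normally distributed data; the permutation test trades that assumption for computation.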

Validity

• Is there a causal relationship?
• Errors:
– Type I: false positives
– Type II: false negatives
• Internal Validity
– Are there alternate causes?
• External Validity
– Can we generalize the study?
– E.g., generalizable to the larger population of undergrad students?

[Figure: type I vs. type II errors illustrated as guilty / not guilty verdicts]

Internal Validity: Storks deliver babies!?


• R. Matthews, "Storks Deliver Babies (p = 0.008)", Teaching Statistics, vol. 22, issue 2, pages 36–38, 2000

• There is a correlation coefficient of r=0.62 (reasonably high)

• A statistical test can be employed that shows that this correlation is in fact significant (p = 0.008)

• What are the flaws?
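The reported r = 0.62 comes from the country-level data in the paper; with made-up numbers, Pearson's r can be computed directly, and a high value still proves nothing about causation (a confound such as land area can drive both counts).

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented per-country counts: breeding stork pairs vs. births (thousands)
storks = [100, 300, 500, 9000, 5000, 140]
births = [60, 90, 120, 800, 600, 80]
r = pearson_r(storks, births)
# r is high here, yet storks do not deliver babies: larger countries
# have more stork habitat AND more people.
```

That is the flaw the slide asks for: correlation (and its significance) establishes association, not a cause-effect relationship, which is exactly why controlled experiments manipulate the independent variable rather than merely observe it.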

Pragmatically… A step-by-step how-to

Experimental Procedure:Typical example

• Identify the research hypothesis
• Specify the design of the study
• Think about statistics *before* you run the study
• Run a pilot study
• Recruit participants
• Run the actual data collection sessions
• Analyze the data
• Report the results


Run a pilot study

• … to test the study design
• … to test the system
• … to test the study instruments

Recruit participants

• Reflecting the larger population?
– in the best case, yes
– often a pragmatic decision, though
• How many?
– Depends on effect size and study design (power of the experiment)
– Usually 15+ (per group)
– Note: much higher than for a usability test (~5)

Run the actual data collection process

• System and instruments ready?
• Greet participants
• Introduce the purpose of the study and the procedure
– or deliberately don't
– Don't bias: "compare my interface vs. this other interface"
• Get consent of the participants
– ethics!
• Assign participants to a specific experiment condition
– according to the pre-defined randomization method
• Introduction to the system(s) and/or training tasks
• Participants complete the actual tasks
– take measures of the dependent variables
• Participants answer a questionnaire (if any)
• Debriefing session
• Payment (if any)
– monetary, coupons, chocolate

Report the results

• Introduction / motivation
• Study design
• Results
• Discussion
• Conclusions
• References / Appendix

• See, for instance, Saul Greenberg's recommendation:
– http://pages.cpsc.ucalgary.ca/~saul/hci_topics/assignments/controlled_expt/ass1_reports.html

Other Evaluation Methods


Field Studies


• Realism

• Reveal: “a richer understanding by using a more holistic approach” (Carpendale, 08)

Qualitative Methods

• Observation Techniques
– fly-on-the-wall techniques
– interruptions by the observer
• Interview Techniques
– contextual?

Qualitative Methods as “Add-on”

Often a controlled experiment, plus:
• Experimenter observations
• Collecting participants' opinions
• Think-aloud protocol (be careful!)

Helpful for...
• Usability improvement (cf. HCI three weeks ago)
• New insights, explanation of unforeseen results, new questions
• Can help to confirm results

Qualitative Methods as Primary

• Pre-design studies
– Rich understanding of a complex domain
– Problems, challenges, domain language
• During- and post-design studies
– Case studies / field studies

Helpful for...
• holistic understanding

Qualitative Methods as Primary

• In Situ Observations
• Participatory Observations
• Laboratory Observational Studies
• Contextual Interviews
• Focus Groups

Qualitative Challenges

• Sample sizes
– Doing intensive studies with many participants?
– Time? Amount of data produced?
• Subjectivity
– Social relationship with participants?
• Analyzing the data
– Grounded theory
– Open and axial coding

New Ways of Evaluation

• Mechanical Turk (increasingly popular)
• Measuring brain activities
• …

Cool stuff from CHI 2015

• Affordances++
• Fancy Hardware
• Sustainability
• And skin again
• Dance floor
• Socializing with robots
• Cool visualization stuff

Cool stuff from CHI 2014

• Older people
• Pervasive Design
• Understanding human factors
• Visualization

Even more videos from CHI 2014

• Healthcare Studies at the Healthcare Human Factors (HHF) laboratory in Toronto
• https://www.youtube.com/watch?v=WxQLzdLjwp4
• Cool Hardware Stuff
• Sustainability