Evaluating Undo: Human-Aware Recovery Benchmarks

Aaron Brown
with Leonard Chung, Calvin Ling, and William Kakes

January 2004 ROC Retreat


Slide 2

Recap: ROC Undo

• We have developed & built a ROC Undo Tool
– a recovery tool for human operators
– lets operators take a system back in time to undo damage, while preserving end-user work

• We have evaluated its feasibility via performance and overhead benchmarks

• Now we must answer the key question:
– does Undo-based recovery improve dependability?


Slide 3

Approach: Recovery Benchmarks

• Recovery benchmarks measure the dependability impact of recovery
– behavior of system during recovery period
– speed of recovery

[Figure: performability vs. time — normal behavior until fault/error injection, a performability impact (performance, correctness) during the recovery period, and the recovery time measured until recovery complete]
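A hedged illustration, not code from the talk: the following Python sketch computes the two quantities in the diagram above from sampled benchmark output. The sample format and the 0.99 stability threshold are our assumptions.

```python
def recovery_metrics(samples, fault_time, threshold=0.99):
    """Return (recovery_time, performability_penalty) or (None, None).

    `samples` is a time-ordered list of (time_sec, performance,
    correctness) tuples, each normalized so 1.0 is pre-fault "normal
    behavior".
    """
    # Recovery time: seconds from fault injection until performance and
    # correctness both return to `threshold` of normal and stay there.
    recovered_at = None
    for t, perf, corr in samples:
        if t < fault_time:
            continue
        if perf >= threshold and corr >= threshold:
            if recovered_at is None:
                recovered_at = t
        else:
            recovered_at = None  # relapsed; recovery not yet stable
    if recovered_at is None:
        return None, None        # never recovered within the run

    # Performability impact: area between normal behavior (1.0) and the
    # observed performance*correctness curve over the recovery period.
    penalty, prev_t = 0.0, fault_time
    for t, perf, corr in samples:
        if fault_time <= t <= recovered_at:
            penalty += (1.0 - perf * corr) * (t - prev_t)
            prev_t = t
    return recovered_at - fault_time, penalty
```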


Slide 4

What About the People?

• Existing recovery/dependability benchmarks ignore the human operator
– inappropriate for Undo, where a human drives recovery

• To measure Undo, we need benchmarks that capture human-driven recovery
– by including people in the benchmarking process


Slide 5

Outline

• Introduction

• Methodology
– overview
– faultload development
– managing human subjects

• Evaluation of Undo

• Discussion and conclusions


Slide 6

Methodology

• Combine traditional recovery benchmarks with human user studies
– apply workload and faultload
– measure system behavior during recovery from faults
– run multiple trials with a pool of human subjects acting as system operators

• Benchmark measures system, not humans
– indirectly captures human aspects of recovery
» quality of situational awareness, applicability of tools, usability & error-proneness of recovery procedures


Slide 7

Human-Aware Recovery Benchmarks

• Key components
– workload: reuse performance benchmark
– faultload: survey plus cognitive walkthrough
– metrics: performance, correctness, and availability
– human operators: handle non-self-healing recovery

[Figure: the same performability timeline as Slide 3 — normal behavior, fault/error injection, performability impact, recovery time, recovery complete]
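Purely illustrative: one way a harness might group these four components. The names and types are ours, not the ROC tool's.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class RecoveryBenchmark:
    workload: Callable[[], None]   # reused performance benchmark driver
    faultload: List[str]           # scenarios from survey + walkthrough
    metrics: List[str] = field(
        default_factory=lambda: ["performance", "correctness", "availability"])
    operators: List[str] = field(default_factory=list)  # human subject pool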


Slide 8

Developing the Faultload

• ROC approach combines surveys and cognitive walkthrough
– surveys to establish common failure modes, symptoms, and error-prone administrative tasks
» domain-specific, system-independent
– cognitive walkthrough to translate to system-specific faultload

• Faultload specifies generic errors and events
– provides system-independence, broader applicability
– cognitive walkthrough maps to system-specific faults (see the sketch below)
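A minimal sketch of that generic-to-specific mapping: the faultload names generic errors and events, and a per-system adapter (produced by the cognitive walkthrough) maps each to a concrete injection. All identifiers and injection strings here are made up for illustration.

```python
GENERIC_FAULTLOAD = [
    "filter_config_error",   # misconfigured mail filter
    "failed_upgrade",        # botched software/platform upgrade
    "hardware_failure",      # disk/power/environmental fault
]

# Hypothetical adapter for a sendmail-on-Linux test system.
SENDMAIL_LINUX_ADAPTER = {
    "filter_config_error": "corrupt a rule in /etc/mail/sendmail.cf",
    "failed_upgrade": "install sendmail built without -DMILTER",
    "hardware_failure": "take the mail spool volume offline",
}

def resolve(event: str, adapter: dict) -> str:
    """Translate a generic faultload event into a system-specific fault."""
    return adapter[event]
```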


Slide 9

Example: E-mail Service Faultload

• Web-based survey of e-mail admins
– core questions:
» “Describe any incidents in the past 3 months where data was lost or the service was unavailable.”
» “Describe any administrative tasks you performed in the past 3 months that were particularly challenging.”
– cost: 4 x $50 gift certificates to amazon.com
» raffled off as incentive for participation
– response: 68 respondents from SAGE mailing list


Slide 10

E-mail Survey Results

• Results

[Chart: breakdown of Common Tasks (151 total), Challenging Tasks (68 total), and Lost e-mail problems (12 total) by category — configuration, deployment/upgrade, other — each split into undoable vs. non-undoable; detailed percentages appear on backup Slide 24]

– results dominated by
» configuration errors (e.g., mail filters)
» botched software/platform upgrades
» hardware & environmental failures

– Undo potentially useful for majority of problems


Slide 11

From Survey to Faultload

• Cognitive walkthrough example: SW upgrade
– platform: sendmail on Linux
– task: upgrade from sendmail-8.2.9 to sendmail-8.2.10
– approach:
1. configure/locate existing sendmail-Linux system
2. clone system to test machine (or use virtual machine)
3. attempt upgrade, identifying possible failure points
» benchmarker must understand system to do this
4. simulate failures and select those that match symptom reports from the task survey
– sample result: simulate a failed upgrade that disables spam filtering by omitting the -DMILTER compile-time flag (see the sketch below)
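A hedged sketch of scripting that sample result for repeatable injection. The binary path, install location, and init script are hypothetical; the original walkthrough may well have performed this step by hand.

```python
import shutil
import subprocess

def inject_failed_upgrade():
    """Simulate the failed upgrade: swap in a sendmail binary compiled
    without -DMILTER, silently disabling milter-based spam filtering."""
    shutil.copy2("/opt/faultload/sendmail-nomilter",  # prebuilt, no MILTER
                 "/usr/sbin/sendmail")
    subprocess.run(["/etc/init.d/sendmail", "restart"], check=True)
```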


Slide 12

Human-Aware Recovery Benchmarks

• Key components
– workload: reuse performance benchmark
– faultload: survey plus cognitive walkthrough
– metrics: performance, correctness, and availability
– human operators: handle non-self-healing recovery

[Figure: performability timeline repeated from Slide 3 — normal behavior, fault/error injection, performability impact, recovery time, recovery complete]


Slide 13

Human Subject Protocol

• Benchmarks structured as human trials

• Protocol
– human subject plays the role of system operator
– subjects complete multiple sessions
– in each session (see the sketch below):
» apply workload to test system
» select random scenario and simulate problem
» give human subject 30 minutes to complete recovery

• Results reflect statistical average across subjects
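A sketch of one session under this protocol. The `workload`, `scenario`, `system`, and `recorder` interfaces are assumed for illustration; this is not the authors' harness.

```python
import random
import time

SESSION_LIMIT_SEC = 30 * 60  # 30-minute recovery window

def run_session(subject, system, scenarios, workload, recorder):
    workload.start(system)               # apply workload to test system
    scenario = random.choice(scenarios)  # select random scenario
    scenario.inject(system)              # simulate the problem
    deadline = time.time() + SESSION_LIMIT_SEC
    while time.time() < deadline and not system.recovered():
        recorder.sample(system)          # performance/correctness/availability
        time.sleep(10)
    workload.stop()
    return recorder.results(subject, scenario)
```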


Slide 14

The Variability Challenge

• Must control human variability to get reproducible, meaningful results

• Techniques
– subject pool selection
– screening
– training
– self-comparison (scoring sketched below)
» each subject faces same recovery scenario on all systems
» system’s score determined by fraction of subjects with better recovery behavior
» powerful, but only works for comparison benchmarks
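One way to read the self-comparison scoring rule, sketched under the assumption that each subject's recovery behavior on a system reduces to a single higher-is-better score.

```python
def self_comparison_score(scores_a, scores_b):
    """scores_a[i] and scores_b[i] are subject i's recovery quality on
    systems A and B for the same scenario. Returns the fraction of
    subjects who recovered better on A than on B."""
    wins = sum(1 for a, b in zip(scores_a, scores_b) if a > b)
    return wins / len(scores_a)
```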


Slide 15

Outline

• Introduction

• Methodology

• Evaluation of Undo
– setup
– per-subject results
– aggregate results

• Discussion and conclusions


Slide 16

Evaluating Undo: Setup

• Faultload scenarios
1. SPAM filter configuration error
2. failed e-mail server upgrade
3. simple software crash (undo not useful here)

• Subject pool (after screening)
– 12 UCB Computer Science graduate students

• Self-comparison protocol
– each subject given same scenario in each of 2 sessions
» undo available in first session only
» imposes learning bias against undo, but lowers variability


Slide 17

Sample Single User Result

• Undo significantly improves correctness
– with some (partially-avoidable) availability cost

[Figure: two panels, Without Undo and With Undo — correctness (0–1), SMTP availability (0–1), and IMAP availability (0–1) plotted against time (0–30 minutes), with the failure/recovery period marked]


Slide 18

[Chart: per-session incorrectly-handled messages, failed SMTP connections, and failed IMAP connections (each 0–200) by failure scenario (1 or 2), comparing With Undo (session 1) against Without Undo (session 2), with sessions where Undo was used marked]

Overall Evaluation

• Undo significantly improves correctness
– and reduces variance across operators
– statistically justified, p-value 0.045 (a plausible paired test is sketched below)

• Undo hurts IMAP availability
– several possible workarounds exist

• Overall, Undo has a positive impact on dependability

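The slides report p = 0.045 without naming the test. Because each subject serves as their own control across the two sessions, a paired nonparametric test such as the Wilcoxon signed-rank test is one plausible choice; the numbers below are placeholders, not the study's data.

```python
from scipy.stats import wilcoxon

# Incorrectly-handled messages per subject, paired by subject.
with_undo    = [12, 0, 3, 5, 0, 8, 2, 1, 4]           # placeholder values
without_undo = [150, 40, 90, 60, 75, 120, 55, 80, 95] # placeholder values

stat, p = wilcoxon(with_undo, without_undo)
print(f"Wilcoxon signed-rank: statistic={stat}, p={p:.3f}")
```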


Slide 19

Outline

• Introduction

• Methodology

• Evaluation of Undo

• Discussion and conclusions


Slide 20

Discussion

• Undo-based recovery improves dependability
– reduces incorrectly-handled mail in common failure cases

• More can still be done
– tweaks to Undo implementation will reduce availability impact

• Benchmark methodology is effective at controlling human variability
– self-comparison protocol gives statistically-justified results with 9 subjects (vs 15+ for a random design)


Slide 21

Future Directions: Controlling Cost

• Human subject experiments are still costly
– recruiting and compensating participants
– extra time spent on training, multiple benchmark runs
– extra demands on benchmark infrastructure
– less than a user study, more than a perf. benchmark

• A necessary price to pay!

• Techniques for cost reduction
– best-case results using best-of-breed operator
– remote web-based participation
– avoid human trials: extended cognitive walkthrough


Evaluating Undo: Human-Aware Recovery Benchmarks

• For more info:
– [email protected]
– http://roc.cs.berkeley.edu/
– paper: A. Brown, L. Chung, et al. “Dependability Benchmarking of Human-Assisted Recovery Processes.” Submitted to DSN 2004, June 2004.


Backup Slides


Slide 24

Example: E-mail Service Faultload

• Results of e-mail task survey

Lost E-mail (12 reports):
– Configuration problems: 25%
– Hardware/Env’t: 17%
– Upgrade-related: 17%
– Operator error: 8%
– User error: 8%
– External resource: 8%
– Software error: 8%
– Unknown: 8%

Challenging Tasks (68 total):
– Filter installation: 37%
– Platform change/upgrade: 26%
– Configuration: 13%
– Architecture changes: 7%
– Tool development: 6%
– Other: 6%
– User education: 4%


Slide 25

Full Summary Dataset

[Chart: per-session incorrectly-handled messages (0–250), failed SMTP connections (0–125), and failed IMAP connections (0–30) by failure scenario (1, 2, or 3), comparing Session 1 (undo tool available) against Session 2 (baseline), with markers distinguishing Session 1 runs where Undo was used from those where it was not used or completed]