Carnegie Mellon
Towards Fingerpointing in the Emulab Dynamic Distributed System
Michael P. Kasick, Priya Narasimhan
Carnegie Mellon University
Kevin Atkinson, Jay Lepreau
University of Utah
November 5, 2006
Introduction to Emulab Classic
- University of Utah: Flux Research Group
- Network emulation testbed
- 1300 users
- 430 local nodes
- 740 distributed nodes
- In service for 6 years
Emulab’s Experiments
- Users upload an experiment configuration (NS file)
- Configuration specifies virtual node topology
- Users granted full, exclusive access to nodes
- Nodes automatically redelegated when experiments go idle
Emulab Software Infrastructure
- Off-the-shelf components
  - Database, OS, etc.
- Custom developed components
  - Web interface
  - Testbed setup & management
  - 490,000 lines of code
Error Context & Propagation
- Context distinguishes between errors of the same type
  - Node boot failures across different nodes
  - Node boot failures with different OSes
- Propagation centers focus on relevant errors
  - Nested scripts should propagate the primary error
  - Otherwise parent scripts generate “me-too” errors
  - Secondary (“me-too”) errors add noise
  - Achievable with exceptions (RPC, middleware)
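The propagation point can be illustrated with exceptions. The following is a minimal Python sketch, not Emulab code: `TestbedError`, `boot_node`, `swap_in`, `report`, and the context keys are hypothetical names chosen for illustration. The innermost step raises the primary error with its context, and the parent step lets it pass through rather than replacing it with a generic failure.

```python
class TestbedError(Exception):
    """Hypothetical structured error: a type plus distinguishing context."""

    def __init__(self, error_type, **context):
        super().__init__(error_type)
        self.error_type = error_type
        self.context = context


def boot_node(node, os_image):
    # Innermost step: raise the primary error with its context attached.
    # (For illustration, every boot attempt fails.)
    raise TestbedError("node_boot_failed", node=node, os=os_image)


def swap_in(nodes, os_image):
    # Parent step: no try/except here, so the primary error propagates
    # unchanged instead of being replaced by a generic "me-too" error.
    for node in nodes:
        boot_node(node, os_image)


def report(nodes, os_image):
    # Top level: the caller sees the primary error type and its context.
    try:
        swap_in(nodes, os_image)
    except TestbedError as err:
        return err.error_type, err.context
    return None
```

Calling `report(["pc297"], "cust_os1")` surfaces the boot failure together with the node and OS that distinguish it from other boot failures.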
Research Phase
- Used tblog to identify a set of target errors
  - Goal was not to obtain 100% coverage
  - System functionality is always expanding
  - Small portion of possible errors actually observed
- Drafted error specifications and error types
  - Required significant knowledge of errors and meaning
  - Eliminated error ambiguities
  - Identified relevant error-specific context
Development Phase
- Developed a prototype Perl reporting module
  - Structured error reporting function
  - Error parsers for C++ & TCL language components
- Added reporting hooks for the target errors
  - Problem: Emulab provides no error propagation
  - Nested scripts return success or failure only
  - Fix: severity-level assignment
  - Alternative: tblog post-processing analysis
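Without propagation, every script in the chain logs its own error and the reporting layer has to decide which one matters. A minimal Python sketch of severity-level assignment follows. The two severity levels, the `ErrorReport` fields, and the script names are assumptions for illustration and do not reflect the prototype's actual Perl interface.

```python
from dataclasses import dataclass, field

# Hypothetical severity levels: higher means more likely to be the primary error.
SECONDARY = 1  # "me-too" error logged by a parent script
PRIMARY = 2    # error logged where the failure was first detected


@dataclass
class ErrorReport:
    script: str
    error_type: str
    severity: int
    context: dict = field(default_factory=dict)


def primary_errors(reports):
    """Return the reports at the highest severity seen, in log order."""
    if not reports:
        return []
    top = max(r.severity for r in reports)
    return [r for r in reports if r.severity == top]
```

Given one report from a child script at `PRIMARY` severity and one from its parent at `SECONDARY`, `primary_errors` keeps only the child's report, which filters out the secondary noise without changing how the scripts return.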
Testing & Deployment Phases
- Tested prototype in elabinelab
- Integrated prototype into tblog framework
  - New local analysis engine: tbreport
- Deployed on the production Emulab testbed
  - 750 lines of added or changed code
Initial Results
- Data collected August 16–24, 2006
- 681 swap-* sessions started
  - 118 (17.3%) reported at least one error
- 283 total fatal errors reported
  - Many errors repeated for each node in a session
  - 118 unique instances of errors in a given session
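The normalization step can be sketched as follows, assuming "unique in a session" means one instance per (session, error type) pair. The tuple layout `(session, error_type, node)` is a hypothetical record format for illustration; it is not tblog's schema.

```python
from collections import Counter


def normalize(errors):
    """Collapse repeated errors to one instance per (session, error type)."""
    return {(session, error_type) for session, error_type, _node in errors}


def count_by_type(errors):
    """Count normalized error instances for each error type."""
    return Counter(error_type for _session, error_type in normalize(errors))
```

Under this assumption, a session in which the same boot failure is reported for several nodes contributes a single instance to the per-type counts.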
[Figure: Normalized errors (unique in a session) grouped by error type.]
Node Shortage Failures
- Second most common error (20.3%)
- Insufficient free nodes for experiment swap-in
- Current node availability is listed on website
- Illustrates user demand
  - 48% due to lack of pc3000
Other Resource Shortage Failures
- Most common error (26.3%)
- Sufficient free nodes to swap-in, but:
  - Attempted assignment violated mapping constraints
  - Often due to oversubscribed switch bandwidth
- Assignment algorithm is non-deterministic
  - User cannot predict when these errors might occur
  - Later attempts may succeed w/o topology change
  - Frequent resubmissions lead to further errors
Node-Boot Failures
- Third most common error (18.6%)
- Node status daemon
  - Reports boot success
  - Timeout results in error
- Many underlying causes
  - Faulty hardware, broken user-contributed OS, etc.
- Motivating scenario for our future research
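The timeout behavior can be pictured with a small polling loop. This is a hypothetical Python sketch, not the Emulab node status daemon: `wait_for_boot`, `has_reported`, and the parameters are illustrative names. The result only says whether a success report arrived in time; it carries no information about why it did not, which is why many underlying causes look identical in the error log.

```python
import time


def wait_for_boot(has_reported, timeout_s, poll_s=0.01,
                  clock=time.monotonic, sleep=time.sleep):
    """Return True if has_reported() becomes true before the timeout, else False."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if has_reported():
            return True
        sleep(poll_s)
    # Timed out: the caller knows only that no success report arrived.
    return False
```

A `False` result here would be recorded as a node-boot failure regardless of whether the hardware, the OS image, or something else was at fault.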
Node-Boot Failure Example (I)
[Figure: diagram labeled pc297 and cust_os1]
- Single node, one session
- Unknown culprit
Node-Boot Failure Example (II)
[Figure: diagram labeled pc297, pc301, and cust_os1]
- Two nodes, two sessions, same OS
- Suggests bad OS
Node-Boot Failure Example (III)
[Figure: diagram labeled pc297, pc301, cust_os1, and cust_os2]
- Same two nodes, four sessions, different OS
- Strongly suggests bad OS
Node-Boot Failures: What’s Next?
- Cannot diagnose root cause from a single trace
- Operator dilemma:
  - Assume node is faulty and quarantine?
  - Assume OS is faulty and leave node as is?
- Motivates global fingerpointing (future work)
  - Correlation of multiple error instances
  - Reliably fingerpoints the culprit
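The idea of correlating multiple error instances can be illustrated with a toy heuristic. This is only a sketch under stated assumptions, not the global analysis engine, which the slides describe as future work. The function name `suspects`, the `(node, os_image, booted)` record layout, and the rule itself (flag a component that failed with at least two distinct counterparts and never booted successfully) are all assumptions made for illustration.

```python
from collections import defaultdict


def suspects(sessions, min_partners=2):
    """sessions: iterable of (node, os_image, booted) tuples.

    Flags a node or OS image when it failed with at least `min_partners`
    distinct counterparts and never booted successfully.
    """
    failed_with = defaultdict(set)
    succeeded = set()
    for node, os_image, booted in sessions:
        pairs = ((("node", node), os_image), (("os", os_image), node))
        for component, partner in pairs:
            if booted:
                succeeded.add(component)
            else:
                failed_with[component].add(partner)
    return sorted(
        component
        for component, partners in failed_with.items()
        if len(partners) >= min_partners and component not in succeeded
    )
```

Applied to the three examples, a single failure of cust_os1 on pc297 flags nothing, while failures of cust_os1 on both pc297 and pc301 flag the OS image. If the cust_os2 sessions on the same nodes booted successfully (the slide text does not say so explicitly), the nodes are additionally exonerated and cust_os1 remains the only suspect.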
Summary
- Manual diagnosis of system errors is costly
- tblog-style analysis aids in message filtering
- Opaque failure messages limit error usefulness
- Structured error reports enable global analysis
- Global analysis fingerpoints errors with fine granularity
- Future work:
  - Develop a global analysis engine for Emulab
  - Start by targeting the identified node-boot failure scenario
  - Target other real-world systems for error analysis
Further Reading
- Michael P. Kasick, Priya Narasimhan, Kevin Atkinson, and Jay Lepreau. Towards fingerpointing in the Emulab dynamic distributed system. In Proceedings of the 3rd USENIX Workshop on Real, Large Distributed Systems (WORLDS ’06), Seattle, WA, November 2006.
- Brian White, Jay Lepreau, Leigh Stoller, Robert Ricci, Shashi Guruprasad, Mac Newbold, Mike Hibler, Chad Barb, and Abhijeet Joglekar. An integrated experimental environment for distributed systems and networks. In Proceedings of the Fifth Symposium on Operating Systems Design and Implementation (OSDI ’02), pages 255–270, Boston, MA, December 2002.