Slide 1/40
Probabilistic Fault Detection and Diagnosis in Large-Scale Distributed Applications
Ignacio Laguna
PhD Final Examination
Major Professor: Prof. Saurabh Bagchi
Committee Members: Prof. Samuel Midkiff, Prof. Y. Charlie Hu, Martin Schulz (Lawrence Livermore National Laboratory)
Nov 27, 2012
Slide 2/40
Bugs Cause Millions of Dollars in Losses in Minutes
Amazon failure took ~6 hours to fix
Need for automatic problem-determination techniques to reduce diagnosis time
Slide 3/40
Failures in Large-Scale Applications are More Frequent
The more components, the higher the failure rate
• Bugs come from many components: application, libraries, OS & runtime system
• Multiple manifestations: hang, crash, silent data corruption, application runs slower than usual
• Faults come from: hardware, software, network
Slide 4/40
Debuggers Need to Handle High Degree of Parallelism
• 100 million cores expected in exascale HPC applications (by 2020)
  – 100 million different threads or processes executing simultaneously
• Most current parallel debuggers scale poorly
  – Bottleneck in handling data from many parallel processes
  – Data is analyzed at a central point (rather than distributed)
  – They generate too much data to analyze
Slide 5/40
Problems of Current Diagnosis/Debugging Techniques
• Poor scalability
  – Inability to handle a large number of processes
  – Generate too much data to analyze
  – Analysis is centralized rather than distributed
  – Offline rather than online
  Examples: FlowChecker (SC’09), DMTracker (SC’07), A. Vo (PACT'11)
• Problem determination is not automatic
  – Old breakpoint-based debugging (> 30 years old)
  – Too much human intervention
  – Requires a large amount of domain knowledge
  Examples: TotalView®, DDT®, GDB, D3S (NSDI’08), model checking (CrystalBall – NSDI’09)
Slide 6/40
Focus of My Dissertation
Failure Detection → Diagnosis → Recovery
▪ Failure detection: detect that a problem exists
▪ Diagnosis: root-cause analysis, pinpoint the faulty component
▪ Recovery: checkpointing, micro-rebooting, redeployment
Prelim Exam: Fault Detection in HPC and Commercial Applications
  Papers: Supercomputing 2011, DSN 2010, Middleware 2009
Final Exam: Problem Localization in HPC and Commercial Applications
  Paper: PACT 2012
Slide 7/40
Remaining Agenda
Problem Localization in Distributed Applications
  – Scientific Applications (MPI, OpenMP)
  – Commercial Applications (Big Data, Java)
Related Work and Conclusions

Next: Scientific Applications (MPI, OpenMP)
Slide 8/40
Some Failures Manifest Only at Large Scale
• Application hangs with 8,000 MPI tasks
• Manifestation is intermittent
• Large amount of time spent on fixing the problem
• Our technique isolated the problem origin in a few seconds
[Figure: MPI tasks 1..n, each tracking its control-flow state (xx, yy, zz, ...)]
• All tasks know the state of the others
• Each task builds a progress-dependence graph locally
• The local progress-dependence graphs are combined with distributed reductions over time
• Reductions are O(log #tasks)
Slide 17/40
Examples of Reduction Operations: Dependence Unions
Task A     Task B     Result
X → Y      X → Y      X → Y (same dependence)
X → Y      Null       X → Y (first dominates)
X → Y      Y → X      Undefined (or Null)

Notation: X → Y means X is progress dependent on Y
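As an illustration of the union rule in the table above, here is a minimal C sketch; the enum encoding and function names are assumptions made for this example, not the dissertation's implementation:

#include <stdio.h>

/* Possible values of a single dependence edge between two fixed tasks X and Y. */
typedef enum {
    DEP_NULL,       /* no dependence known                  */
    DEP_X_ON_Y,     /* X is progress dependent on Y         */
    DEP_Y_ON_X,     /* Y is progress dependent on X         */
    DEP_UNDEFINED   /* contradictory views of the same edge */
} dep_t;

/* Union of two tasks' views of the same dependence edge. */
static dep_t dep_union(dep_t a, dep_t b)
{
    if (a == b)        return a;             /* same dependence         */
    if (a == DEP_NULL) return b;             /* non-null view dominates */
    if (b == DEP_NULL) return a;
    return DEP_UNDEFINED;                    /* X→Y vs. Y→X: undefined  */
}

int main(void)
{
    printf("%d\n", dep_union(DEP_X_ON_Y, DEP_NULL));    /* prints 1 (X is dependent on Y) */
    printf("%d\n", dep_union(DEP_X_ON_Y, DEP_Y_ON_X));  /* prints 3 (undefined)           */
    return 0;
}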
Slide 18/40
Example of Distributed Algorithm to Infer Dependence Graph
[Figure: four MPI tasks (1-4) inferring the progress-dependence graph]
(1) Create dependencies locally: each task i evaluates its dependence i → j with respect to every other task j (most entries are null)
(2) Send only the non-null dependencies in the reduction: 2 → 1, 3 → 1, 3 → 2, 4 → 1
(3) Build the progress-dependence graph from the reduced dependencies
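A minimal sketch of steps (2) and (3) using a user-defined MPI reduction follows; the matrix encoding, the toy local dependencies, and NUM_TASKS are assumptions made for illustration (the actual tool derives the dependencies from each task's Markov model):

#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define NUM_TASKS 4      /* run with: mpirun -np 4 ./a.out */
#define DEP_NULL  0
#define DEP_SET   1      /* entry (i, j): task i is progress dependent on task j */
#define DEP_UNDEF 2

/* Element-wise union of two dependence matrices (the rule from Slide 17). */
static void dep_union_op(void *in, void *inout, int *len, MPI_Datatype *dt)
{
    int *a = (int *)in, *b = (int *)inout;
    for (int k = 0; k < *len; k++) {
        if (a[k] == b[k] || a[k] == DEP_NULL)
            continue;                 /* keep b[k]                     */
        else if (b[k] == DEP_NULL)
            b[k] = a[k];              /* non-null dependence dominates */
        else
            b[k] = DEP_UNDEF;         /* contradictory views           */
    }
    (void)dt;
}

int main(int argc, char **argv)
{
    int rank, local[NUM_TASKS * NUM_TASKS], global[NUM_TASKS * NUM_TASKS];
    MPI_Op op;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* (1) Create dependencies locally.  Toy stand-in: every task other
     * than 0 reports that it is progress dependent on task 0.          */
    memset(local, DEP_NULL, sizeof(local));
    if (rank > 0 && rank < NUM_TASKS)
        local[rank * NUM_TASKS + 0] = DEP_SET;

    /* (2) Reduce only the locally known (non-null) dependencies.       */
    MPI_Op_create(dep_union_op, 1 /* commutative */, &op);
    MPI_Reduce(local, global, NUM_TASKS * NUM_TASKS, MPI_INT, op, 0,
               MPI_COMM_WORLD);

    /* (3) The root builds and prints the progress-dependence graph.    */
    if (rank == 0)
        for (int i = 0; i < NUM_TASKS; i++)
            for (int j = 0; j < NUM_TASKS; j++)
                if (global[i * NUM_TASKS + j] == DEP_SET)
                    printf("%d -> %d\n", i, j);

    MPI_Op_free(&op);
    MPI_Finalize();
    return 0;
}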
Slide 19/40
Progress Dependence Graph of Bug
Hang with ~8,000 MPI tasks on BlueGene/L
[Figure: progress-dependence graph; the least-progressed task [3136] is at the root, with task groups [0, 2048, 3072], [1-2047, 3073-3135, …], [6841-7995], and [6840] depending on it]
Our tool finds that MPI task 3136 is the origin of the hang
• How did it reach its current state?
Slide 20/40
Finding the Faulty Code Region: Program Slicing
[Figure: progress-dependence graph over Tasks 1-4, annotated with each task's current state]

Example code fragment:

done = 1;
for (...) {
    if (event) {
        flag = 1;
    }
}
if (flag == 1) {
    MPI_Recv();
    ...
}
...
if (done == 1) {
    MPI_Barrier();
}
Slide 21/40
Slice with Origin of the Bug
Least-progressed task's state (slice):

dataWritten = 0
for (…) {
    MPI_Probe(…, &flag, …)
    if (flag == 1) {
        MPI_Recv()
        MPI_Send()
        dataWritten = 1
    }
    MPI_Send()
    MPI_Recv()
    // Write data
}
if (dataWritten == 0) {
    MPI_Recv()
    MPI_Send()
}
Reduce()
Barrier()

• A dual condition occurs: a task is a writer and a non-writer at the same time
• MPI_Probe checks the source, tag, and communicator of a message; another writer intercepted the wrong message
• Programmer used unique MPI tags to isolate the different I/O groups
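The tag-matching issue behind this bug can be sketched with the toy program below; the tags, ranks, and values are assumptions for illustration, not the application's code. It shows how giving each logical I/O group its own MPI tag keeps one group's reader from consuming another group's message:

#include <mpi.h>
#include <stdio.h>

#define TAG_GROUP1 100   /* per-group tags: the isolation described above */
#define TAG_GROUP2 200

int main(int argc, char **argv)
{
    int rank, x;
    MPI_Status st;

    MPI_Init(&argc, &argv);              /* run with at least 2 ranks */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        int a = 1, b = 2;
        MPI_Send(&a, 1, MPI_INT, 0, TAG_GROUP1, MPI_COMM_WORLD);
        MPI_Send(&b, 1, MPI_INT, 0, TAG_GROUP2, MPI_COMM_WORLD);
    } else if (rank == 0) {
        /* Buggy pattern: receiving with MPI_ANY_TAG (or one tag shared by
         * both groups) lets a group-1 reader consume group-2's message.  */
        /* MPI_Recv(&x, 1, MPI_INT, 1, MPI_ANY_TAG, MPI_COMM_WORLD, &st); */

        /* Correct pattern: each reader matches only its own group's tag. */
        MPI_Recv(&x, 1, MPI_INT, 1, TAG_GROUP1, MPI_COMM_WORLD, &st);
        printf("group-1 reader got %d (tag %d)\n", x, st.MPI_TAG);
        MPI_Recv(&x, 1, MPI_INT, 1, TAG_GROUP2, MPI_COMM_WORLD, &st);
        printf("group-2 reader got %d (tag %d)\n", x, st.MPI_TAG);
    }

    MPI_Finalize();
    return 0;
}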
Slide 22/40
Controlled Evaluation
• Used two Sequoia benchmarks (AMG, LAMMPS) and six NAS Parallel benchmarks
• Faults injected in the two Sequoia benchmarks:
  – AMG-2006 and LAMMPS
  – Injected a hang in random MPI tasks
  – Only injected in executed functions (MPI and user functions)
• Performed slowdown and memory-usage evaluation on all benchmarks
Slide 23/40
Accurate Detection of Least-Progress Tasks
• Least-progressed (LP) task detection recall:
  – Fraction of cases in which the LP task is detected correctly
• Imprecision:
  – Percentage of extra tasks in the reported LP-task set

Example runs: 64 tasks, fault injected in task 3
  – Example 1: the tool reports LP-task set [3]; LP task detected, imprecision = 0
  – Example 2: the tool reports LP-task set [3, 5, 4]; LP task detected, imprecision = 2/3 (two of the three reported tasks are extra)

• Overall results:
  – Average LP-task detection recall is 88%
  – 86% of injections have an imprecision of zero
Slide 24/40
Performance Results
Least-progressed task detection takes a fraction of a second
[Charts: detection time for AMG-2006 and LAMMPS]
Slide 25/40
Performance Results: Slowdown is Small For a Variety of Benchmarks
• Tested slowdown with the NAS Parallel and Sequoia benchmarks
  – Maximum slowdown of ~1.67x
• Slowdown depends on the number of MPI calls made from different calling contexts
Slide 26/40
Remaining Agenda
Problem Localization in Distributed Applications
  – Scientific Applications (MPI, OpenMP)
  – Commercial Applications (Big Data, Java)
Related Work and Conclusions

Next: Commercial Applications (Big Data, Java) [PACT 2012]
Slide 27/40
Commercial Applications Generate Many Metrics
How can we use these metrics to localize the root cause of problems?
Metric sources:
• Application: request rates, transactions, DB reads/writes, etc.
• Middleware: virtual machine and container statistics
• Operating system: CPU, memory, I/O, and network statistics
• Hardware: CPU performance counters
(Example monitoring tool: Tivoli Operations Manager)
Slide 28/40
Research Objectives
• Look for abnormal time patterns in the metrics
• Pinpoint code regions that are correlated with these abnormal patterns
[Figure: a program divided into code regions on one side and metrics 1-100 on the other; abnormal code blocks are linked to the abnormal metrics]
Slide 29/40
Bugs Cause Metric Correlations to Break
• Hadoop DFS file-descriptor leak in version 0.17 (2008)
• Correlations between metrics are different when the bug manifests itself
  – Metrics: open file descriptors, characters written to disk
[Charts: the two metrics in a normal run vs. a failed run; their correlation is different]
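A minimal sketch of this idea: compute the pairwise (Pearson) correlation of the two metrics in each run and compare. The data values below are made up for illustration; in the normal run the two metrics move together, while in the failed run the file-descriptor count grows on its own:

#include <math.h>
#include <stdio.h>

/* Pearson correlation coefficient of two series of length n. */
static double pearson(const double *x, const double *y, int n)
{
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx  += x[i];        sy  += y[i];
        sxx += x[i] * x[i]; syy += y[i] * y[i];
        sxy += x[i] * y[i];
    }
    double cov = sxy - sx * sy / n;
    double vx  = sxx - sx * sx / n;
    double vy  = syy - sy * sy / n;
    return cov / sqrt(vx * vy);
}

int main(void)
{
    /* Toy samples: open file descriptors vs. characters written to disk. */
    double fds_normal[]   = {10, 12, 11, 13, 12, 14};
    double chars_normal[] = {100, 120, 110, 130, 120, 140};  /* tracks fds         */
    double fds_failed[]   = {10, 20, 35, 52, 70, 95};        /* leak: keeps growing */
    double chars_failed[] = {130, 105, 125, 100, 120, 115};  /* does not follow     */

    printf("normal-run correlation: %.2f\n", pearson(fds_normal, chars_normal, 6));
    printf("failed-run correlation: %.2f\n", pearson(fds_failed, chars_failed, 6));
    return 0;
}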
Slide 30/40
Approach Overview
Pipeline (comparing a failed run against a normal run):
(1) Find abnormal windows → (2) Find abnormal metrics → (3) Find abnormal code regions
Slide 31/40
Selecting Abnormal Window via Nearest-Neighbor (NN)
Traces (one record per sampling interval):
  Normal run:  3, 55, 47, 0.7, …    Faulty run:  2, 54, 45, 0.8, …
               3, 55, 47, 0.7, …                 2, 55, 45, 0.6, …
               …                                 …
▪ Each record is a sample of all metrics
▪ Each record is annotated with the code region that was executing
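A minimal sketch of nearest-neighbor scoring follows; the metric values, dimensionality, and function names are made up for illustration. Each sample of the faulty run is scored by its Euclidean distance to the closest sample of the normal run, and samples far from every normal sample mark the abnormal window:

#include <float.h>
#include <math.h>
#include <stdio.h>

#define NMETRICS 4

/* Euclidean distance between two metric samples. */
static double dist(const double *a, const double *b)
{
    double s = 0;
    for (int m = 0; m < NMETRICS; m++)
        s += (a[m] - b[m]) * (a[m] - b[m]);
    return sqrt(s);
}

/* Distance of one faulty-run sample to its nearest normal-run sample. */
static double nn_score(const double *sample,
                       double normal[][NMETRICS], int nnormal)
{
    double best = DBL_MAX;
    for (int i = 0; i < nnormal; i++) {
        double d = dist(sample, normal[i]);
        if (d < best)
            best = d;
    }
    return best;   /* a large distance marks the sample as abnormal */
}

int main(void)
{
    double normal[][NMETRICS] = {{3, 55, 47, 0.7}, {2, 54, 45, 0.8},
                                 {3, 55, 47, 0.7}};
    double faulty[][NMETRICS] = {{2, 55, 45, 0.6}, {9, 80, 12, 0.1}};

    for (int i = 0; i < 2; i++)
        printf("faulty sample %d: NN distance %.2f\n", i,
               nn_score(faulty[i], normal, 3));
    return 0;
}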
Related work:
• Model checking: C. Killian (NSDI’07); J. Yang, MoDist (NSDI’09); H. Guo (SOSP’11); CMC, M. S. Musuvathi (OSDI’02)
• Failure prediction: I. Cohen (OSDI’04); Tiresias (IPDPS’07); A. Gainaru, prediction in HPC (SC’12)
• Logs and metrics analysis: K. Ozonat (DSN’08); I. Cohen (OSDI’04); P. Bodik (EuroSys’10); K. Nagaraj (NSDI’12)
Slide 38/40
Conclusion
• Fault detection and diagnosis can be scalable
  – Use of "computationally cheap" models
  – Can diagnose problems with 100,000 parallel tasks
  – Slowdown of ~1.7x application run time
• Techniques tested on real-world bugs and fault injections
  – Molecular-dynamics code bug @ LLNL
  – NAS Parallel benchmarks, Sequoia benchmarks
  – Commercial application bugs: Hadoop, HBase, ITAP, and an IBM application
• Diagnosis takes less time than with traditional debuggers
  – Detection of the least-progressed task takes less than a second
  – Code regions where bugs manifest themselves are highlighted
Slide 39/40
Lessons Learned
• Different kinds of machine-learning algorithms are good for different problems
  – Algorithms that are fast in the testing phase are appropriate for HPC
• Finding the right kind of instrumentation is extremely important
  – Too much: not scalable and too much slowdown
  – Too little: not enough data to train the statistical models
• Problem determination at line-of-code granularity is challenging
  – But code-region granularity works well for many failures
Slide 40/40
Thanks to Contributors!
Purdue University: Prof. Saurabh Bagchi, Prof. Samuel Midkiff, Nawanol Theera-Ampornpunt, Fahad A. Arshad
Lawrence Livermore National Laboratory: Bronis R. de Supinski, Greg Bronevetsky, Martin Schulz, Todd Gamblin, Dong Ahn
Slide 41/40
Thank you!
Slide 42/40
Backup Slides
Slide 43/40
Future Work
• Use of more complex dependencies between metrics
  – Non-linear dependencies
• Apply failure-prediction techniques to HPC applications
  – Via analysis of metrics or of system/application logs
• More general strategy for creating a task's state machine (i.e., Markov model)
  – Sampling of user-level functions
• Which metrics are useful in fault detection and diagnosis?
  – Are hardware metrics useful? (e.g., hardware counters)
• Handling failures in HPC systems from the application
  – Instead of killing all the processes, let the application continue with the healthy processes
Slide 44/40
What if we have different Markov Models in different tasks?
• First, dependencies are built locally based on local information
• Second, dependence unions (in the distributed reduction) take care of null (or undefined) dependencies
[Figure: a global Markov model with states a, b, c, d, e visited by tasks 1-3; each task only observes the subset of states it executed (e.g., Task 2 sees a, b, c, e)]
  – Dependencies as seen from Task 2: 2 → 1; 2 → 3 is undefined
  – Dependencies as seen from Task 3: 3 → 1; 3 → 2 is undefined
  – Result of the dependence reduction: 2 → 1, 3 → 1
Slide 45/40
Binomial Tree Implementation of MPI_Reduce
[Figure: binomial-tree reduction over tasks 1-8, completing in three iterations; code region 1, code region 2 (MPI_Reduce), code region 3]
• Task 5 blocks inside the MPI_Reduce (code region 2)
• Tasks 1, 6, 7 are progress dependent on task 5
• Tasks 2, 3, 4, 8 move on to the next code region
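For reference, a small sketch of a binomial-tree communication schedule like the one drawn on this slide; the 0-based ranks and printout format are choices made for this example, not MPI's internal implementation:

#include <stdio.h>

#define NTASKS 8

int main(void)
{
    /* Print the send schedule of a binomial-tree reduction over NTASKS
     * ranks: in each iteration, half of the still-active ranks send their
     * partial result to a partner and then stop participating.          */
    for (int step = 1, it = 1; step < NTASKS; step <<= 1, it++) {
        printf("Iteration %d:\n", it);
        for (int rank = 0; rank < NTASKS; rank++)
            if (rank % (2 * step) == step)
                printf("  rank %d sends its partial result to rank %d\n",
                       rank, rank - step);
    }
    return 0;
}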
Slide 46/40
Bug (Case Study)
[Figure: two I/O groups on BlueGene/L, each consisting of reader (R) tasks and a writer (W) task; the same task appears in both groups]
• Bug: dual condition (a task is a reader and a writer for different I/O groups)
• The same message tags are used even in different groups
• On BlueGene/L, compute nodes perform I/O via dedicated I/O nodes (an analogous layout is shown for a Linux cluster)
Slide 47/40
Fault Injection Results for the LAMMPS Application
• 88% of the time the least-progressed task is detected
• Whenever it is not detected, it is still isolated
• 86% of injections have an imprecision of zero
Slide 48/40
Sample Results of the Tool
Slide 49/40
Performance Results: Least-Progressed Task Detection Takes a Fraction of a Second

• Regression-test system for the IBM Full System Simulator (Mambo)
  – Mambo: architecture simulator for systems based on IBM's Power(TM) processors
• Examples of typical failures:
  – Problems with the simulated architecture
  – NFS connection fails intermittently
  – Failed LDAP server authentications
  – /tmp filling up
Focus of the experiments: fault injection
Slide 55/40
Case 3: MHM Results
Abnormal metrics are correlated with the failure origin: the NFS connection
[Figure: abnormal code regions reported by the tool vs. where the problem actually occurs]
• The abnormal code region is selected almost correctly
  – The asynchronous profiling technique causes some inaccuracies