Lawrence Livermore National Laboratory Greg Bronevetsky in collaboration with Ignacio Laguna, Saurabh Bagchi, Bronis R. de Supinski, Dong H. Ahn, and Martin Schulz Statistical Fault Detection and Analysis with AutomaDeD

Jan 02, 2016

Transcript
Page 1:

Lawrence Livermore National Laboratory

Greg Bronevetsky, in collaboration with Ignacio Laguna, Saurabh Bagchi, Bronis R. de Supinski, Dong H. Ahn, and Martin Schulz

Statistical Fault Detection and Analysis with AutomaDeD

Page 2:

LLNL-PRES-424502 Option:Additional Information

Reliability is a Critical Challenge in Large Systems

Need tools to detect faults, identify causes
• Fault tolerance: requires fault detection
• System management: need to know what failed

Faults come from various causes
• Hardware: soft errors, marginal circuits, physical degradation, design bugs
• Software: coding bugs, misconfigurations

Page 3:

In General, Fault Detection and Fault Tolerance Are Undecidable

Option 1: Make all applications fault resilient
• Application-specific solutions hard to design
• Many applications
• How does fault resilience compose?

Option 2: Develop approximate fault detection, tolerate via checkpointing etc.
• Statistically model application behavior
• Look for deviations from model behavior
• Identify model components that likely caused deviation

Page 4:

In General, Fault Detection and Fault Tolerance Are Undecidable

Option 2: Develop approximate fault detection, tolerate via checkpointing etc.
• Statistically model application behavior
• Look for deviations from model behavior
• Identify model components that likely caused deviation

[Figure: application model]

Page 5:

Focus on Modeling Individual MPI Applications

Primary goal is fault detection for HPC applications
• Model behavior of single MPI application
• Detect deviations from norm
• Identify origin of deviation in time/space

Other branches of field
• Model system component interactions
• Model application as dataflow graph of modules
• Model micro-architecture state as vulnerable/non-vulnerable (ACE analysis)

Page 6:

Goal: Detect Unusual Application Behavior, Identify Cause

[Figure: MPI application, with three kinds of comparison]
• Single run, spatial: differences between behavior of processes
• Single run, temporal: differences between one time point and others
• Multiple runs: differences between behavior of runs

Page 7:

Semi-Markov Models

SMM: transition system
• Nodes: application states
• Edges: transitions from one state to another
  • Probability of transition
  • Time spent in prior state before transition

[Figure: example SMM with states A, B, C and D; edges labeled with probability / time: .2 / 5μs, .7 / 15μs, .1 / 500μs]
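A minimal Python sketch of such a transition system, for illustration only. The `SemiMarkovModel` class and its method names are hypothetical and are not AutomaDeD's actual data structures. The choice of A as the source state and the repeated durations are also assumptions, since the figure's edge endpoints cannot be recovered from the transcript.

```python
from collections import defaultdict


class SemiMarkovModel:
    """Transition system whose edges carry a probability and a time distribution."""

    def __init__(self):
        # times[src][dst] -> durations (microseconds) spent in src before moving to dst
        self.times = defaultdict(lambda: defaultdict(list))

    def observe(self, src, dst, duration_us):
        """Record one observed transition src -> dst."""
        self.times[src][dst].append(duration_us)

    def probability(self, src, dst):
        """Fraction of the transitions leaving src that went to dst."""
        outgoing = self.times.get(src, {})
        total = sum(len(samples) for samples in outgoing.values())
        return len(outgoing.get(dst, [])) / total if total else 0.0

    def mean_time(self, src, dst):
        """Average time spent in src before a transition to dst, or None if unseen."""
        samples = self.times.get(src, {}).get(dst, [])
        return sum(samples) / len(samples) if samples else None


# Illustrative data: ten transitions out of A, split 2 / 7 / 1.
smm = SemiMarkovModel()
for dst, count, duration_us in [("B", 2, 5.0), ("C", 7, 15.0), ("D", 1, 500.0)]:
    for _ in range(count):
        smm.observe("A", dst, duration_us)
```

With this data, `smm.probability("A", "C")` is 0.7 and `smm.mean_time("A", "C")` is 15.0, matching one of the edge labels in the figure.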

Page 8:

SMMs Represent Application Control Flow

SMM states correspond to
• Calls to MPI
• Code between MPI calls

Application Code

main() {
  MPI_Init()
  … Computation …
  MPI_Send(…, 1, MPI_INTEGER, …);
  for(…)
    foo();
  MPI_Recv(…, 1, MPI_INTEGER, …);
  MPI_Finalize();
}

foo() {
  MPI_Send(…, 1024, MPI_DOUBLE, …);
  … Computation …
  MPI_Recv(…, 1024, MPI_DOUBLE, …);
  … Computation …
}

Semi-Markov Model

[Figure: SMM states for the code above: main() Init, Computation, main() Send-INT, main() foo() Send-DBL, Computation, main() foo() Recv-DBL, Computation, main() Recv-INT, main() Finalize]

Different state for different calling context
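The idea of one state per calling context can be sketched in Python as follows. The trace format, the `state_name` and `count_transitions` helpers, and the assumption that the loop calls foo() twice are all hypothetical. Computation states are left out to keep the sketch short.

```python
from collections import Counter


def state_name(call_stack, mpi_op):
    """Label a state by its calling context plus the MPI operation."""
    context = " ".join(f"{fn}()" for fn in call_stack)
    return f"{context} {mpi_op}"


def count_transitions(trace):
    """Count how often each state-to-state transition occurs in a trace."""
    states = [state_name(stack, op) for stack, op in trace]
    return Counter(zip(states, states[1:]))


# Hypothetical trace of (call stack, MPI operation) events for the code above,
# assuming the loop in main() calls foo() twice.
trace = [
    (("main",), "Init"),
    (("main",), "Send-INT"),
    (("main", "foo"), "Send-DBL"),
    (("main", "foo"), "Recv-DBL"),
    (("main", "foo"), "Send-DBL"),
    (("main", "foo"), "Recv-DBL"),
    (("main",), "Recv-INT"),
    (("main",), "Finalize"),
]
transitions = count_transitions(trace)
```

Because the call stack is part of the label, a send issued from main() and a send issued from foo() land in different states even when the MPI operation is the same.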

Page 9:

Transitions Represent Time Spent at States

During execution each transition is observed multiple times
Time series of transition times: [t1, t2, …, tn]

Represented as probability distribution
• Gaussian
• Histogram

[Figure: SMM edges labeled with probability / time: .2 / 5μs, .7 / 15μs, .1 / 500μs]
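As a rough illustration of the two representations, the Python sketch below fits a Gaussian and a plain fixed-width histogram to one transition's time series. The function names, the bucket count and the sample times are assumptions. The sketch does not model the smoothing (line connectors, Gaussian tails) that the slides associate with the histogram variant.

```python
import math
import statistics


def gaussian_model(samples):
    """Return a density function for a Gaussian fitted to the samples."""
    mu = statistics.fmean(samples)
    sigma = max(statistics.pstdev(samples), 1e-9)  # avoid division by zero

    def density(t):
        z = (t - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

    return density


def histogram_model(samples, num_buckets=4):
    """Return a function giving the fraction of samples in the bucket containing t."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / num_buckets or 1.0  # all-equal samples: single bucket of width 1
    counts = [0] * num_buckets
    for t in samples:
        counts[min(int((t - lo) / width), num_buckets - 1)] += 1

    def probability(t):
        if t < lo or t > hi:
            return 0.0  # no training sample anywhere near this value
        return counts[min(int((t - lo) / width), num_buckets - 1)] / len(samples)

    return probability


# Hypothetical transition times in microseconds.
times = [14.0, 15.0, 16.0, 15.0, 14.5, 15.5]
gauss = gaussian_model(times)
hist = histogram_model(times)
```

The Gaussian needs only a mean and a standard deviation per transition, while the histogram stores one count per bucket but can follow distributions that are not bell-shaped.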

Page 10:

Transitions Represent Time Spent at States

Gaussian
• Cheaper
• Lower accuracy

Histogram
• More expensive
• Greater accuracy

[Figure: data samples over time values, modeled two ways. Gaussian: probabilities vs. time values. Histogram: bucket counts vs. time values, with line connectors between buckets and a Gaussian tail.]

Page 11:

Using SMMs to Help Detect Faults

Hardware faults → behavior abnormalities
Given sample runs, learn time distribution on each transition (top and bottom 0% or 10% of each transition’s times removed)
If some transition takes an unusual amount of time, declare it an error

[Figure: probabilities vs. time values]

Page 12:

Detection threshold computed from maximum normal variation

Need threshold to separate normal, abnormal timing

Threshold = lowest probability observed in set of sample runs (top and bottom 1% removed)

[Figure: probabilities vs. time values]

                      Nothing Removed   Top/Bottom 10% Removed
False Positive Rate   0%                19%
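The thresholding rule can be sketched in Python as below, assuming a Gaussian model per transition. The helper names, the strict less-than comparison and the sample times are assumptions and are not taken from the AutomaDeD implementation.

```python
import math
import statistics


def trim(samples, fraction):
    """Drop the top and bottom `fraction` of the sorted samples."""
    ordered = sorted(samples)
    k = int(len(ordered) * fraction)
    return ordered[k:len(ordered) - k] if k else ordered


def learn_detector(sample_times, trim_fraction=0.10):
    """Fit a Gaussian to the trimmed samples; return a predicate flagging abnormal times."""
    kept = trim(sample_times, trim_fraction)
    mu = statistics.fmean(kept)
    sigma = max(statistics.pstdev(kept), 1e-9)  # avoid division by zero

    def density(t):
        z = (t - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

    # Threshold = lowest probability seen among the retained training samples.
    threshold = min(density(t) for t in kept)
    return lambda t: density(t) < threshold


# Hypothetical training times (microseconds) for one transition.
sample_times = [15.0, 14.0, 16.0, 15.5, 14.5, 15.2, 14.8, 15.1, 14.9, 60.0]
is_abnormal_trimmed = learn_detector(sample_times, trim_fraction=0.10)
is_abnormal_untrimmed = learn_detector(sample_times, trim_fraction=0.0)
```

With 10% trimming, the training values 14.0 and 60.0 are discarded and are themselves flagged as abnormal afterwards. With no trimming, no training value can be flagged. This mirrors the trade-off in the table: a tighter model, at the cost of false positives on normal runs.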

Page 13:

Evaluated Fault Detector Using Fault Injection

NAS Parallel Benchmarks
• 16-process runs
• Input class A
• Used BT, CG, FT, MG, LU and SP (EP and IS use MPI in very simple ways)

Local delays (FIN_LOOP): 1, 5, 10 sec
MPI message drop (DROP_MESG) or repetition (REP_MESG)
Extra CPU-intensive (CPU_THR) or memory-intensive (MEM_THR) thread

Page 14:

Rates of Fault Detection Within 1ms of Injection

[Figure: stacked bar chart of detection outcomes (0% to 100%) for each injected fault type (FIN_LOOP-1, FIN_LOOP-5, FIN_LOOP-10, DROP_MESG, REP_MESG, CPU_THR, MEM_THR), with one bar for the supervised detector and one for supervFilt(10). Outcome categories: no detection, false detection before injection, detection of fault within 1ms, detection after 1ms.]

Filtering usually improves detection rates

Single-point events easier to detect than persistent changes

Page 15:

SMMs used to Help Identify Software Faults in MPI Applications

User knows application has fault but needs help to focus on cause

Help identify point where fault first manifests as change in application behavior

Key tasks on faulty run:
• Identify time period of manifestation
• Identify task where fault first manifested
• Identify code region where fault first manifested

Page 16:

Focus on the Time Period of Unusual Behavior

User marks phase boundaries in code
Compute SMM for each task/phase

[Figure: execution divided into phases, with one SMM per task (Task 1, Task 2, …, Task n) in each phase]

Page 17:

Focus on the Time Period of Abnormal Behavior

Find phase with most unusual SMMs
If sample runs available, compare faulty run’s SMMs to sample runs’ SMMs
If none available, compare each phase to others

[Figure: per-phase SMMs of the faulty run compared with those of a sample run]
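A minimal Python sketch of this phase selection. The slides do not state the dissimilarity measure, so the L1 distance over transition probabilities used here is an assumption, as are the function names and the representation of an SMM as a dictionary from (source, destination) edges to probabilities.

```python
def smm_distance(a, b):
    """L1 distance between two SMMs given as {(src, dst): probability} dicts."""
    edges = set(a) | set(b)
    return sum(abs(a.get(e, 0.0) - b.get(e, 0.0)) for e in edges)


def most_abnormal_phase(faulty_phases, sample_phases=None):
    """Index of the phase whose SMM differs most from its reference.

    With sample runs, phase i is compared to phase i of the sample run.
    Without them, each phase is compared to all other phases of the same run.
    """
    scores = []
    for i, smm in enumerate(faulty_phases):
        if sample_phases is not None:
            score = smm_distance(smm, sample_phases[i])
        else:
            others = [p for j, p in enumerate(faulty_phases) if j != i]
            score = sum(smm_distance(smm, o) for o in others) / max(len(others), 1)
        scores.append(score)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical per-phase SMMs: the third phase behaves differently.
normal = {("A", "B"): 0.9, ("A", "C"): 0.1}
odd = {("A", "B"): 0.2, ("A", "C"): 0.8}
faulty_run = [normal, normal, odd, normal]
sample_run = [normal, normal, normal, normal]
```

Both `most_abnormal_phase(faulty_run, sample_run)` and `most_abnormal_phase(faulty_run)` select index 2 here. Comparing phases to each other is only meaningful when phases are expected to behave alike, for example iterations of the same solver loop.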

Page 18:

Cluster Tasks According to Behavior to Identify Abnormal Task

User provides application’s natural cluster count k
Use sample execution to compute clustering threshold τ that produces k clusters
• Use sample runs if available
• Otherwise, compute τ from start of execution

During real runs cluster tasks using threshold τ

[Figure: tasks 1 through n clustered by behavior; master-worker example with Tasks 1 to 9, shown normally and with a bug in Task 9]
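One way to picture the role of τ is the Python sketch below. The slides do not name a clustering algorithm, so single-linkage clustering is an assumption here. Summarizing each task by a single number, the function names and the data are likewise hypothetical.

```python
from itertools import combinations


def cluster(points, tau, dist):
    """Single-linkage clustering: link any two tasks whose distance is <= tau."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= tau:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def learn_tau(sample_points, k, dist):
    """Smallest pairwise distance that, used as threshold, yields at most k clusters."""
    candidates = sorted(dist(a, b) for a, b in combinations(sample_points, 2))
    for tau in candidates:
        if len(cluster(sample_points, tau, dist)) <= k:
            return tau
    return candidates[-1]


def abs_diff(a, b):
    return abs(a - b)


# Hypothetical master-worker application with k = 2: task 0 is the master.
sample_behavior = [100, 10, 15, 8, 12]
tau = learn_tau(sample_behavior, k=2, dist=abs_diff)

# Real run in which the last worker (index 5) misbehaves.
real_behavior = [100, 11, 14, 9, 12, 40]
clusters = cluster(real_behavior, tau, abs_diff)
```

In this toy data τ comes out as 3, and the misbehaving worker ends up in a cluster of its own: `clusters` is `[[0], [1, 2, 3, 4], [5]]`.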

Page 19:

Cluster Tasks According to Behavior to Identify Abnormal Task

Compare tasks in each cluster to their behavior in
• Sample runs
• Start of execution

Most abnormal is identified

Transition most responsible for difference identified as origin

[Figure: clustered Tasks 1 to 9, with a bug in Task 9]

Page 20:

Evaluated Fault Detector Using Fault Injection

NAS Parallel Benchmarks
• 16-task, Class A: BT, CG, FT, MG, LU and SP

2000 injection experiments per application
• Local livelock/deadlock (FIN_LOOP, INF_LOOP)
• Message drop (DROP_MESG), repetition (REP_MESG)
• CPU-intensive (CPU_THR) or memory-intensive (MEM_THR) thread

Examined variants of training runs
• 20 training runs with no faults
• 20 training runs, 10% have fault
• No training runs

Page 21:

Phase Detection Accuracy

Accuracy ~90% for loops and message drops, ~60% for extra threads
• Training significantly better than no training (10% bug training is close)
• Histograms better than Gaussians

[Figure: bar chart of phase detection accuracy (0% to 100%) per fault type (CPU_THR, MEM_THR, DROP_MSG, REP_MSG, FIN_LOOP-1, FIN_LOOP-5, FIN_LOOP-10, INF_LOOP). Series: Fault10 - Gauss, Fault10 - Histogram, NoFault - Gauss, NoFault - Histogram, NoSample - Gauss, NoSample - Histogram. Callouts: training vs. no training; NoFault sample vs. some faults; Gaussian vs. histogram.]

Page 22:

Cluster Isolation Accuracy

Results assume phase detected accurately
Accuracy of cluster isolation highly variable
• Depends on propagation of fault’s effects
• Accuracy up to 90% for extra threads
• Poor detection elsewhere since no information on event timing

[Figure: bar chart of cluster isolation accuracy (0% to 100%) per fault type (CPU_THR, MEM_THR, DROP_MSG, REP_MSG, FIN_LOOP-1, FIN_LOOP-5, FIN_LOOP-10, INF_LOOP), one series per benchmark: BT, CG, FT, LU, MG, SP]

Page 23:

Cluster Isolation Accuracy

Extended cluster isolation with information on event order

Focuses on first abnormal transition
Significantly better accuracy for loop faults

[Figure: "Abnormal Transition - Fault10". Bar chart of cluster isolation accuracy (0% to 100%) per fault type (CPU_THR, MEM_THR, DROP_MSG, REP_MSG, FIN_LOOP-1, FIN_LOOP-5, FIN_LOOP-10, INF_LOOP), one series per benchmark: BT, CG, FT, LU, MG, SP]

Page 24:

Transition Isolation

Accuracy: injected transition in top 5 candidates
• Accuracy ~90% for loop faults
• Highly variable for others
• Less variable if event order information is used

[Figure: bar chart of transition isolation accuracy (0% to 100%) per fault type (CPU_THR, MEM_THR, DROP_MSG, REP_MSG, FIN_LOOP-1, FIN_LOOP-5, FIN_LOOP-10, INF_LOOP), one series per benchmark: BT, CG, FT, LU, MG, SP]
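The top-5 accuracy criterion above can be sketched in Python as follows. How the tool scores each transition's contribution is not given in the slides. The absolute difference in mean transition time used here is an assumption, and the function names and data are hypothetical.

```python
def rank_transitions(faulty, reference):
    """Order transitions by their contribution to the SMM difference, largest first.

    Both arguments map (src, dst) edges to a mean transition time in microseconds.
    """
    edges = set(faulty) | set(reference)
    contribution = {e: abs(faulty.get(e, 0.0) - reference.get(e, 0.0)) for e in edges}
    return sorted(edges, key=lambda e: (-contribution[e], e))


def top_n_hit(faulty, reference, injected, n=5):
    """True if the transition where the fault was injected is among the top n candidates."""
    return injected in rank_transitions(faulty, reference)[:n]


# Hypothetical mean transition times: the fault was injected on B -> C.
reference = {("A", "B"): 5.0, ("B", "C"): 15.0, ("C", "D"): 500.0}
faulty = {("A", "B"): 5.2, ("B", "C"): 900.0, ("C", "D"): 510.0}
ranking = rank_transitions(faulty, reference)
```

Here `ranking[0]` is `("B", "C")`, so the injected transition would count as a hit. In a fault-injection study, accuracy is then the fraction of experiments for which `top_n_hit` returns True.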

Page 25:

Abnormality Detection Helps Illuminate MVAPICH Bug

Job execution script failed to clean up at job end, left runaway processes on nodes

Simulated by executing BT (16- and 64-task runs) concurrently with LU, MG or SP (16-task runs)

Experiments show
• Average SMM difference in regular BT runs
• Difference between BT runs with interference and no-interference runs
• Overlap execution during initial portion of BT run

Page 26:

Abnormality Detection Helps Illuminate MVAPICH Bug

Experiments show
• Average SMM difference in regular BT runs
• Difference between BT runs with interference and no-interference runs

[Figure: 16-task BT / 16-task SP/LU/MG. SMM deviation score (log scale, 1E+1 to 1E+5) vs. phase (1 to 10). Series: AVG No-Interference, Concurrent SP, Concurrent LU, Concurrent MG]

Page 27:

Abnormality Detection Helps Illuminate MVAPICH Bug

Experiments show
• Average SMM difference in regular BT runs
• Difference between BT runs with interference and no-interference runs

[Figure: 64-task BT / 16-task SP/LU/MG. SMM deviation score (log scale, 1E+2 to 1E+5) vs. phase (1 to 10). Series: AVG No-Interference, Concurrent SP, Concurrent LU, Concurrent MG]

Page 28:

Behavior Modeling is Critical Component of Fault Detection and Analysis

Complex behavior of applications and systems
Statistical models provide accurate summary
Promising results
• Quick detection of faults
• Focused localization of root causes

Ongoing work
• Scaling implementations to real HPC systems
• Improving accuracy through
  • More data
  • Models custom-tailored to applications