Lawrence Livermore National Laboratory

Statistical Fault Detection and Analysis with AutomaDeD

Greg Bronevetsky, in collaboration with Ignacio Laguna, Saurabh Bagchi, Bronis R. de Supinski, Dong H. Ahn, and Martin Schulz

LLNL-PRES-424502


Reliability is a Critical Challenge in Large Systems

Need tools to detect faults and identify their causes
• Fault tolerance: requires fault detection
• System management: need to know what failed

Faults come from various causes
• Hardware: soft errors, marginal circuits, physical degradation, design bugs
• Software: coding bugs, misconfigurations



In General, Fault Detection and Fault Tolerance are Undecidable

Option 1: Make all applications fault resilient
• Application-specific solutions are hard to design
• Many applications
• How does fault resilience compose?

Option 2: Develop approximate fault detection, tolerate faults via checkpointing and similar techniques
• Statistically model application behavior
• Look for deviations from model behavior
• Identify model components that likely caused the deviation






Focus on Modeling Individual MPI Applications

Primary goal is fault detection for HPC applications
• Model behavior of a single MPI application
• Detect deviations from the norm
• Identify origin of deviation in time/space

Other branches of the field
• Model system component interactions
• Model application as dataflow graph of modules
• Model micro-architecture state as vulnerable/non-vulnerable (ACE analysis)



Goal: Detect Unusual Application Behavior, Identify Cause

Three kinds of comparison for an MPI application
• Single run, spatial: differences between the behavior of processes
• Single run, temporal: differences between one time point and others
• Multiple runs: differences between the behavior of runs



Semi-Markov Models

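SMM: a transition system
• Nodes: application states
• Edges: transitions from one state to another, each labeled with
  - Probability of the transition
  - Time spent in the prior state before the transition

[Figure: example SMM with states A, B, C, D and edges labeled probability / time: .2 / 5μs, .7 / 15μs, .1 / 500μs]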
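A minimal sketch of how such a model could be built from observed transitions. It assumes the three labeled edges in the figure all leave state A, and the function name build_smm and the input format are hypothetical, not taken from AutomaDeD itself.

```python
from collections import defaultdict
from statistics import mean


def build_smm(observations):
    """Build a simple SMM from (src, dst, seconds) transition observations.

    Returns {src: {dst: (probability, mean_time_in_src)}}.
    """
    times = defaultdict(lambda: defaultdict(list))
    for src, dst, seconds in observations:
        times[src][dst].append(seconds)

    smm = {}
    for src, edges in times.items():
        total = sum(len(samples) for samples in edges.values())
        smm[src] = {
            dst: (len(samples) / total, mean(samples))
            for dst, samples in edges.items()
        }
    return smm


# Hypothetical trace: state A is left ten times.
obs = (
    [("A", "B", 5e-6)] * 2
    + [("A", "C", 15e-6)] * 7
    + [("A", "D", 500e-6)] * 1
)
smm = build_smm(obs)
for dst, (prob, avg) in smm["A"].items():
    print(f"A -> {dst}: p={prob:.1f}, time={avg * 1e6:.0f}us")
```

Storing only the mean time is a simplification; the deck models the time on each edge as a full distribution.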



SMMs Represent Application Control Flow

SMM states correspond to
• Calls to MPI
• Code between MPI calls

Application Code

    main() {
      MPI_Init()
      … Computation …
      MPI_Send(…, 1, MPI_INTEGER, …);
      for(…) foo();
      MPI_Recv(…, 1, MPI_INTEGER, …);
      MPI_Finalize();
    }

    foo() {
      MPI_Send(…, 1024, MPI_DOUBLE, …);
      … Computation …
      MPI_Recv(…, 1024, MPI_DOUBLE, …);
      … Computation …
    }

Semi-Markov Model

States shown in the figure (a different state for each calling context):
• main() Init
• Computation
• main() Send-INT
• main() foo() Send-DBL
• Computation
• main() foo() Recv-DBL
• Computation
• main() Recv-INT
• main() Finalize
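The naming scheme in the figure (calling context plus MPI operation plus datatype) can be sketched as follows. The event tuples and helper names are hypothetical, and the Computation states between MPI calls are left out to keep the example short.

```python
def state_name(call_stack, mpi_op, datatype=None):
    """Name an SMM state from its calling context and MPI operation."""
    context = "".join(f"{fn}()" for fn in call_stack)
    suffix = mpi_op if datatype is None else f"{mpi_op}-{datatype}"
    return f"{context} {suffix}"


def transitions(events):
    """Pair consecutive states into (src, dst) transitions."""
    states = [state_name(*event) for event in events]
    return list(zip(states, states[1:]))


# Hypothetical trace of one pass through main() with a single call to foo().
events = [
    (["main"], "Init"),
    (["main"], "Send", "INT"),
    (["main", "foo"], "Send", "DBL"),
    (["main", "foo"], "Recv", "DBL"),
    (["main"], "Recv", "INT"),
    (["main"], "Finalize"),
]

for src, dst in transitions(events):
    print(f"{src} -> {dst}")
```

Because the call stack is part of the name, the MPI_Send in main() and the MPI_Send in foo() map to different states.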



Transitions Represent Time Spent at States

During execution each transition is observed multiple times, giving a time series of transition times: [t1, t2, …, tn]

The time series is represented as a probability distribution
• Gaussian
• Histogram



Transitions Represent Time Spent at States

[Figure: the same data samples modeled two ways, plotted as probabilities over time values]
• Gaussian: cheaper, lower accuracy
• Histogram (figure labels: bucket counts, line connectors, Gaussian tail): more expensive, greater accuracy
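A small sketch of the trade-off, under stated assumptions: the histogram here uses plain equal-width buckets and omits the line connectors and Gaussian tails named in the figure, the sample values are made up, and the samples are assumed not to be all identical.

```python
import math
from statistics import mean, pstdev


def gaussian_model(samples):
    """Two parameters (mean, std dev): cheap, but assumes a single bell shape."""
    mu, sigma = mean(samples), pstdev(samples)

    def density(t):
        z = (t - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

    return density


def histogram_model(samples, buckets=10):
    """One count per bucket: more state, but follows the data's actual shape."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for t in samples:
        counts[min(int((t - lo) / width), buckets - 1)] += 1

    def probability(t):
        if t < lo or t > hi:
            return 0.0
        return counts[min(int((t - lo) / width), buckets - 1)] / len(samples)

    return probability


# Hypothetical bimodal transition times (microseconds).
samples = [10, 11, 12, 10, 11, 50, 51, 52, 50, 51]
gauss = gaussian_model(samples)
hist = histogram_model(samples)
print(gauss(30) > gauss(10), hist(30), hist(10))
```

On this bimodal data the Gaussian rates the never-observed value 30 as more likely than the common value 10, while the histogram gives it probability zero.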



Using SMMs to Help Detect Faults

Hardware faults → behavior abnormalities

Given sample runs, learn the time distribution on each transition (top and bottom 0% or 10% of each transition’s times removed)

If some transition takes an unusual amount of time, declare it an error

[Figure: probabilities over time values]
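The trimming step can be sketched like this. The helper names and the sample times are hypothetical, and a Gaussian fit stands in for whichever distribution is learned.

```python
from statistics import mean, pstdev


def trim(samples, fraction):
    """Drop the top and bottom `fraction` of the sorted samples."""
    ordered = sorted(samples)
    cut = int(len(ordered) * fraction)
    return ordered[cut:len(ordered) - cut]


def learn_gaussian(samples, fraction=0.10):
    """Fit (mean, std dev) to the samples that survive trimming."""
    kept = trim(samples, fraction)
    return mean(kept), pstdev(kept)


# Hypothetical times for one transition (microseconds), with one outlier.
times_us = [15, 14, 16, 15, 15, 14, 16, 15, 15, 500]
untrimmed = learn_gaussian(times_us, 0.0)
trimmed = learn_gaussian(times_us, 0.10)
print(untrimmed, trimmed)
```

Trimming keeps a single extreme sample from widening the learned distribution, at the cost of treating some legitimate extremes as unusual.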



Detection threshold computed from maximum normal variation

Need threshold to separate normal from abnormal timing

Threshold = lowest probability observed in set of sample runs (Top and bottom 1% removed)

[Figure: probabilities over time values]

                       Nothing Removed    Top/Bottom 10% Removed
False Positive Rate    0%                 19%
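A minimal sketch of the threshold rule, assuming a Gaussian model and no trimming. The function names and sample values are hypothetical.

```python
import math
from statistics import mean, pstdev


def gaussian_density(mu, sigma):
    def density(t):
        z = (t - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

    return density


def learn_detector(sample_times):
    """Threshold = lowest density assigned to any time seen in the sample runs."""
    density = gaussian_density(mean(sample_times), pstdev(sample_times))
    threshold = min(density(t) for t in sample_times)

    def is_abnormal(t):
        return density(t) < threshold

    return is_abnormal


# Hypothetical sample-run times for one transition (microseconds).
samples = [14.0, 15.0, 15.0, 16.0, 15.0, 14.0, 16.0, 15.0]
is_abnormal = learn_detector(samples)
print(is_abnormal(15.0), is_abnormal(17.0), is_abnormal(500.0))
```

In this sketch no untrimmed training sample can fall below the threshold by construction; once extremes are trimmed before learning, some legitimate times can.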



Evaluated Fault Detector Using Fault Injection

NAS Parallel Benchmarks
• 16-process runs
• Input class A
• Used BT, CG, FT, MG, LU and SP (EP and IS use MPI in very simple ways)

Injected faults
• Local delays (FIN_LOOP): 1, 5, 10 sec
• MPI message drop (DROP_MESG) or repetition (REP_MESG)
• Extra CPU-intensive (CPU_THR) or memory-intensive (MEM_THR) thread



Rates of Fault Detection Within 1ms of Injection

[Figure: stacked bars (0% to 100%) for each fault type (FIN_LOOP-1, FIN_LOOP-5, FIN_LOOP-10, DROP_MESG, REP_MESG, CPU_THR, MEM_THR), with two bars per fault type: supervised and supervFilt(10). Outcome categories: no detection, false detection before injection, detection of fault within 1ms, detection after 1ms]

• Filtering usually improves detection rates
• Single-point events are easier to detect than persistent changes



SMMs used to Help Identify Software Faults in MPI Applications

User knows application has fault but needs help to focus on cause

Help identify point where fault first manifests as change in application behavior

Key tasks on faulty run:
• Identify time period of manifestation
• Identify task where fault first manifested
• Identify code region where fault first manifested



Focus on the Time Period of Unusual Behavior

User marks phase boundaries in code

Compute SMM for each task/phase

[Figure: one SMM per task (Task 1, Task 2, …, Task n) in each phase]



Focus on the Time Period of Abnormal Behavior

Find phase with most unusual SMMs
• If sample runs are available, compare faulty run’s SMMs to sample runs’ SMMs
• If none are available, compare each phase to others

[Figure: per-phase SMMs of the faulty run next to those of a sample run]
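A sketch of the sample-run case. The slides do not say how two SMMs are compared, so the L1 distance over transition probabilities used here is an assumption, as are the function names and data.

```python
def smm_distance(a, b):
    """L1 distance between two {(src, dst): probability} maps (assumed metric)."""
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in keys)


def most_unusual_phase(faulty_phases, sample_phases):
    """Return (index of the phase that differs most, per-phase scores)."""
    scores = [smm_distance(f, s) for f, s in zip(faulty_phases, sample_phases)]
    worst = max(range(len(scores)), key=scores.__getitem__)
    return worst, scores


# Hypothetical per-phase SMMs reduced to transition probabilities.
normal = {("A", "B"): 0.5, ("A", "C"): 0.5}
skewed = {("A", "B"): 0.9, ("A", "C"): 0.1}
sample_run = [normal, normal, normal]
faulty_run = [normal, skewed, normal]

phase, scores = most_unusual_phase(faulty_run, sample_run)
print(phase, scores)
```

Without sample runs, the same distance could be computed between each phase and the other phases of the faulty run.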



Cluster Tasks According to Behavior to Identify Abnormal Task

User provides application’s natural cluster count k

Use sample execution to compute clustering threshold τ that produces k clusters
• Use sample runs if available
• Otherwise, compute τ from start of execution

During real runs cluster tasks using threshold τ

[Figure: tasks 1 to n grouped into clusters by behavior; examples labeled "Master-Worker" and "Bug in Task 9"]
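One way to picture the role of τ, as a sketch under assumptions: tasks are joined by single-linkage whenever their dissimilarity is at most τ, and each task's behavior is reduced to a single made-up number. The slides do not specify the clustering algorithm or the dissimilarity measure.

```python
def cluster(dist, tau):
    """Single-linkage clusters: tasks are joined when dissimilarity <= tau."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= tau:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def threshold_for_k(dist, k):
    """Smallest pairwise dissimilarity that yields at most k clusters."""
    n = len(dist)
    candidates = sorted({dist[i][j] for i in range(n) for j in range(i + 1, n)})
    for tau in candidates:
        if len(cluster(dist, tau)) <= k:
            return tau
    return None


def pairwise(values):
    return [[abs(a - b) for b in values] for a in values]


# Hypothetical per-task behavior summaries; k = 2 natural clusters.
sample_dist = pairwise([10, 11, 12, 50, 51])
tau = threshold_for_k(sample_dist, 2)

# Same tasks in a later run where the last task drifts away.
faulty_dist = pairwise([10, 11, 12, 50, 90])
print(tau, cluster(faulty_dist, tau))
```

In this example the drifting task no longer fits either cluster at the learned τ and ends up alone.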



Cluster Tasks According to Behavior to Identify Abnormal Task

Compare tasks in each cluster to their behavior in
• Sample runs
• Start of execution

Most abnormal task is identified

Transition most responsible for the difference is identified as the origin

[Figure: clustered tasks, example labeled "Bug in Task 9"]
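The two isolation steps can be sketched as below. The per-transition score (absolute difference in mean time against a baseline) is an assumption, and all names and numbers are hypothetical.

```python
def transition_contributions(baseline, observed):
    """Per-transition absolute difference in mean time (assumed measure)."""
    keys = set(baseline) | set(observed)
    return {k: abs(observed.get(k, 0.0) - baseline.get(k, 0.0)) for k in keys}


def isolate(baseline_by_task, observed_by_task):
    """Return (most abnormal task, its score, transition contributing most)."""
    best = None
    for task, observed in observed_by_task.items():
        contrib = transition_contributions(baseline_by_task[task], observed)
        score = sum(contrib.values())
        if best is None or score > best[1]:
            best = (task, score, max(contrib, key=contrib.get))
    return best


# Hypothetical mean transition times (microseconds) for three tasks.
send_comp = ("Send", "Computation")
comp_recv = ("Computation", "Recv")
baseline = {t: {send_comp: 10.0, comp_recv: 20.0} for t in (1, 2, 9)}
observed = {
    1: {send_comp: 10.5, comp_recv: 20.0},
    2: {send_comp: 10.0, comp_recv: 21.0},
    9: {send_comp: 10.0, comp_recv: 400.0},
}

task, score, origin = isolate(baseline, observed)
print(task, score, origin)
```

The baseline would come from sample runs when they exist, or from the start of the same execution otherwise.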



Evaluated Fault Detector Using Fault Injection

NAS Parallel Benchmarks
• 16-task, Class A: BT, CG, FT, MG, LU and SP

2000 injection experiments per application
• Local livelock/deadlock (FIN_LOOP, INF_LOOP)
• Message drop (DROP_MESG), repetition (REP_MESG)
• CPU-intensive (CPU_THR) or memory-intensive (MEM_THR) thread

Examined variants of training runs
• 20 training runs with no faults
• 20 training runs, 10% have fault
• No training runs



Phase Detection Accuracy

Accuracy ~90% for loops and message drops, ~60% for extra threads
• Training significantly better than no training (10% bug training is close)
• Histograms better than Gaussians

[Figure: phase detection accuracy (0% to 100%) per fault type (CPU_THR, MEM_THR, DROP_MSG, REP_MSG, FIN_LOOP-1, FIN_LOOP-5, FIN_LOOP-10, INF_LOOP) for six configurations: Fault10 - Gauss, Fault10 - Histogram, NoFault - Gauss, NoFault - Histogram, NoSample - Gauss, NoSample - Histogram. Annotated comparisons: training vs. no training, NoFault sample vs. some faults, Gaussian vs. histogram]



Cluster Isolation Accuracy

Results assume phase detected accurately

Accuracy of cluster isolation is highly variable
• Depends on propagation of fault’s effects
• Accuracy up to 90% for extra threads
• Poor detection elsewhere since no information on event timing

[Figure: cluster isolation accuracy (0% to 100%) per fault type (CPU_THR, MEM_THR, DROP_MSG, REP_MSG, FIN_LOOP-1, FIN_LOOP-5, FIN_LOOP-10, INF_LOOP) for each benchmark: BT, CG, FT, LU, MG, SP]



Cluster Isolation Accuracy

Extended cluster isolation with information on event order
• Focuses on first abnormal transition
• Significantly better accuracy for loop faults

[Figure: "Abnormal Transition - Fault10": cluster isolation accuracy (0% to 100%) per fault type (CPU_THR, MEM_THR, DROP_MSG, REP_MSG, FIN_LOOP-1, FIN_LOOP-5, FIN_LOOP-10, INF_LOOP) for each benchmark: BT, CG, FT, LU, MG, SP]



Transition Isolation

Accuracy: injected transition in top 5 candidates
• Accuracy ~90% for loop faults
• Highly variable for others
• Less variable if event order information is used

[Figure: transition isolation accuracy (0% to 100%) per fault type (CPU_THR, MEM_THR, DROP_MSG, REP_MSG, FIN_LOOP-1, FIN_LOOP-5, FIN_LOOP-10, INF_LOOP) for each benchmark: BT, CG, FT, LU, MG, SP]
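The accuracy measure on this slide is a top-5 hit rate. A minimal sketch of computing it, with hypothetical experiment records and transition names:

```python
def top_k_accuracy(experiments, k=5):
    """Fraction of experiments whose injected transition is in the top k."""
    hits = sum(1 for ranked, injected in experiments if injected in ranked[:k])
    return hits / len(experiments)


# Hypothetical results: (candidate transitions ranked by abnormality, injected one).
experiments = [
    (["t3", "t1", "t7", "t2", "t9", "t4"], "t7"),  # rank 3: hit
    (["t3", "t1", "t7", "t2", "t9", "t4"], "t4"),  # rank 6: miss
    (["t1", "t2"], "t1"),                          # rank 1: hit
    (["t5", "t6", "t7", "t8", "t9"], "t9"),        # rank 5: hit
]
accuracy = top_k_accuracy(experiments)
print(accuracy)
```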



Abnormality Detection Helps Illuminate MVAPICH Bug

Job execution script failed to clean up at job end and left runaway processes on nodes

Simulated by executing BT (16- and 64-task runs) concurrently with LU, MG or SP (16-task runs)

Experiments show
• Average SMM difference in regular BT runs
• Difference between BT runs with interference and no-interference runs
• Overlap execution during initial portion of BT run





[Figure: 16-task BT / 16-task SP/LU/MG. SMM deviation score (log scale, 1E+1 to 1E+5) by phase (1 to 10). Series: AVG No-Interference, Concurrent SP, Concurrent LU, Concurrent MG]

[Figure: 64-task BT / 16-task SP/LU/MG. SMM deviation score (log scale, 1E+2 to 1E+5) by phase (1 to 10). Series: AVG No-Interference, Concurrent SP, Concurrent LU, Concurrent MG]



Behavior Modeling is Critical Component of Fault Detection and Analysis

Complex behavior of applications and systems

Statistical models provide accurate summary

Promising results
• Quick detection of faults
• Focused localization of root causes

Ongoing work
• Scaling implementations to real HPC systems
• Improving accuracy through
  - More data
  - Models custom-tailored to applications
