Statistical Approaches for Finding Bugs in Large-Scale Parallel Systems Leonardo R. Bachega
Dec 17, 2015

Page 1

Statistical Approaches for Finding Bugs in Large-Scale Parallel Systems

Leonardo R. Bachega

Page 2

Papers

1. Problem Diagnosis in Large-Scale Computing Environments, A. Mirgorodskiy, N. Maruyama, Barton Miller, SC 2006;

2. DMTracker: Finding Bugs in Large-Scale Parallel Programs by Detecting Anomaly in Data Movements, Q. Gao, F. Qin, D. Panda, SC 2007.

Page 3

Motivation for the Papers

• Debugging is a very hard task
– ½ of the development time in sequential applications
– Problem gets magnified in systems with hundreds of processes
• Massively parallel systems are becoming popular
– How do we make parallel debugging easier by leveraging statistical bug detection techniques?

Page 4

Background: Statistical Techniques

– Explore properties likely to hold at certain program points
– Run-time information collected in traces
– Empirical execution models (profiles): built from trace information
– Find similarities (and dissimilarities) between profiles
– Classification into groups
– Outliers as suspects for buggy behavior
– Assumption: correct behavior is the common case; faulty behavior is unusual, a deviation from the common case

Page 5

Paper 1: Miller’s

[Figure: Proc 1, Proc 2, Proc 3, …, Proc N-1, Proc N. Labels: "Processes performing similar tasks", "Anomalous behavior"]

Page 6

Paper’s Main Ideas

• Unusual process behavior detection by comparison with other processes
• “Control flow” trace collection
– Function call information
• Per-process trace analysis
– Fail-stop: processes that stop generating traces
– Distance-based outlier detection: isolate processes that behave differently (non-fail-stop)

Page 7

Fault Model

• Non-deterministic fail-stop failures
– Failing process stops collecting traces earlier
• Infinite loops
– Process spends an unusual amount of time in a particular function
• Deadlock, livelock, starvation
– Deadlocked procs stop generating traces
– Starving procs spend time in different parts than procs with resources granted
• Load imbalance
– Unusually little time spent on certain parts
– Analyst identifies

Page 8

Limitations of Fault Model

A problem that…

• Happens in all nodes is considered normal behavior

• Doesn’t change the control flow is not detected

• Happens too early can’t be tracked since the trace collection is limited (can’t go too far back in history)

Page 9

Finding Misbehaving Host

• Earliest Last Timestamp
– Identifies host that stopped generating the trace
– Fail-stop problems: crashes, infinite blocking
– Assume global clock synchronization: |Tmin – Tavg| > threshold
• Behavioral Outliers
– Identify traces different from the rest
– Distance-based outlier detection
– Pair-wise distance between traces
– Suspect score for each process
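The earliest-last-timestamp test above can be sketched in a few lines. This is only an illustration: the input format (a mapping from host name to the timestamp of its last trace record), the function name, and the sample values are assumptions, not taken from the paper.

```python
from statistics import mean

def earliest_last_timestamp(last_ts, threshold):
    """Return the fail-stop suspect host, or None if no host stands out.

    last_ts maps host name -> timestamp of that host's last trace record
    (hypothetical input format; clocks are assumed globally synchronized).
    """
    host_min = min(last_ts, key=last_ts.get)   # host whose trace ends first
    t_min = last_ts[host_min]
    t_avg = mean(last_ts.values())
    # Test from the slide: |Tmin - Tavg| > threshold
    return host_min if abs(t_min - t_avg) > threshold else None

# Made-up timestamps in seconds: h3 stops long before the others.
last_ts = {"h1": 1000.0, "h2": 1002.0, "h3": 498.0, "h4": 1001.0}
suspect = earliest_last_timestamp(last_ts, threshold=100.0)
print(suspect)
```

The threshold is left as a caller-supplied parameter because the slide does not say how it is chosen.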

Page 10

Profile’s distance metrics

If h and g are similar: each function will consume similar amounts of time on both hosts and d(g,h) will be low

[Equation not preserved. Labels: "Time spent at f1 in host h", "Manhattan distance"]
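A minimal sketch of a Manhattan distance between two per-host time profiles follows. The dictionary representation and the normalization step (dividing by the host's total time so profiles are comparable) are assumptions made for illustration; the slide only states that the profile holds the time spent in each function on a host.

```python
def normalize(times):
    """Turn raw per-function times into fractions of the host's total time.

    Normalization is an assumption of this sketch, not stated on the slide.
    """
    total = sum(times.values())
    return {f: t / total for f, t in times.items()} if total else dict(times)

def manhattan(p_g, p_h):
    """Sum of absolute per-function differences between two profiles."""
    funcs = set(p_g) | set(p_h)          # a function missing on a host counts as 0
    return sum(abs(p_g.get(f, 0.0) - p_h.get(f, 0.0)) for f in funcs)

# Made-up times (seconds) spent in two functions on hosts g and h.
g = normalize({"f1": 6.0, "f2": 4.0})
h = normalize({"f1": 5.0, "f2": 5.0})
d_gh = manhattan(g, h)
print(round(d_gh, 3))
```

Similar hosts yield a small d(g,h); a host that spends its time in different functions yields a large one.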

Page 11

Behavioral Outliers

• Consider all common behaviors as normal
• Parameter k adjusts what counts as common behavior
• Score: high for outliers, low for common behavior

K-nearest neighbor algorithm:
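One way to sketch the suspect score is to take, for each process, the distance to its k-th nearest neighbor among the other profiles. The profile format, the function names, and the sample data are assumptions of this sketch.

```python
def manhattan(p, q):
    """Sum of absolute per-function differences between two profiles."""
    funcs = set(p) | set(q)
    return sum(abs(p.get(f, 0.0) - q.get(f, 0.0)) for f in funcs)

def suspect_scores(profiles, k):
    """Score each host by the distance to its k-th nearest neighbor.

    profiles maps host -> {function: fraction of time} (hypothetical format).
    Requires 1 <= k <= len(profiles) - 1.
    """
    scores = {}
    for host, p in profiles.items():
        dists = sorted(manhattan(p, q) for other, q in profiles.items() if other != host)
        scores[host] = dists[k - 1]      # small if at least k hosts behave alike
    return scores

# Made-up profiles: h1..h3 behave alike, h4 spends its time elsewhere.
profiles = {
    "h1": {"a": 0.5, "b": 0.5},
    "h2": {"a": 0.5, "b": 0.5},
    "h3": {"a": 0.6, "b": 0.4},
    "h4": {"a": 0.1, "b": 0.9},
}
scores = suspect_scores(profiles, k=2)
top_suspect = max(scores, key=scores.get)
print(top_suspect)
```

A larger k tolerates larger groups of similar-looking processes before treating them as common behavior.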

Page 12

Finding Anomalies’ Causes

• Last Trace Entry: function that failed
– Can be misleading
– Solution: look at sequences of calls
• Max of Delta Vector: function that differs most from the normal behavior (largest contribution to suspect score)
• Anomalous time interval:
– Partition traces from all hosts in short intervals
– Apply outlier detection: identify earliest fragment with outlier
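The "max of delta vector" step can be illustrated as below. The choice of reference profile (whatever the analyst treats as normal behavior) is left to the caller; the function name and sample profiles are hypothetical.

```python
def max_delta_function(suspect, reference):
    """Return (function, delta) with the largest absolute profile difference.

    suspect and reference map function name -> fraction of time
    (hypothetical format). The reference stands in for normal behavior.
    """
    funcs = set(suspect) | set(reference)
    delta = {f: suspect.get(f, 0.0) - reference.get(f, 0.0) for f in funcs}
    worst = max(delta, key=lambda f: abs(delta[f]))
    return worst, delta[worst]

# Made-up profiles: the suspect spends far more time in recv than normal.
suspect_profile = {"recv": 0.7, "compute": 0.2, "send": 0.1}
normal_profile = {"recv": 0.2, "compute": 0.6, "send": 0.2}
name, diff = max_delta_function(suspect_profile, normal_profile)
print(name)
```

With a Manhattan distance, the component with the largest absolute delta is also the largest single contribution to the distance.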

Page 13

Results

• Network stability problem
– Fail-stop behavior
– One node stops 500 seconds earlier than others
– Earliest timestamp approach
• Broadcast service
– No fail-stop behavior
– Suspect score from failed run traces

Page 14

Summary and Conclusions

• Trace analysis to explain failures in large-scale distributed systems

• Detect anomalies rather than massive failures

• Identify both fail-stop and non-fail-stop anomalous behavior

Page 15

Paper 2: DMTracker

[Figure: two panels, each showing Proc 1, Proc 2, Proc 3, …, Proc N-1, Proc N. Labels: "Processes performing similar tasks", "Anomalous behavior", "Spatial Dissimilarity", "Temporal Dissimilarity"]

Page 16

Paper’s Main Ideas

• Tracks abnormal behaviors in data movements (DM)
• Works on data movement chains: memory allocation, copies, sends/receives
• Extracts DM-invariants and checks for violations of these invariants
• Violations indicate potential bugs
• Two types of invariants:
– Temporal: frequently occurring data movements (frequent chain, or FC)
– Spatial: clusters data movements across processes (chain distribution, or CD)

Page 17

Data Movement Chains

• Single-processor DMs: concatenation of memory operations of a trace file
• Multi-processor DMs: match sends/receives from processes’ traces

Page 18

Key: Data Movement Chain

[Figure: data movement chains in a normal execution and in a buggy execution]

Page 19

Data Movement-Based Invariants

• FC-invariant based: temporal similarity
– Similar DM-chains occur many times during execution
– Large groups (frequently happening) of DM-chains
• CD-invariant based: spatial similarity
– Processes perform similar or identical tasks
– Chain distribution clusters as CD-invariants

Page 20

DMTracker: Design Overview

Function calls
– Memory mgmt: allocation/deallocation
– Data movement: copies/network operations

Records
– Key arguments / return values
– Call sites
– Thread IDs
– Local timestamps

Correlates each operation to its source and destination

Page 21

Invariants generation

• Groups formed by chains of same type
• Chains of same type have the same
– call sites for individual DM operations
– allocation call sites for source and destination buffers
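The grouping rule above can be sketched as a dictionary keyed by chain type. The `Chain` record, its field names, and the sample call-site strings are hypothetical; only the two criteria for "same type" come from the slide.

```python
from collections import defaultdict
from typing import NamedTuple

class Chain(NamedTuple):
    """Hypothetical record for one data movement chain."""
    trace_id: int            # which trace the chain came from
    op_call_sites: tuple     # call sites of the individual DM operations, in order
    src_alloc_site: str      # allocation call site of the source buffer
    dst_alloc_site: str      # allocation call site of the destination buffer

def chain_type(chain):
    """Chains are of the same type when all three parts of this key match."""
    return (chain.op_call_sites, chain.src_alloc_site, chain.dst_alloc_site)

def group_by_type(chains):
    groups = defaultdict(list)
    for chain in chains:
        groups[chain_type(chain)].append(chain)
    return dict(groups)

# Made-up chains: the first two share a type, the third differs in one call site.
chains = [
    Chain(0, ("memcpy@a.c:10", "send@a.c:20"), "malloc@a.c:5", "malloc@b.c:7"),
    Chain(1, ("memcpy@a.c:10", "send@a.c:20"), "malloc@a.c:5", "malloc@b.c:7"),
    Chain(1, ("memcpy@a.c:10", "send@a.c:99"), "malloc@a.c:5", "malloc@b.c:7"),
]
groups = group_by_type(chains)
group_sizes = sorted(len(members) for members in groups.values())
print(group_sizes)
```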

Page 22

FC-Invariants

• Two criteria for invariants
– Chains in the group must happen frequently
– Chain type of each group must be “unique”
• Uniqueness of chain: aggregation of uniqueness values of memory operations

[Equation not preserved. Labels: "Tunable parameters", "# of segments of data"]

Page 23

FC-Invariant Anomaly Detection

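• Abnormality of P compared to C based on
– Similarity:

$$\mathrm{Similarity}(C,P) = \frac{\mathrm{Uniqueness}(C \cap P)}{\mathrm{Uniqueness}(C \cup P)} = \frac{\mathrm{Uniqueness}(C \cap P)}{\mathrm{Uniqueness}(C) + \mathrm{Uniqueness}(P) - \mathrm{Uniqueness}(C \cap P)}$$

– Rareness:

$$\mathrm{Rareness}(C,P) = \frac{\mathrm{Frequency}(P) - \mathrm{Frequency}(C)}{\mathrm{Frequency}(P)}$$

– Combined using harmonic mean
• Threshold for abnormality is an adjustable parameter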
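A rough sketch of how a similarity score and a rareness score could be combined with a harmonic mean is given below. Everything concrete here is an assumption made for illustration: chains are modeled as sets of operation identifiers, uniqueness is an additive sum of per-operation weights, similarity is a ratio of shared to combined uniqueness, and rareness is the relative frequency gap between the invariant P and the checked chain group C. The frequencies 15,075 and 154 are the counts reported on the DMTracker results slide.

```python
def uniqueness(ops, weights):
    """Aggregate per-operation uniqueness values (here: a plain sum; an assumption)."""
    return sum(weights[op] for op in ops)

def similarity(c_ops, p_ops, weights):
    """Shared uniqueness over combined uniqueness (assumed form)."""
    union = uniqueness(c_ops | p_ops, weights)
    return uniqueness(c_ops & p_ops, weights) / union if union else 0.0

def rareness(freq_c, freq_p):
    """Relative frequency gap; near 1 when C is rare compared to P (assumed form)."""
    return max(0.0, (freq_p - freq_c) / freq_p) if freq_p > 0 else 0.0

def abnormality(sim, rare):
    """Harmonic mean of the two scores; 0 if either score is 0."""
    return 2 * sim * rare / (sim + rare) if sim > 0 and rare > 0 else 0.0

# Hypothetical chains that share two of their three operations.
weights = {"malloc@a": 1.0, "memcpy@b": 1.0, "send@c": 1.0, "send@d": 1.0}
c_ops = {"malloc@a", "memcpy@b", "send@c"}
p_ops = {"malloc@a", "memcpy@b", "send@d"}

sim = similarity(c_ops, p_ops, weights)
rare = rareness(freq_c=154, freq_p=15075)
score = abnormality(sim, rare)
print(round(sim, 3), round(rare, 3), round(score, 3))
```

The harmonic mean keeps the combined score low unless both similarity and rareness are high, which matches the idea that a suspicious chain resembles a frequent one but occurs rarely.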

Page 24

CD-Invariants

• Clusters of chain distributions across processes: one profile per trace (process)
– DM chains in a particular trace
– DM chains originated in a particular trace
• Profile: frequency of chains in a trace
• K-nearest neighbor used to build invariants (clusters)

[Profile equation not preserved. Labels: "Total # of distinct chain groups", "Total # of chains in trace T", "Total # of chains of group C2 in trace T"]
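Going by the labels on this slide, a trace's profile can be read as a vector with one entry per distinct chain group, each entry being the number of chains of that group in the trace divided by the total number of chains in the trace. The sketch below follows that reading; the function name and sample data are hypothetical.

```python
from collections import Counter

def chain_distribution_profile(groups_in_trace, all_groups):
    """Fraction of the trace's chains that fall into each chain group.

    groups_in_trace: group label of every chain found in trace T.
    all_groups: ordered list of all distinct chain groups across traces.
    """
    counts = Counter(groups_in_trace)
    total = len(groups_in_trace)          # total number of chains in trace T
    return [counts[g] / total if total else 0.0 for g in all_groups]

# Made-up trace with four chains spread over three of four known groups.
all_groups = ["C1", "C2", "C3", "C4"]
profile = chain_distribution_profile(["C1", "C2", "C2", "C3"], all_groups)
print(profile)
```

Profiles built this way have the same length for every trace, so they can be compared with a pair-wise distance and clustered.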

Page 25

CD-Invariant Anomaly Detection

• Abnormal trace: distance to k-nearest neighbor exceeds threshold

• Exactly the same procedure as in Paper 1!

Page 26

DMTracker Results

• FC-Invariant (15,075 times) violated by similar chains: 154 times
– All processes triggered the bug

• CD-Invariant: catches non-deterministic bug

Page 27

DMTracker Summary

• Data Movement chains derived from traces

• Frequent Chain and Chain Distribution invariants to capture temporal and spatial correlations in parallel systems

• Case studies show bug detection

Page 28

General Observations

• Use of spatial and temporal invariants
• Detection of deviant behavior as opposed to common behavior
• Simple machine learning techniques applied for data classification
• Bug detection in large systems using outlier detection
• Very few results to support broad conclusions about the effectiveness of the techniques