Black-Box Problem Diagnosis in Parallel File Systems
Michael P. Kasick1
Jiaqi Tan2, Rajeev Gandhi1, Priya Narasimhan1
1Carnegie Mellon University 2DSO National Labs, Singapore
Michael P. Kasick Problem Diagnosis in Parallel File Systems February 24, 2010 1
Problem Diagnosis Goals
To diagnose problems in off-the-shelf parallel file systems
Environmental performance problems: disk & network faults
Target file systems: PVFS & Lustre
To develop methods applicable to existing deployments
Application transparency: avoid code-level instrumentation
Minimal overhead, training, and configuration
Support for arbitrary workloads: avoid models, SLOs, etc.
Motivation: Real Problem Anecdotes
Problems motivated by PVFS developers’ experiences
From Argonne’s Blue Gene/P PVFS cluster
“Limping-but-alive” server problems
No errors reported, can’t identify faulty node with logs
Single faulty server impacts overall system performance
Storage-related problems:
Accidental launch of rogue processes decreases throughput
Buggy RAID controller issues patrol reads when not at idle
Network-related problems:
Faulty switch ports corrupt packets, which fail CRC checks
Overloaded switches drop packets but pass diagnostic tests
Outline
1 Introduction
2 Experimental Methods
3 Diagnostic Algorithm
4 Results
5 Conclusion
Target Parallel File Systems
Aim to support I/O-intensive applications
Provide high-bandwidth, concurrent access
Parallel File System Architecture
[Figure: architecture; clients connect over the network to I/O servers ios0…iosN and metadata servers mds0…mdsM]
One or more I/O and metadata servers
Clients communicate with every server
No server-server communication
Parallel File System Data Striping
[Figure: data striping]
Logical file: 0 1 2 3 4 5 …
Physical files:
Server 1: 0 3 6 …
Server 2: 1 4 7 …
Server 3: 2 5 8 …
Client stripes local file into 64 kB–1 MB chunks
Writes to each I/O server in round-robin order
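The round-robin striping above can be sketched as follows (a minimal illustration; the 64 kB chunk size and three-server count are assumptions drawn from the figure, not fixed by the file systems):

```python
CHUNK_SIZE = 64 * 1024  # assumed stripe unit; PVFS/Lustre allow 64 kB-1 MB
NUM_SERVERS = 3         # matches the three I/O servers in the figure

def chunk_to_server(offset):
    """Map a byte offset in the logical file to the I/O server storing it."""
    chunk_index = offset // CHUNK_SIZE
    return chunk_index % NUM_SERVERS  # round-robin placement

# Chunks 0..8 of the logical file land on servers 0,1,2,0,1,2,...
layout = [chunk_to_server(i * CHUNK_SIZE) for i in range(9)]
print(layout)  # [0, 1, 2, 0, 1, 2, 0, 1, 2]
```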
Parallel File Systems: Empirical Insights (I)
Server behavior is similar for most requests
Large requests are striped across all servers
Small requests, in aggregate, equally load all servers
Hypothesis: Peer-similarity
Fault-free servers exhibit similar performance metrics
Faulty servers exhibit dissimilarities in certain metrics
Peer-comparison of metrics identifies faulty node
Example: Disk-Hog Fault
[Figure: Sectors read/sec vs. elapsed time (s); the faulty server diverges from the non-faulty servers: peer-asymmetry]
Strongly motivates peer-comparison approach
Parallel File Systems: Empirical Insights (II)
Faults manifest asymmetrically only on some metrics
Ex: A disk-busy fault manifests…
Asymmetrically on latency metrics (↑ on faulty, ↓ on fault-free)
Symmetrically on throughput metrics (↓ on all nodes)
[Figure: I/O wait time (ms) vs. elapsed time (s); the faulty server diverges from the non-faulty servers: peer-asymmetry]
[Figure: Sectors read/sec vs. elapsed time (s); faulty and non-faulty servers overlap: no asymmetry]
Faults distinguishable by which metrics are peer-divergent
Outline
1 Introduction
2 Experimental Methods
3 Diagnostic Algorithm
4 Results
5 Conclusion
System Model
Fault Model:
Non-fail-stop problems
“Limping-but-alive” performance problems
Problems affecting storage & network resources
Assumptions:
Hardware is homogeneous, identically configured
Workloads are non-pathological (balanced requests)
Majority of servers exhibit fault-free behavior
Instrumentation
Sampling of storage & network performance metrics
Sampled from /proc once every second
Gathered from all server nodes
Storage-related metrics of interest:
Throughput: bytes read/sec, bytes written/sec
Latency: I/O wait time
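A minimal sketch of this once-per-second sampling, assuming the standard Linux /proc/diskstats layout (the device name and interval here are illustrative):

```python
import time

def parse_diskstats(lines, device):
    """Extract cumulative (sectors read, sectors written) for one device
    from /proc/diskstats-formatted lines (fields 6 and 10, 1-indexed)."""
    for line in lines:
        f = line.split()
        if len(f) > 9 and f[2] == device:
            return int(f[5]), int(f[9])
    return None

def sample_throughput(device="sda", interval=1.0):
    """Sample /proc/diskstats once per interval; yield per-second deltas."""
    with open("/proc/diskstats") as fh:
        prev = parse_diskstats(fh, device)
    while True:
        time.sleep(interval)
        with open("/proc/diskstats") as fh:
            cur = parse_diskstats(fh, device)
        yield ((cur[0] - prev[0]) / interval, (cur[1] - prev[1]) / interval)
        prev = cur
```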
Outline
1 Introduction
2 Experimental Methods
3 Diagnostic Algorithm
4 Results
5 Conclusion
Diagnostic Algorithm
Phase I: Node Indictment
Histogram-based approach (for most metrics)
Time series-based approach (congestion window)
Both use peer-comparison to indict faulty node
Phase II: Root-Cause Analysis
Ascribes the fault to a root cause based on which metrics are affected
Phase I: Node Indictment (Histogram-Based)
Peer-compare metric PDFs (histograms) across servers
Compute PDF of metric for each server over sliding window
Compute Kullback-Leibler divergence for each server pair
Flag pair anomalous if its divergence exceeds threshold
Flag server if over half of its server pairs are anomalous
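These four steps can be sketched in pure Python (a minimal illustration; the bin count, the symmetrized divergence, and the threshold value are assumptions, not the paper’s exact parameters):

```python
import math

def histogram(samples, lo, hi, bins=10):
    """Bin samples into a smoothed, normalized PDF over [lo, hi]."""
    counts = [1e-9] * bins  # smoothing avoids log(0) in the divergence
    width = (hi - lo) / bins or 1.0
    for x in samples:
        counts[min(int((x - lo) / width), bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]

def kl(p, q):
    """Kullback-Leibler divergence between two discrete PDFs."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def indict(windows, bins=10, threshold=0.5):
    """windows: one list of metric samples per server (one sliding window).
    Returns indices of servers flagged as faulty."""
    lo = min(min(w) for w in windows)
    hi = max(max(w) for w in windows)
    pdfs = [histogram(w, lo, hi, bins) for w in windows]
    n = len(pdfs)
    votes = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            # flag the pair anomalous if symmetrized divergence is too large
            if kl(pdfs[i], pdfs[j]) + kl(pdfs[j], pdfs[i]) > threshold:
                votes[i] += 1
                votes[j] += 1
    # indict a server if over half of its pairs are anomalous
    return [i for i in range(n) if votes[i] > (n - 1) / 2]
```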
Threshold Selection
Fault-free training session (stress test)
Run ddw, ddr, & postmark under fault-free conditions
Find minimum threshold that eliminates all anomalies
Histogram comparison uses per-server thresholds
Captures performance profile of each server
Important to train on each cluster & file system
Train on performance-stressing workloads only
Metrics deviate most when servers are saturated
Less intense workloads have better coupled behavior
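The threshold-selection step above amounts to taking, per server, the smallest value that would have flagged nothing during the fault-free run. A minimal sketch (the divergence values and the optional safety margin are hypothetical):

```python
def train_threshold(divergences, margin=1.0):
    """Smallest threshold that flags no pair as anomalous on the
    fault-free training run (margin > 1 would add slack)."""
    return max(divergences) * margin

# Hypothetical pairwise divergences observed per server during a
# fault-free stress test (server names follow the architecture figure):
training = {
    "ios0": [0.05, 0.12, 0.08],
    "ios1": [0.20, 0.11, 0.09],
}
thresholds = {s: train_threshold(d) for s, d in training.items()}
print(thresholds)  # {'ios0': 0.12, 'ios1': 0.2}
```

Per-server thresholds capture each server’s own performance profile, which is why training must be repeated on each cluster and file system.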
Example: PVFS Throughput (Disk-Hog Fault)
[Figure: Sectors read/sec vs. elapsed time (s), PVFS workload with and without disk-hog; the faulty server diverges from the non-faulty servers]
Throughput diverges due to disk-hog on faulty server
Summary
Problem diagnosis in parallel file systems
Illustrates use of OS-level metrics in diagnosis
Leverages peer-comparison to identify faulty nodes
Demonstrates root-cause analysis by metrics affected
Diagnosis method is applicable to existing deployments
Instrumentation is minimally invasive, low overhead
Fault-free training with stress tests
Peer-Comparison Scalability
Number of comparisons: n(n−1)/2 ⟹ O(n²)
Insight: Don’t need to compare one node against all
Proposed solution:
Establish n/k partitions of k servers each
Perform peer-comparisons among servers in each partition
Repartition with a different grouping for each window
Solution comparisons: (n/k) · k(k−1)/2 ⟹ O(n)
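A quick sanity check of the two comparison counts (a sketch assuming k divides n evenly):

```python
def all_pairs(n):
    """Full peer-comparison: every server against every other, O(n^2)."""
    return n * (n - 1) // 2

def partitioned(n, k):
    """Comparisons when servers are split into n/k partitions of k each,
    comparing only within partitions: O(n) for fixed k."""
    assert n % k == 0, "sketch assumes k divides n"
    return (n // k) * (k * (k - 1) // 2)

print(all_pairs(100), partitioned(100, 10))  # 4950 450
```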
Congestion Window Problem
No closely-coupled peer behavior
cwnd is intentionally noisy under normal conditions
Synchronized connections can’t fully use link capacity
Can’t compare histograms, too much variance
Congestion window packet-loss heuristic:
TCP responds to packet-loss by halving cwnd
Exponential decay after multiple loss events
Log scale: each loss results in linear cwnd decrease
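The heuristic above can be sketched as follows: since TCP halves cwnd on loss, log2(cwnd) drops by about 1 per loss event, so counting large downward steps on the log scale approximates loss events (the step threshold here is an illustrative assumption):

```python
import math

def count_loss_events(cwnd_trace, min_step=0.9):
    """Count multiplicative-decrease events in a congestion-window trace.
    TCP halves cwnd on loss, so log2(cwnd) drops by ~1 per loss event;
    flag downward steps of at least min_step (tunable assumption)."""
    logs = [math.log2(c) for c in cwnd_trace]
    return sum(1 for a, b in zip(logs, logs[1:]) if a - b >= min_step)

# Growth, two halvings (loss events), then recovery:
print(count_loss_events([8, 16, 32, 16, 8, 9, 10]))  # 2
```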
Time Series Comparison Example
[Figure: Segments (congestion window) vs. elapsed time (s), log scale, two side-by-side panels]
Time Series Comparison Example
[Figure: Segments (congestion window) vs. elapsed time (s), log scale, three stacked per-server panels]
Heterogeneous Hardware (ddr)
[Figure: I/O wait time (ms) vs. elapsed time (s) under the ddr workload; per-server curves differ]
Disks are the same model but have different performance profiles
Load Imbalances (postmark)
[Figure: Bytes received (B/s) vs. elapsed time (s) under the postmark workload; one metadata server receives far more traffic]
“/” resides on one metadata server, so all path lookups go there
Cross-Resource Influence (ddr)
[Figure: Segments (congestion window) vs. elapsed time (s) under the ddr workload; faulty server vs. non-faulty servers]
A disk-busy fault affects server cwnd through unintentional synchronization
Delayed ACKs (ddw)
[Figure: Bytes transmitted (B/s) vs. elapsed time (s) under the ddw workload; faulty server vs. non-faulty servers]
A packet-loss fault may also cause network throughput to deviate
Results: PVFS 10/10 Cluster
control diskhog diskbusy wnethog rnethog recvloss sendloss