• Sophisticated detectors for security, software bugs
– Track objects accessed, validate pointer accesses
– Require full-program analysis, changes to binary
• Bad addresses from HW faults are more obvious
– Invalid pages, unallocated memory, etc.
• Low-cost out-of-bounds detector
– Monitor boundaries of heap, stack, globals
– Address beyond these bounds ⇒ HW fault
- SW communicates boundaries to HW
- HW enforces checks on ld/st addresses
[Figure: app address space layout (app code, globals, heap, stack, libraries, plus empty and reserved regions) whose boundaries the detector monitors.]
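The check the hardware enforces amounts to a range comparison on every load/store address against the software-supplied region boundaries. A minimal software sketch of that logic (the region layout and interface below are hypothetical illustrations, not the actual HW/SW interface):

```python
# Sketch of the out-of-bounds check: SW registers the app's valid regions;
# "HW" raises a fault for any load/store address outside all of them.
# The base/limit values are made-up examples.

REGIONS = [  # (base, limit) pairs communicated by software
    (0x0040_0000, 0x0080_0000),  # code + globals
    (0x1000_0000, 0x1400_0000),  # heap
    (0x7F00_0000, 0x8000_0000),  # stack + libraries
]

def check_access(addr: int) -> bool:
    """Return True if addr falls inside some registered region."""
    return any(base <= addr < limit for base, limit in REGIONS)

def load(addr: int) -> None:
    """Model a ld whose address the hardware checks before use."""
    if not check_access(addr):
        raise RuntimeError(f"HW fault: out-of-bounds access at {addr:#x}")

load(0x1000_0040)          # in-bounds heap access: OK
try:
    load(0x9000_0000)      # beyond all regions: detected
except RuntimeError as e:
    print(e)
```

The point of the scheme is that this comparison is cheap enough to run on every memory access, unlike full pointer-tracking detectors.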
17
Impact of Out-of-Bounds Detector
• Lower potential SDC rate in server workloads
– 39% lower for permanents, 52% for transients
• For SPEC workloads, the impact is on detection latency
[Figure: stacked bars of injected-fault outcomes (Masked / Detect-OoB / Detect-Other / Potential SDC) for SWAT vs. OoB, permanents and transients. Server workloads: potential SDC 0.38% → 0.23% (permanents), 0.58% → 0.28% (transients). SPEC workloads: 0.67% → 0.63% (permanents), 0.65% → 0.65% (transients).]
18
Application-Aware SDC Analysis
• Potential SDC = undetected fault that corrupts app output
• But many applications can tolerate faults
– Client may detect fault and retry request
– Application may perform fault-tolerant computations
* E.g., same-cost place & route, acceptable PSNR, etc.
⇒ Not all potential SDCs are true SDCs
• For each application, define a notion of fault tolerance
• SWAT detectors cannot (should not?) detect such acceptable changes
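The tolerance test above can be made concrete as a thresholded comparison against the golden output. A sketch, assuming a simple max-relative-deviation metric (the metric and data below are illustrative; real applications would use domain-specific checks such as PSNR or place-and-route cost):

```python
# Sketch of application-aware SDC classification: a fault whose output
# degradation stays within the app's tolerance is not a true SDC, even
# though it slipped past the detectors. Metric and values are hypothetical.

def degradation(golden: list, faulty: list) -> float:
    """Max relative deviation of faulty output from the golden run, in %."""
    return max(
        abs(f - g) / abs(g) * 100 if g != 0 else abs(f - g) * 100
        for g, f in zip(golden, faulty)
    )

def is_true_sdc(golden, faulty, tolerance_pct: float) -> bool:
    """True only if the corruption exceeds the app-defined tolerance."""
    return degradation(golden, faulty) > tolerance_pct

golden = [1.00, 2.00, 4.00]
faulty = [1.00, 2.001, 4.00]                           # tiny corruption
print(is_true_sdc(golden, faulty, tolerance_pct=1.0))  # tolerated: False
```

Varying `tolerance_pct` gives the >0%, >0.01%, and >1% degradation thresholds used in the SPEC results later.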
19
Application-Aware SDCs for Server
• 46% of potential SDCs are tolerated by a simple retry
• Only 21 remaining SDCs out of 17,880 injected faults
– Most detectable through application-level validity checks
[Figure: number of SDCs for server workloads. Permanent faults: SWAT 34 (0.38%), SWAT+OoB 21 (0.23%), w/ app tolerance 12 (0.13%). Transient faults: SWAT 52 (0.58%), SWAT+OoB 25 (0.28%), w/ app tolerance 9 (0.10%).]
20
Application-Aware SDCs for SPEC
• Only 62 faults show >0% degradation from the golden output
• Only 41 injected faults are SDCs at >1% degradation
– 38 from apps we conservatively classify as fault-intolerant
* Chess-playing apps, compilers, parsers, etc.
[Figure: number of SDCs for SPEC at increasing output-degradation thresholds. Permanent faults: SWAT+OoB 56 (0.6%), >0% 16 (0.2%), >0.01% 11 (0.1%), >1% 8 (0.1%). Transient faults: SWAT+OoB 58 (0.6%), >0% 46 (0.5%), >0.01% 37 (0.4%), >1% 33 (0.4%).]
21
Reducing Potential SDCs Further (future work)
• Explore application-specific detectors
– Compiler-assisted invariants like iSWAT
– Application-level checks
• Need to fundamentally understand why and where SWAT works
– SWAT evaluation is largely empirical
– Build models to predict the effectiveness of SWAT
* Develop new low-cost symptom detectors
* Extract a minimal set of detectors for given sets of faults
* Analyze reliability vs. overhead trade-offs
22
Reducing Detection Latency: New Definition
• SWAT relies on checkpoint/rollback for recovery
• Detection latency dictates fault recoverability
– Checkpoint fault-free ⇒ fault recoverable
• Traditional defn. = arch state corruption to detection
• But software may mask some corruptions!
• New defn. = unmasked arch state corruption to detection
[Figure: timeline from fault to detection. The old latency starts at the first bad arch state; the new latency starts at the first bad SW state. Checkpoints taken before the corruption remain recoverable.]
23
Measuring Detection Latency
• New detection latency = SW state corruption to detection
• But identifying SW state corruption is hard!
– Need to know how the faulty value is used by the application
– If the faulty value affects output, then SW state is corrupted
• Measure latency by rolling back to older checkpoints
– Only for analysis; not required in a real system
[Figure: rollback-and-replay analysis. Starting from the symptom, execution is rolled back to successively older checkpoints and replayed; if the fault effect is masked on replay, SW state was not yet corrupt at that checkpoint, bounding the new latency.]
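The rollback-based measurement can be sketched as follows, with a hypothetical `replay()` oracle standing in for the simulator's replay-and-compare step (all numbers are made up for illustration):

```python
# Analysis-only sketch: find the newest checkpoint from which replay is
# clean; SW state corruption happened after it, bounding the new latency.

def measure_new_latency(checkpoints, detection_time, replay):
    """replay(chkpt) -> True if replaying from chkpt masks the fault effect.
    Returns an upper bound on the new detection latency, in instructions."""
    for chkpt in sorted(checkpoints, reverse=True):  # newest first
        if chkpt >= detection_time:
            continue
        if replay(chkpt):                 # replay matches golden output:
            return detection_time - chkpt # SW corruption after this chkpt
    return None                           # corrupt before oldest checkpoint

# Toy oracle: suppose SW state actually went bad at instruction 350,000.
sw_corruption = 350_000
replay = lambda chkpt: chkpt <= sw_corruption
print(measure_new_latency([100_000, 200_000, 300_000, 400_000],
                          detection_time=500_000, replay=replay))  # 200000
```

As the slide notes, this machinery is needed only offline for the latency study, not in a deployed system.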
24
Detection Latency - SPEC
[Figure: cumulative fraction of detected faults (40%–100%) vs. detection latency (<10K to >10M instructions); panels "Permanent Faults in Server" and "Transient Faults in Server". Series: Old Latency SWAT.]
25
Detection Latency - SPEC
[Figure: cumulative fraction of detected faults (40%–100%) vs. detection latency (<10K to >10M instructions); panels "Permanent Faults in Server" and "Transient Faults in Server". Series: New Latency SWAT, Old Latency SWAT.]
26
Detection Latency - SPEC
• Measuring the new latency is important for studying recovery
• New techniques significantly reduce detection latency
– >90% of faults detected in <100K instructions
• Reduced detection latency improves recoverability
[Figure: cumulative fraction of detected faults (40%–100%) vs. detection latency (<10K to >10M instructions); panels "Permanent Faults in Server" and "Transient Faults in Server". Series: New Latency out-of-bounds, New Latency SWAT, Old Latency SWAT.]
27
Detection Latency - Server
• Measuring the new latency is important for studying recovery
• New techniques significantly reduce detection latency
– >90% of faults detected in <100K instructions
• Reduced detection latency improves recoverability
[Figure: cumulative fraction of detected faults (40%–100%) vs. detection latency (<10K to >10M instructions); panels "Permanent Faults in Server" and "Transient Faults in Server". Series: New Latency out-of-bounds, New Latency SWAT, Old Latency SWAT.]
28
Implications for Fault Recovery
• Checkpointing
– Record pristine arch state for recovery
– Periodic register snapshots; log memory writes
• I/O buffering
– Buffer external events until known to be fault-free
– HW buffer records device reads, buffers device writes
• "Always-on" recovery support must incur minimal overhead
[Figure: recovery = checkpointing + I/O buffering.]
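The checkpointing scheme above (periodic register snapshot plus an undo log of memory writes) can be sketched in software; everything below is a hypothetical stand-in for the HW mechanisms:

```python
# Minimal sketch of checkpoint/rollback via register snapshot + memory
# undo log: every store in an interval logs the old value, so rollback
# can restore pre-checkpoint state by replaying the log in reverse.

class Checkpointed:
    def __init__(self):
        self.regs = [0] * 8
        self.mem = {}
        self.snapshot = None
        self.undo_log = []            # (addr, old_value) per logged store

    def checkpoint(self):
        self.snapshot = list(self.regs)  # snapshot architectural registers
        self.undo_log.clear()            # start a new logging interval

    def store(self, addr, value):
        self.undo_log.append((addr, self.mem.get(addr, 0)))  # log old value
        self.mem[addr] = value

    def rollback(self):
        for addr, old in reversed(self.undo_log):  # undo writes, newest first
            self.mem[addr] = old
        self.regs = list(self.snapshot)
        self.undo_log.clear()

m = Checkpointed()
m.store(0x10, 7)
m.checkpoint()
m.store(0x10, 99)        # writes in the (possibly faulty) interval
m.rollback()
print(m.mem[0x10])       # back to the pre-checkpoint value: 7
```

The log size grows with the checkpoint interval, which is why shortening the interval (next slides) directly shrinks the memory-log overhead.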
29
Overheads from Memory Logging
• New techniques reduce chkpt overheads by over 60%
– Chkpt interval reduced to 100K from millions of instrs.
[Figure: memory log size (KB, 0–2500) vs. checkpoint interval (10K–10M instructions) for apache, sshd, squid, mysql.]
30
Overheads from Output Buffering
• New techniques reduce output buffer size to near zero
– <5KB buffer for a 100K chkpt interval (buffering for 2 chkpts)
– Near-zero overheads at a 10K interval
[Figure: output buffer size (KB, 0–20) vs. checkpoint interval (10K–10M instructions) for apache, sshd, squid, mysql.]
31
Low Cost Fault Recovery (future work)
• New techniques significantly reduce recovery overheads
– 60% smaller memory logs, near-zero output buffer
• But they still do not enable ultra-low-cost fault recovery
– ~400KB HW overhead for memory logs in HW (SafetyNet)
– High performance impact for in-memory logs (ReVive)
• Need an ultra-low-cost recovery scheme at short intervals
– Even shorter latencies
– Checkpoint only the state that matters
– Application-aware insights: transactional apps, recovery domains for OS, …
Fault Diagnosis
• Symptom-based detection is cheap, but
– May incur long latency from activation to detection
– Difficult to diagnose the root cause of a fault
• Goal: diagnose the fault with minimal hardware overhead
– Rarely invoked ⇒ higher perf overhead acceptable
[Figure: a detected symptom may stem from a SW bug, a transient fault, or a permanent fault; diagnosis must distinguish them.]
SWAT Single-threaded Fault Diagnosis [Li et al., DSN ‘08]
• First, diagnosis for single-threaded workloads on one core
– Multithreaded w/ multicore later; several new challenges
• Key ideas
– Single-core fault model; fault-free core available in the multicore
– Chkpt/replay for recovery ⇒ replay on a good core, compare
– Synthesize DMR, but only for diagnosis
[Figure: traditional DMR runs P1 and P2 with comparison always on (expensive); synthesized DMR runs fault-free normally and invokes DMR on a second core only when a fault is detected.]
SW Bug vs. Transient vs. Permanent
• Rollback/replay on the same/different core
• Watch whether the symptom reappears
[Figure: diagnosis flow. On symptom detection, roll back on the faulty core: no symptom ⇒ transient or non-deterministic s/w bug, continue execution; symptom ⇒ deterministic s/w or permanent h/w bug. Then roll back/replay on a good core: symptom ⇒ deterministic s/w bug (send to s/w layer); no symptom ⇒ permanent h/w fault, needs repair!]
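The decision procedure in the diagram can be sketched as a two-step replay test; the replay oracles below are hypothetical stand-ins for actual re-execution under checkpoint/rollback:

```python
# Sketch of the rollback/replay diagnosis decision tree. Each oracle
# returns True if the symptom reappears on that replay.

def diagnose(replay_same_core, replay_good_core) -> str:
    if not replay_same_core():
        # Did not reproduce on the same core: not deterministic.
        return "transient or non-deterministic s/w bug"
    # Deterministic: s/w bug or permanent h/w fault. Try a good core.
    if replay_good_core():
        # Follows the software to a good core: it's in the software.
        return "deterministic s/w bug (send to s/w layer)"
    # Disappears on a good core: the original core is broken.
    return "permanent h/w fault, needs repair"

# A fault that reappears on the faulty core but not on a good one:
print(diagnose(lambda: True, lambda: False))
```

The expensive second replay runs only after a symptom fires, which is why the scheme tolerates higher per-invocation overhead.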
µarch-level Fault Diagnosis
[Figure: a detected symptom enters diagnosis, which classifies it as a software bug, a transient fault, or a permanent fault; permanent faults proceed to microarchitecture-level diagnosis, which reports "Unit X is faulty".]
Trace-Based Fault Diagnosis (TBFD)
• µarch-level fault diagnosis using rollback/replay
• Diagnose faults to the µarch units of the processor
– Check µarch-level invariants in several parts of the processor
– Diagnosis in out-of-order logic (meta-datapath) is complex
Trace-Based Fault Diagnosis: Evaluation
• Goal: diagnose faults at reasonable latency
• Faults diagnosed in 10 SPEC workloads
– ~8,500 detected faults (98% of unmasked)
• Results
– 98% of detections successfully diagnosed
– 91% diagnosed within 1M instrs. (~0.5ms on a 2GHz processor)
SWAT Multithreaded Fault Diagnosis [Hari et al., MICRO ‘09]
• Challenge 1: deterministic replay involves high overhead
• Challenge 2: multithreaded apps share data among threads
– The symptom-causing core may not be faulty
– No known fault-free core in the system
[Figure: a fault in Core 1 corrupts a value stored to memory; Core 2 loads it and triggers symptom detection on a fault-free core.]
mSWAT Diagnosis - Key Ideas
• Challenges
– Multithreaded applications
– Full-system deterministic replay
– No known good core
• Key ideas
– Isolated deterministic replay
– Emulated TMR
[Figure (build): threads TA–TD of a multithreaded app run on cores A–D; each thread's execution is captured and deterministically replayed in isolation, and threads are re-executed under rotated core assignments (e.g., TD TA TB TC, then TC TD TA TB) to emulate TMR without a known good core.]
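The emulated-TMR idea can be sketched as follows, with a hypothetical `run()` oracle standing in for isolated deterministic replay of a thread on a given core:

```python
# Sketch of mSWAT's emulated TMR: each thread is re-executed on two other
# cores (rotated assignments) and the three executions are compared; a core
# that produces the minority trace is diagnosed as faulty.

def emulated_tmr(n_cores, run):
    """run(thread, core) -> execution trace. Returns the faulty core, or
    None if all triples agree."""
    votes = {c: 0 for c in range(n_cores)}
    for t in range(n_cores):                 # thread t originally on core t
        cores = [t, (t + 1) % n_cores, (t + 2) % n_cores]  # rotated replays
        traces = [run(t, c) for c in cores]
        for c, trace in zip(cores, traces):
            # A trace matching fewer than 2 of the 3 (including itself)
            # is the minority: vote against the core that produced it.
            if sum(trace == other for other in traces) < 2:
                votes[c] += 1
    faulty = [c for c, v in votes.items() if v > 0]
    return faulty[0] if faulty else None

# Toy model: core 2 corrupts every trace it produces.
run = lambda t, c: f"trace{t}" if c != 2 else f"bad{t}"
print(emulated_tmr(4, run))    # identifies core 2
```

No core is trusted a priori: the majority vote across three cores substitutes for the known-good core that single-threaded SWAT diagnosis assumed.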
mSWAT Diagnosis: Evaluation
• Diagnose detected permanent faults in multithreaded apps
– Goal: identify the faulty core; TBFD for µarch-level diagnosis
– Challenges: non-determinism, no fault-free core known
– ~4% of faults detected from a fault-free core
• Results
– 95% of detected faults diagnosed
* All detections from a fault-free core diagnosed
– 96% of diagnosed faults require <200KB buffers
* Can be stored in a lower-level cache ⇒ low HW overhead
• SWAT diagnosis can work with other symptom detectors
Summary: SWAT works!
• In-situ diagnosis [DSN '08]
• Very low-cost detectors [ASPLOS '08, DSN '08]
– Low SDC rate, low latency