New challenges for designers of fault tolerant Embedded Systems based on future technologies
Carlos Arthur Lang Lisbôa, Luigi Carro
Instituto de Informática, Programa de Pós-Graduação em Computação
Universidade Federal do Rio Grande do Sul - Porto Alegre, RS, Brazil
IESS - Schloß Langenargen, Germany - September 15th, 2009
Outline
• Introduction: concepts and definitions
• Motivation: new challenges imposed by future technologies
• Radiation induced faults: the major challenges
• Existing mitigation techniques vs. the new scenario
• Desired properties of new radiation induced faults mitigation techniques
• Recent solutions working at different abstraction levels to deal with transient faults
• Conclusions
Concepts and Definitions
• Faults
• Errors
• Failures
• Duration of errors and faults
o Permanent
o Transient
o Intermittent
Technology trends (1)
Device sizes are decreasing
• Transistor size
Node capacitances are decreasing
Technology trends (2)
• Power supply
• Transistor Vth (threshold voltage)
Node voltages are decreasing
Single event upset
A transistor changes from OFF to ON state!
SEE and Technology trends (1)
• Consequences of C and V reduction
HIGH C + HIGH V → HIGH Q = C.V
SEE and Technology trends (2)
• Consequences of C and V reduction
LOW C + LOW V → LOW Q = C.V
Concepts and Definitions
• Radiation Induced Faults
o Single Event Effects - SEEs
o Single Event Transients - SETs
o Single Event Upsets - SEUs
o Soft Error - SE
o Multiple Bit Upsets - MBUs
• Soft Error Rate - SER
The Soft Error Problem
Single Event Upset (SEU)
[Figure: flip-flops with D, Q and CLK signals, holding the values 0 and 1]
The Soft Error Problem
[Figure: Transient Fault, Soft Error]
Concepts and Definitions
• Masking of faults and errors
o Logical
o Latching window
o Electrical
o Architectural
o Software
Example of Fault Masking in Microprocessors [Blome et al, CASES, 2006]
• Logical: faulty value does not affect logical operation of the circuit
• Latching-Window: the fault pulse does not reach a state element within the latching window (tsetup, thold around the CLK edge)
• Electrical: the fault pulse is electrically attenuated by subsequent gates in the circuit
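The first two masking effects can be illustrated with a small sketch. It is only a toy model: the gate, the function name `pulse_is_latched`, and all timing values (in arbitrary time units) are assumptions chosen for illustration, not data from [Blome et al, CASES, 2006].

```python
def and_gate(a: int, b: int) -> int:
    """Two-input AND gate."""
    return a & b


def pulse_is_latched(pulse_start, pulse_end, clk_edge, t_setup, t_hold):
    """True when the fault pulse overlaps the latching window
    [clk_edge - t_setup, clk_edge + t_hold] of a state element."""
    return pulse_start < clk_edge + t_hold and pulse_end > clk_edge - t_setup


# Logical masking: the other input is 0, so a flipped input cannot reach the output.
fault_free = and_gate(1, 0)
faulty = and_gate(0, 0)          # first input flipped by a transient
logically_masked = fault_free == faulty

# Latching-window masking: hypothetical clock edge at t=500, setup 30, hold 20.
early_pulse_latched = pulse_is_latched(100, 180, 500, 30, 20)   # ends before the window
window_pulse_latched = pulse_is_latched(460, 540, 500, 30, 20)  # covers the window
```

In this model a transient becomes a soft error only if it is neither logically masked nor outside the latching window.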
Example of Fault Masking in Microprocessors [Blome et al, CASES, 2006]
• Architectural/Software: incorrect state is overwritten before it is read
Program:
mov r2, 4
mov r5, 8
add r6, r2, r5
Register File (decoder selects registers 0 to 5), step by step:
------   initial state
--4---   after mov r2, 4
--4--9   register 5 holds an incorrect value (9)
--4--8   mov r5, 8 overwrites the incorrect value before add r6, r2, r5 reads it
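The register-file animation above can be replayed with a minimal interpreter. This is a sketch under assumptions: the `run` helper and the `fault` tuple format are hypothetical, and only the three instructions shown on the slide are modelled.

```python
def run(program, fault=None):
    """Execute a tiny register program.

    fault = (step, register, value): the register is corrupted just
    before the instruction at index `step` executes.
    """
    regs = {}
    for step, instr in enumerate(program):
        if fault is not None and fault[0] == step:
            regs[fault[1]] = fault[2]          # inject the incorrect state
        if instr[0] == "mov":
            _, dst, imm = instr
            regs[dst] = imm
        elif instr[0] == "add":
            _, dst, a, b = instr
            regs[dst] = regs[a] + regs[b]
    return regs


PROGRAM = [("mov", "r2", 4), ("mov", "r5", 8), ("add", "r6", "r2", "r5")]

golden = run(PROGRAM)
masked = run(PROGRAM, fault=(1, "r5", 9))    # corrupted, then overwritten by mov r5, 8
exposed = run(PROGRAM, fault=(2, "r5", 9))   # corrupted after the write, before the read
```

The same corrupted value is harmless in the first case and produces a wrong `r6` in the second, which is why the timing of the fault relative to the write matters.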
Motivation: Future Technologies
• The good news:
o Smaller devices → Denser circuits, less area
o Faster devices → Higher performance
o Less power consumption → Longer battery life (portable systems)
• The bad news:
o Higher defect rates → Lower yield
o Higher sensitivity to radiation
  → Increased SER: combinational logic
  → Multiple simultaneous faults
  → Long duration transients
Major Challenges
• Long Duration Transients (LDTs)
Different paces in transient widths vs. device speed scaling will lead to transient pulses lasting longer than cycle times of circuits. Temporal redundancy techniques will not cope.
• Multiple Simultaneous Faults
Smaller distances between devices will allow a single particle to affect more than one device. The single fault model will fail.
Transient width studies
[Figures from DODD, 2004 and FERLET-CAVROIS, 2006]
Propagation delay(*) vs. Technologies
Technology (nm)      180     130     90      32      180/32
10-inverter chain    508.4   157.8   120.2   79.6    6.39
[Figure: inverter chain between clocked stages (in, out, clk); curves for 32 nm, 90 nm, 130 nm and 180 nm]
(*) simulated using parameters from PTM web site and HSPICE tool
Transient widths vs. Propagation delays
[Figure: Cycle time and transient width scaling across technologies. x-axis: Technology (180nm, 130nm, 100nm, 90nm, 70nm, 32nm); y-axis: Cycle time (ps), 0 to 600; series: Width 20MeV, Width 10MeV (*), Cycle 10 Inv, Cycle 8 Inv, Cycle 6 Inv, Cycle 4 Inv. Annotations: transient width scaling max. 1.37 x; 6.39 x]
(*) 180, 130, and 100nm from [DODD, 2004], 70 nm from [Ferlet-Cavrois 2006]
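The two scaling factors can be put side by side with a short calculation. The delay values are the 10-inverter chain figures from the table; treating the 1.37 x annotation as the width reduction over a comparable technology range is an assumption of this sketch.

```python
# 10-inverter chain propagation delays from the table (180 nm and 32 nm columns).
delay_180nm = 508.4
delay_32nm = 79.6

cycle_scaling = delay_180nm / delay_32nm      # how much faster the logic gets
width_scaling = 1.37                          # max. transient width reduction (figure annotation)

# Growth of the transient width when measured in cycle times.
relative_growth = cycle_scaling / width_scaling

print(f"cycle time shrinks {cycle_scaling:.2f}x, transient width only {width_scaling:.2f}x")
print(f"a transient covers about {relative_growth:.1f}x more of a cycle")
```

Under that assumption, a pulse that covered a small fraction of a cycle in the older technology covers several times more of it in the newer one, which is the origin of long duration transients.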
Single event, multiple effects [Rossi 2005 *]
[*] Multiple Transient Faults in Logic: An Issue for Next Generation ICs ?, Daniele Rossi et al, DFT 2005
LDT Effects on Temporal Redundancy
• Time Redundancy [Anghel et al, 2000]
Increase delay ? ⇒ Higher performance penalty !!!
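A simplified model shows why a long transient defeats this kind of scheme. The sketch assumes time redundancy is implemented by sampling the same output twice, `delta` time units apart, and comparing the samples; the function names and all timing values are hypothetical and do not reproduce the circuit of [Anghel et al, 2000].

```python
def signal_value(t, correct, pulse_start, pulse_width):
    """Output of a combinational block at time t; inverted while the transient lasts."""
    in_pulse = pulse_start <= t < pulse_start + pulse_width
    return correct ^ 1 if in_pulse else correct


def time_redundancy_detects(sample_time, delta, pulse_start, pulse_width, correct=1):
    """Sample twice, delta apart; a mismatch between the samples signals an error."""
    first = signal_value(sample_time, correct, pulse_start, pulse_width)
    second = signal_value(sample_time + delta, correct, pulse_start, pulse_width)
    return first != second


# Short transient (width 20 < delta 50): the second sample is clean, so it is detected.
short_detected = time_redundancy_detects(100, 50, pulse_start=95, pulse_width=20)

# Long duration transient (width 200 > delta 50): both samples are wrong and agree.
long_detected = time_redundancy_detects(100, 50, pulse_start=95, pulse_width=200)
```

Making `delta` larger than the longest expected pulse restores detection in this model, at the price of the extra delay the slide warns about.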
LDT Effects on Space Redundancy
• Space Redundancy [Nieuwland et al, 2006]
Can not cope with long duration transients !!!
LDT Effects on Space Redundancy
- DMR can cope with LDTs affecting one of the modules
- allows detection only, requires recomputation
- area and power overheads above 100% (too much for Embedded Systems)
- weak point: comparator
LDT Effects on Space Redundancy
- TMR can cope with LDTs affecting one of the modules
- allows detection and correction
- area and power overheads above 200% (too much for Embedded Systems)
- weak point: voter
Multiple simultaneous errors [Sorin 2009 *]
• It is an interesting open problem.
• If forecasts of greatly increased fault rates come to pass, error detection schemes targeting single error scenarios may be insufficient.
• Most current schemes assume a single error scenario.
• Some existing schemes may do well, but there are no results demonstrating that capability.
[*] Fault Tolerant Computer Architecture, Daniel J. Sorin, Morgan & Claypool, 2009
Multiple Effects vs. Space Redundancy
- DMR: what if a single particle affects two modules ?
- different output bits affected (O1i, O2j) → OK
- same output bit affected (O1k, O2k) → PROBLEM ! Comparator will not detect error
Multiple Effects vs. Space Redundancy
- TMR: what if a single particle affects two modules ?
- different output bits affected (O1i, O2j) → no majority !
- same output bit affected (O1k, O2k) → EVEN WORSE → Voter will select erroneous output !
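The four cases listed for DMR and TMR can be checked with a small sketch. It assumes a word-level comparator and a word-level majority voter (a voter that compares whole output words); the 4-bit output value and helper names are made up for illustration.

```python
def flip(word: int, bit: int) -> int:
    """Return `word` with one bit inverted (models a particle hit on a module output)."""
    return word ^ (1 << bit)


def dmr_detects(o1: int, o2: int) -> bool:
    """DMR comparator: reports an error when the two module outputs differ."""
    return o1 != o2


def tmr_vote(o1: int, o2: int, o3: int):
    """Word-level majority voter; returns None when no two outputs agree."""
    if o1 == o2 or o1 == o3:
        return o1
    if o2 == o3:
        return o2
    return None


correct = 0b1010

# Single fault: both schemes behave as intended.
single_dmr = dmr_detects(flip(correct, 0), correct)
single_tmr = tmr_vote(flip(correct, 0), correct, correct)

# One particle hits two modules, different output bits.
diff_bits_dmr = dmr_detects(flip(correct, 0), flip(correct, 1))
diff_bits_tmr = tmr_vote(flip(correct, 0), flip(correct, 1), correct)

# One particle hits two modules, same output bit.
same_bit_dmr = dmr_detects(flip(correct, 1), flip(correct, 1))
same_bit_tmr = tmr_vote(flip(correct, 1), flip(correct, 1), correct)
```

With two modules corrupted in the same bit, the comparator sees agreement and the voter finds a majority for the wrong word.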
Analysis
• Currently known mitigation techniques based on temporal redundancy can not cope with LDTs.
• Space redundancy based mitigation techniques:
- able to cope with LDTs;
- may fail when subject to multiple faults;
- impose very high area and power overheads;
- not suited for the Embedded Systems arena.
• The development of new low cost techniques to face those new challenges is mandatory.
Desired properties of new approaches
• Tolerance to LDTs and multiple simultaneous faults.
• Error detection area overhead << DMR
• Error correction area overhead << TMR
• Low performance overhead
• Additional concern for Embedded Systems: low power consumption
Suggested approach
Work at higher abstraction levels with low cost
System Level
Algorithm Level
Architecture Level
Circuit Level
Component Level
Technology Level
“Computer users do not notice if a transistor fails or a bit of SRAM is flipped by a cosmic ray; they notice when their programs crash” [Sorin, 2009]
Recently proposed solutions (1 of 6)
Combinational Hamming
Working at circuit level with low cost to cope with increased SER in combinational logic
[Abstraction level stack: System Level, Algorithm Level, Architecture Level, Circuit Level, Component Level, Technology Level]
SER evolution [*]
[*] Baumann, R., “Soft Errors in Advanced Computer Systems”, IEEE Design and Test of Computers, vol. 22, no. 3, IEEE Computer Society, New-York-London, May-June 2005, pp 258-266.
SER Trend: Latches & Chip impact
[Figure: SER Trend: Full Chip. x-axis: Technology (nm): 180, 130, 90, 65, 45, 32; y-axis: SER normalized to 130 nm, 1 to 10; series: logic, cache arrays. Source: Intel Barcelona]
Combinational Hamming
Conventional Hamming applications:
- data storage and communications hardening
- number of inputs = number of outputs
Combinational logic: number of inputs ≠ number of outputs
Combinational Hamming
Hamming codeword for 4-output circuits
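For reference, a Hamming codeword over four data bits can be sketched as the classic Hamming(7,4) code. This is only the textbook code applied to four output bits: how Combinational Hamming derives the check bits for a circuit is not shown in this transcript, so the check bits here are simply computed from the outputs, and all function names are hypothetical.

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into 7 bits at positions 1..7:
    p1 p2 d1 p3 d2 d3 d4 (parity bits at positions 1, 2 and 4)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4      # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4      # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4      # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(cw):
    """Return (data bits after correction, position of the flipped bit or 0)."""
    s1 = cw[0] ^ cw[2] ^ cw[4] ^ cw[6]
    s2 = cw[1] ^ cw[2] ^ cw[5] ^ cw[6]
    s3 = cw[3] ^ cw[4] ^ cw[5] ^ cw[6]
    position = s1 + 2 * s2 + 4 * s3        # syndrome = 1-based error position
    fixed = list(cw)
    if position:
        fixed[position - 1] ^= 1
    return [fixed[2], fixed[4], fixed[5], fixed[6]], position


outputs = [1, 0, 1, 1]                      # hypothetical 4-output circuit result
codeword = hamming74_encode(outputs)
corrupted = list(codeword)
corrupted[4] ^= 1                           # single upset at position 5
recovered, where = hamming74_correct(corrupted)
```

Any single flipped bit in the seven-bit word, data or check, is located by the syndrome and corrected.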
Freivalds’ technique [*]
Vector r: random 0’s and 1’s
[Figure: n × n matrices A, B and C, with A × B ⇒ C; vector r (r1 ... rn); product vectors Cr (Cr1 ... Crn), Ar (Ar1 ... Arn) and ABr (ABr1 ... ABrn)]
If Cr = ABr, OK, otherwise, ERROR
[*] Freivalds, R. 1979. Fast probabilistic algorithms. In Mathematical Formulations of CS. Lecture Notes in Computer Science, vol. 74. Springer-Verlag, New York, pp. 57–69.
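Freivalds' check can be sketched in a few lines of plain Python. The sketch uses the standard formulation, comparing A × (B × r) with C × r for a random 0/1 vector r; the helper names and the `rounds` parameter are choices made here, not taken from the slides.

```python
import random


def mat_vec(m, v):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def mat_mul(a, b):
    """Plain O(n^3) matrix product, used only to build test data."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def freivalds_check(a, b, c, rounds=1, rng=random):
    """Return True when every round finds A(Br) == Cr for a random 0/1 vector r.
    Each round costs three matrix-vector products instead of a full multiplication."""
    n = len(c)
    for _ in range(rounds):
        r = [rng.randint(0, 1) for _ in range(n)]
        if mat_vec(a, mat_vec(b, r)) != mat_vec(c, r):
            return False
    return True


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = mat_mul(A, B)
C_bad = [row[:] for row in C]
C_bad[0][1] += 1                 # one corrupted element
```

A single round can miss an error (for instance when r is all zeros), so the check is probabilistic; repeating it with fresh vectors drives the miss probability down quickly.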
Basic subject technique [*]
• The main difference w.r.t. Freivalds’ technique is that here the vector r has only 1’s.
• This means that to calculate Ar and Cr only additions are needed, no multiplications.
• The computational cost of verification is thereby significantly decreased.
[*] Lisbôa, C. A., Erigson, M. I., and Carro, L., “System level approaches for mitigation of long duration transient faults in future technologies”, in Proceedings of the 12th IEEE European Test Symposium - ETS 2007, pp. 165-170, IEEE Computer Society, Los Alamitos, CA, May 2007.
Basic subject technique
[Figure: n × n matrices A, B and C, with A × B ⇒ C; sum vectors Cr (Cr1 ... Crn) and Ar (Ar1 ... Arn); product vector ABr (ABr1 ... ABrn)]
Cr_i = Σ C_ik, k = 1...n
Ar_i = Σ A_ik, k = 1...n
If Cr = ABr, OK, otherwise, ERROR
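With r fixed to all ones, the check reduces to comparing sums. The sketch below relies on the identity C·1 = A·(B·1), so the intermediate vector is built from the row sums of B; the slide labels its intermediate vector Ar, which may correspond to a different operand order, so the names here are assumptions.

```python
def row_sums(m):
    """Product of a matrix with the all-ones vector: only additions."""
    return [sum(row) for row in m]


def mat_vec(m, v):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def all_ones_check(a, b, c):
    """True when the row sums of C equal A x (row sums of B)."""
    return row_sums(c) == mat_vec(a, row_sums(b))


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[19, 22], [43, 50]]              # A x B

single_error = [[20, 22], [43, 50]]   # one element off by +1
compensating = [[20, 21], [43, 50]]   # +1 and -1 in the same row
```

Because r is no longer random, two errors that cancel within one row leave the row sum unchanged and escape this check.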
Extended Subject Technique [*]
• compute vectors Br and BrT (only sums)
• compute vectors ABr = A × Br and ABrT = A × BrT
• compute vectors Cr and CrT (only sums)
• Verification:
  • If ABr = Cr AND ABrT = CrT, then NO ERROR
  • Otherwise (ABr != Cr or ABrT != CrT): ERROR
[Figure: n × n matrices A, B and C; Br, ABr and Cr drawn as column vectors (Br1 ... Brn, ABr1 ... ABrn, Cr1 ... Crn); BrT, ABrT and CrT drawn as row vectors (BrT1 ... BrTn, ABrT1 ... ABrTn, CrT1 ... CrTn)]
[*] Lisboa, C.; Argyrides, C.; Pradhan, D.; and Carro, L., “Algorithm Level Fault Tolerance: a Technique to Cope with Long Duration Transient Faults in Matrix Multiplication Algorithms”, in Proceedings of the 26th IEEE VLSI Test Symposium (VTS 2008), San Diego, CA, USA, April 2008.
Extended Subject Technique - Example
C =
   -2082   -3582   11793
    2160     -61   13645
    2280    3222   -2565
Cr   = (6129, 15744, 2937)
ABr  = (6129, 9637, 2937)       second element: Cr != ABr
CrT  = (2358, -421, 22873)
ABrT = (2358, -6528, 22873)     second element: CrT != ABrT
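The numbers on the example slide can be replayed to see how the two checks pinpoint a single faulty element. The matrix layout below is reconstructed from the slide's values so that its row and column sums match Cr and CrT; correcting the element by subtracting the mismatch is an assumption of this sketch, as is the function name.

```python
C = [[-2082, -3582, 11793],
     [2160, -61, 13645],
     [2280, 3222, -2565]]
ABR = [6129, 9637, 2937]          # expected row sums
ABRT = [2358, -6528, 22873]       # expected column sums


def locate_single_error(c, abr, abrt):
    """Return None if both checks pass, else (row, col, corrected value)
    for a single faulty element; indices are 0-based."""
    cr = [sum(row) for row in c]
    crt = [sum(col) for col in zip(*c)]
    bad_rows = [i for i, (x, y) in enumerate(zip(cr, abr)) if x != y]
    bad_cols = [j for j, (x, y) in enumerate(zip(crt, abrt)) if x != y]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        return i, j, c[i][j] - (cr[i] - abr[i])
    raise ValueError("mismatch pattern is not a single faulty element")


result = locate_single_error(C, ABR, ABRT)
```

The mismatching row and the mismatching column intersect at the faulty element, and both mismatches have the same magnitude (6107), which is consistent with a single corrupted value.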
Results: Verification Cost
Total Verification Cost (# of add equivalent operations)
n    Multiplication    Freivalds    Subject    Extended
2    36                58           26         52
Working at algorithm level with low cost for runtime error detection
... for Runtime Error Detection
[Abstraction level stack: Algorithm Level, Architecture Level, Circuit Level, Component Level, Technology Level]
Goal
• Achieve tolerance to long duration transient pulses
• at algorithmic level
• with low performance overhead
• in an automatic fashion
• generalized to other algorithms
Alternative approaches
• Software based error detection techniques
• Duplication with Comparison: increases memory usage and execution time. [Rebaudengo et al, 1999]
• Self Checking Block Signatures: imposes coding and performance penalties. [Goloubeva et al, 2003]
• Use of object oriented languages and libraries in some approaches leads to increased memory footprint and requires source code modification. [Benso, 2005]
Alternative approaches
• An algorithm level technique is proposed in [Lisboa, 2007] for matrix multiplication hardening
  • Far less computational cost than recompute and compare (32x32 matrix - only 4.97% time increase).
  • Explores algorithm properties: conditions that hold after the execution of the algorithm - known as program invariants or post conditions - are checked.
IDEA: Use algorithm properties as a means for run-time error detection.
Subject technique
• Invariants
  • Properties that always hold during program execution:
    • Pre-conditions
    • Post-conditions
    • Loop invariants
  • Usually used in the software engineering arena, to check if a program performs its tasks as expected after maintenance.
Subject technique
• Daikon Tool [Ernst et al, 2001]
  • Automatically detects potential invariants for a given program.
  • Identification of a testable set of invariants feasible for small programs.
  • Linear relationships between up to 3 variables.
  • Low support to complex data structures.
Methodology
• Fault injection campaigns
• Main program is divided into smaller, less complex, pieces of code.
• Daikon is used to extract the invariants of each part.
• Verification code is appended after the algorithm code.
[Figure: the program body in main() is decomposed into program slices; an invariant detector extracts the invariants of each slice; verification code is then included]
Methodology
• Fault coverage and performance evaluation
[Figure: modified code, where main() contains each program slice followed by its verification. Performance evaluation produces a timing report. Fault coverage evaluation runs numbered steps 1 to 6: Generate Reference (1), Random Fault Setup, Fault Injection, Check Detection, a loop test "F times?" (Yes/No), and Analysis Report (6)]
• Reference and execution results are compared.
• Comparison of results is confronted with verification flag.
Luigi Carro IESS - Schloβ Langenargen, Germany, September 15th, 2009 86
• Statistical analysis with report generation.
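The evaluation loop can be sketched as a small fault injection campaign. Everything specific here is an assumption made for illustration: the target (an iterative multiplication), the single bit-flip fault model, and the deliberately weak check "the result is a multiple of x" standing in for automatically extracted invariants.

```python
import random


def multiply(x, k, fault=None):
    """Iterative multiplication; fault = (iteration, bit) flips one
    accumulator bit right after that iteration."""
    acc = 0
    for i in range(k):
        acc += x
        if fault is not None and fault[0] == i:
            acc ^= 1 << fault[1]
    return acc


def campaign(runs, x=4, k=5, seed=0):
    """Inject one random fault per run and classify the outcome."""
    rng = random.Random(seed)
    reference = multiply(x, k)                            # generate reference
    stats = {"masked": 0, "detected": 0, "undetected": 0}
    for _ in range(runs):
        fault = (rng.randrange(k), rng.randrange(8))      # random fault setup
        result = multiply(x, k, fault)                    # fault injection
        flagged = result % x != 0                         # verification flag
        if result == reference:
            stats["masked"] += 1
        elif flagged:
            stats["detected"] += 1
        else:
            stats["undetected"] += 1                      # wrong result, no flag
    return stats


stats = campaign(runs=200)
coverage = stats["detected"] / (stats["detected"] + stats["undetected"])
```

Comparing the result with the reference tells whether an error occurred, and confronting that with the flag tells whether the verification code caught it; the weak check used here misses every flip of bit 2 or higher, so the reported coverage stays well below 100%.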
Experimental results and analysis
• The subject methodology was applied to a test program, split into 5 code pieces:
  • Evaluation of the Baskara (quadratic) formula ( domain ).
  • Iterative integer multiplication.
  • Conditional statement execution.
  • Arithmetic expression evaluation.
  • Square root calculation.
Experimental results and analysis
Test case program

/* baskara() */
x1=-1.1;
x2=-1.1;
if (a==0 && b!=0){
    x1=-c/b;
    x2=x1;
}
else {
    ...

/* mult() */
while(k2>0){
    if ((k2%2)==0){
        k2/=2;
        x2+=x2;
    }
    else{
        k2--;
        m2+=x2;
    }
}

/* biggerminus() */
if(m1>m2){
    bg=m1-m2;
}
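The `mult()` slice lends itself to a loop invariant check. The sketch below ports the loop to Python and checks `m2 + x2*k2 == x*k` after every iteration; this invariant is chosen here for illustration and is not claimed to be what Daikon reported for the original program. The `flip` parameter is a hypothetical fault model.

```python
def mult(x, k, flip=None):
    """Multiply x by k with the halving/doubling loop of the test program.

    flip = (iteration, bit): flip one bit of m2 after that iteration.
    Returns (result, error_detected).
    """
    x2, k2, m2 = x, k, 0
    detected = False
    iteration = 0
    while k2 > 0:
        if k2 % 2 == 0:
            k2 //= 2
            x2 += x2
        else:
            k2 -= 1
            m2 += x2
        if flip is not None and flip[0] == iteration:
            m2 ^= 1 << flip[1]                 # injected upset
        # Loop invariant: work done plus work remaining equals the product.
        if m2 + x2 * k2 != x * k:
            detected = True
        iteration += 1
    return m2, detected


clean = mult(7, 13)
faulty = mult(7, 13, flip=(3, 2))
```

Checking the invariant with a multiplication is as expensive as recomputing, so a practical check would use cheaper properties; the sketch only shows where verification code attaches to the slice.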
• Provides a low cost error detection mechanism, when invariants are detected.
• Better performance using program slices.
• Coverage still low.
• Coding style to enhance detection.
• Lack of automatic tools to handle complex data structures.
• Automatic generation of invariants is still a bottleneck.
Recently proposed solutions (4 of 6)
SIFT - Software Implemented Fault Tolerance
Working at software level
[Abstraction level stack: System Level, Algorithm Level, Architecture Level, Circuit Level, Component Level, Technology Level]
Data-oriented Approaches
• Provide a solution for tolerating the effects of faults affecting the data the program manipulates
SWIFT
• Introduced by Rebaudengo, Politecnico di Torino, Italy
• Used for hardening any operation among variables
• Based on automatic algorithm-level modifications that introduce information (duplication code) and time redundancies
[Violante, M. Politecnico di Torino, 2006]
SWIFT
Basic principle:
• Each variable must be replicated two times
• Each operation among variables must be replicated two times
• After every usage of a variable, its two replicas must be checked for consistency
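The three rules can be applied by hand to a single addition. This is a hand-written sketch of the duplication idea only: the function name, the `fault_in_replica` switch, and the returned error flag are assumptions, and the automatic transformation described by the slides is not reproduced.

```python
def hardened_add(a, b, fault_in_replica=False):
    """Compute a + b with duplicated variables and operations.

    Returns (result, error_detected)."""
    # Rule 1: each variable is replicated.
    a0, a1 = a, a
    b0, b1 = b, b
    if fault_in_replica:
        a1 ^= 1                      # hypothetical upset in one replica of a

    error = False
    # Rule 3: replicas are checked for consistency when the variables are used.
    error |= a0 != a1
    error |= b0 != b1
    # Rule 2: the operation itself is replicated.
    c0 = a0 + b0
    c1 = a1 + b1
    error |= c0 != c1
    return c0, error


clean = hardened_add(2, 3)
upset = hardened_add(2, 3, fault_in_replica=True)
```

Memory use and instruction count roughly double for the hardened part, which is the information and time redundancy the slide mentions.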
SWIFT
Success-stories:
• Motorola 68040
• Intel 8051
• IBM PowerPC
• Gaisler LEON1/LEON2
Fault models:
• SEUs
• SETs
[Violante, M. Politecnico di Torino, 2006]
ED4I
• Introduced by McCluskey, Stanford University, USA
• Used for hardening any operation among variables
• Based on algorithm-level modifications that introduce time redundancies (replicated with shifted operands)
Basic principle:
• Compute one solution S=f(x)
• Compute a shifted solution S’=f(x.k)
• Verify whether S and S’ are consistent
[Violante, M. Politecnico di Torino, 2006]
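For a linear function the consistency test is simply S’ = k.S. The sketch below uses a sum as f, k = -2, and a hypothetical stuck-at-1 result bit as the fault model; all three are choices made here for illustration.

```python
def f(values, stuck_bit=None):
    """Sum of the inputs; stuck_bit forces that bit of the result to 1,
    modelling a fault that is present during both executions."""
    total = sum(values)
    if stuck_bit is not None:
        total |= 1 << stuck_bit
    return total


def ed4i_consistent(values, k=-2, stuck_bit=None):
    """Run f on the original and on the k-scaled data and compare."""
    s = f(values, stuck_bit)
    s_shifted = f([k * v for v in values], stuck_bit)
    return s_shifted == k * s


data = [3, 5, 9]
fault_free_ok = ed4i_consistent(data)
fault_detected = not ed4i_consistent(data, stuck_bit=2)

# Plain re-execution on the same data sees the same wrong value twice.
plain_duplication_agrees = f(data, stuck_bit=2) == f(data, stuck_bit=2)
```

Because the second run operates on different data, a fault that corrupts both runs tends to corrupt them differently, which is what makes the approach interesting for faults that outlast a single execution.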
Control-oriented Approaches
• Provide a solution for tolerating the effects of faults affecting the programs’ execution flow
Control Flow Errors
[Violante, M. Politecnico di Torino, 2006]
ECCA
• Introduced by Abraham, University of Texas, USA
• Used for detecting control-flow errors
Based on:
• Modifications to the program source code
• Trigger of division-by-zero exception for error detection
Basic approach:
• Assign an odd signature to each program’s basic block
• Maintain run-time signature with the currently executed basic block
• While entering a basic block, set the run-time signature according to the current basic block and check the correctness of the flow
• While exiting a basic block, set the run-time signature according to the next basic block
[Violante, M. Politecnico di Torino, 2006]
CFCSS
• Introduced by McCluskey, Stanford University, USA
• Used for detecting control-flow errors
Based on:
• Modifications to the program source code
• Use logic operations to track control-flow execution
Basic approach:
• Assign a signature to each program’s basic block
• During program execution, a run-time signature is continuously updated
• While entering a basic block:
  • The run-time signature is updated
  • The consistency of the run-time signature with a pre-defined one is evaluated
[Violante, M. Politecnico di Torino, 2006]
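The signature update can be sketched for a straight-line control-flow graph. The block names and signature values are made up, and the sketch covers only blocks with a single predecessor; nodes where several branches join need an extra adjusting signature that is omitted here.

```python
SIGNATURES = {"B1": 0b0001, "B2": 0b0010, "B3": 0b0100}   # one per basic block
PREDECESSOR = {"B2": "B1", "B3": "B2"}                     # legal flow: B1 -> B2 -> B3


def signature_difference(block):
    """Compile-time constant: XOR of the predecessor's and the block's signatures."""
    return SIGNATURES[PREDECESSOR[block]] ^ SIGNATURES[block]


def flow_is_consistent(path):
    """Replay a sequence of executed blocks and track the run-time signature."""
    g = SIGNATURES[path[0]]
    for block in path[1:]:
        g ^= signature_difference(block)      # update on block entry
        if g != SIGNATURES[block]:            # compare with the pre-defined signature
            return False
    return True


legal = flow_is_consistent(["B1", "B2", "B3"])
illegal_jump = flow_is_consistent(["B1", "B3"])            # B2 skipped
```

Only XOR and compare operations are added per block, which matches the low-cost argument; the comparison, however, needs a branch to an error handler.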
CFCSS
• Low-cost techniques:
  • Logic operations are not time consuming
  • Few operations are added, resulting in low code penalty
• Error detection is very critical: it changes the program’s graph by introducing a jump
[Violante, M. Politecnico di Torino, 2006]
YACCA
• Introduced by Massimo Violante, Politecnico di Torino, Italy
• Used for detecting control-flow errors
Based on:
• Modifications to the program source code
• Use logic operations to track control-flow execution
[Violante, M. Politecnico di Torino, 2006]
YACCA
Basic principle:
• Two signatures are assigned to each program’s basic block (enter and exit signatures, Bx1, Bx2)
• A run-time signature is constantly updated
• When entering a basic block:
  • Check the correctness of the execution
  • Set the run-time signature to the enter one
• When exiting a basic block:
  • Check the correctness of the execution
  • Set the run-time signature to the exit one
[Violante, M. Politecnico di Torino, 2006]
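The enter/exit discipline can be modelled as follows. The signature values, block names and predecessor table are hypothetical, and errors are accumulated in a flag rather than triggering a jump; the actual mask arithmetic used by YACCA to update the signature is not reproduced.

```python
ENTER = {"B1": 1, "B2": 3, "B3": 5}          # enter signatures (Bx1)
EXIT = {"B1": 2, "B2": 4, "B3": 6}           # exit signatures (Bx2)
PREDECESSORS = {"B1": set(), "B2": {"B1"}, "B3": {"B1", "B2"}}


def run(path):
    """Replay a sequence of executed blocks; return True if an error was flagged."""
    error = False
    code = None
    for index, block in enumerate(path):
        if index > 0:
            # Entering: the run-time signature must be the exit signature
            # of a legal predecessor.
            legal_codes = {EXIT[p] for p in PREDECESSORS[block]}
            error |= code not in legal_codes
        code = ENTER[block]
        # ... block body would execute here ...
        # Exiting: the signature must still be this block's enter signature,
        # which would not hold after a jump into the middle of the block.
        error |= code != ENTER[block]
        code = EXIT[block]
    return error


legal_path = run(["B1", "B2", "B3"])
legal_branch = run(["B1", "B3"])
illegal_path = run(["B1", "B3", "B2"])        # B3 is not a predecessor of B2
```

Since the check only sets a flag, no extra branch is inserted at each test point, in line with the claim that the program's graph is not modified.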
YACCA
• Low-cost techniques:
  • Logic operations are not time consuming
  • Few operations are added, resulting in low code penalty
• The program’s graph is not modified
[Violante, M. Politecnico di Torino, 2006]
Comparison
• Data increase:
  • Un-hardened program: 1.0
  • ABFT: 2.0x
  • ED4I: 1.9x
  • SWIFT+YACCA: 2.2x
[Violante, M. Politecnico di Torino, 2006]
Hybrid SIFT
• Software-only SIFT may introduce an unacceptable time penalty
• Moving some tasks to hardware may reduce this overhead
• Masking, detection, location, and recovery are implemented in software and in hardware
• Possible approaches:
o Lockstep execution
o Watchdogs
o Lightweight watchdogs
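As an illustration of the lockstep approach, the sketch below simulates two replicas executing the same steps, with a comparison after every step. In a hybrid scheme the comparison would typically be done by hardware; here everything is simulated in software, and the function names and the `fault_at` injection parameter are hypothetical.

```python
def accumulate(state, x):
    """Example step function: a 16-bit running sum."""
    return (state + x) & 0xFFFF


def lockstep_run(step, inputs, state=0, fault_at=None):
    """Run two replicas of `step` in lockstep and compare after every step.

    Returns (final_state, None) when the replicas always agree, or
    (None, index) for the first step at which they differ.
    """
    a = b = state
    for i, x in enumerate(inputs):
        a = step(a, x)
        b = step(b, x)
        if i == fault_at:
            b ^= 1  # simulation only: flip one bit in replica B
        if a != b:
            return None, i  # mismatch detected at step i
    return a, None
```

A fault-free run such as `lockstep_run(accumulate, [1, 2, 3])` returns the final state, while passing `fault_at=1` makes the comparison fail at step 1. Lockstep alone only detects the mismatch; deciding which replica is wrong needs a third copy or a rollback.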
Recently proposed solutions (5 of 6)
Working at system (software and hardware) level
SWAT: SoftWare Anomaly Treatment
Abstraction levels shown: System Level, Algorithm Level, Architecture Level, Circuit Level, Component Level, Technology Level
Li, M.-L.; Ramachandran, P.; Sahoo, S. K.; Adve, S.; Adve, V.; and Zhou, Y. Understanding the propagation of hard errors to software and implications for resilient system design. In Proc. of the 13th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 2008.
Main concepts
• Detection of errors when they affect software behavior is preferable to detection at hardware level
• SWAT exploits this concept to achieve low-cost error detection for cores at software level, by checking for:
o Fatal exceptions
o Program crashes or hangs
o Unusually high amount of operating system activity
• Some hardware errors that do not manifest themselves in software behaviors are not detected by SWAT
• SWAT suffers from the drawbacks of high level error detection mechanisms that will be discussed later
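The symptom checks listed above can be sketched as a simple classifier over observations of software behavior. This is an illustrative model only, not SWAT's implementation: the function name, its parameters and both threshold values are assumptions made for the example.

```python
HANG_THRESHOLD = 1_000_000    # hypothetical: cycles without forward progress
HIGH_OS_THRESHOLD = 50_000    # hypothetical: consecutive OS instructions


def detect_symptoms(fatal_exception=False, crashed=False,
                    cycles_without_progress=0, os_instructions=0):
    """Return the list of anomalous software symptoms that were observed."""
    symptoms = []
    if fatal_exception:
        symptoms.append("fatal exception")
    if crashed:
        symptoms.append("crash")
    if cycles_without_progress > HANG_THRESHOLD:
        symptoms.append("hang")
    if os_instructions > HIGH_OS_THRESHOLD:
        symptoms.append("high OS activity")
    return symptoms
```

An empty list means no symptom was seen, which is exactly the case in which a hardware error that never disturbs software behavior goes undetected.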
Recently proposed solutions (6 of 6)
Working at lower levels to detect errors and at higher system levels to correct them.
Layers (top to bottom): Application Layer, Middleware/Architectural Layer, Configurable/Programming Layer, Register/Logic Layer, Technology Layer
Albrecht, C.; Koch, R.; Pionteck, T.; and Glösekötter, P. Towards a Flexible Fault-Tolerant System-on-Chip. 22nd International Conference on Architecture of Computing Systems, 2009.
• Each layer has specific fault tolerance mechanisms:
o Detection is cheaper at lower layers
o Correction is better performed at higher layers
• Lower layers notify upper layers when an error is detected
• Upper layers send reconfiguration information to lower layers according to application requirements
• Key issue: the interfaces between layers, which report errors and convey the level of reliability the application requires
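A minimal sketch of such an inter-layer interface is shown below, with errors reported upward and configuration sent downward. The `Layer` class and its method names are hypothetical and are not taken from the cited work.

```python
class Layer:
    """One layer of the stack, linked to the layers directly above and below."""

    def __init__(self, name, upper=None):
        self.name = name
        self.upper = upper
        self.lower = None
        self.detection_enabled = True
        self.log = []  # errors reported by the layer below
        if upper is not None:
            upper.lower = self

    def report_error(self, description):
        """Called when this layer detects an error: notify the layer above."""
        if not self.detection_enabled:
            return False  # detection switched off by the upper layer
        if self.upper is not None:
            self.upper.on_error(self.name, description)
        return True

    def on_error(self, source, description):
        """Receive an error notification from the layer below."""
        self.log.append((source, description))

    def configure_lower(self, detection_enabled):
        """Send reconfiguration information to the layer below."""
        if self.lower is not None:
            self.lower.detection_enabled = detection_enabled


application = Layer("application")
middleware = Layer("middleware", upper=application)
middleware.report_error("module A mismatch")   # travels up to the application
application.configure_lower(False)             # application relaxes reliability
```

After the last call, further errors detected by `middleware` are no longer reported, which models an application that does not need the extra reliability.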
Sample roles of layers
• Technology layer
o Built-in current sensors detect transient upsets
o Upper layer can configure detection capabilities
• Register/Logic layer
o EDAC used to harden memories
o TMR used to harden logic
o Upper layer can enable/disable detection mechanisms
• Configuration/Programming layer (in reconfigurable platforms)
o Reconfiguration can be used to disable faulty modules
o Periodic relocation of active modules reduces degradation
Sample roles of layers
• Middleware/Architectural layer
o Applies well-known redundancy techniques such as TMR at component level
o Redundant modules are designed independently to allow detection of SEUs and design errors
o Test mechanisms can be used to check modules at run time
o Checkpoints can be used to allow error recovery
• Application layer
o Almost everything can be used to improve reliability at this level
o Software implemented TMR, EDAC and other techniques can be used
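Software implemented TMR, mentioned for the application layer, can be sketched as three executions followed by a bitwise majority vote. The helper names and the `faults` parameter (XOR masks used only to simulate upsets) are assumptions for this example, which handles non-negative integer results.

```python
def vote(a, b, c):
    """Bitwise majority of three integer results."""
    return (a & b) | (a & c) | (b & c)


def tmr(func, *args, faults=(0, 0, 0)):
    """Run `func` three times and vote on the results.

    `faults` holds one XOR mask per replica and exists only to simulate
    bit flips. Returns (voted_result, disagreement_seen).
    """
    results = [func(*args) ^ mask for mask in faults]
    voted = vote(*results)
    return voted, any(r != voted for r in results)


def add(x, y):
    return x + y
```

With no fault, `tmr(add, 2, 3)` returns the result and reports no disagreement; with a bit flip in a single replica the vote still yields the correct value and the disagreement flag is set. Two replicas hit in the same bit position would defeat the vote.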
Conclusions
• New low-cost mitigation techniques, providing error detection and error correction, must be developed
• Circuit level approaches can be better than TMR, but still impose significant area and power overheads
• Algorithm level mitigation is a better approach, but it is hard to generalize and automate
High level error detection: pros and cons
[Sorin, 2009]
• Checking at a higher level:
o reduces hardware costs
o reduces the number of false positives
o is necessary anyway for certain types of errors
• However, it:
o provides little diagnostic information (type and location)
o has longer and potentially unbounded error detection latency
o may make the recovery process more complex
Final Remark
• There is NO silver bullet!
• Combine hardware and software based techniques at different levels
• Leverage the specific strengths of each technique at each level.
Thank You !
Questions ?
Copy of slides available at http://www.inf.ufrgs.br/~calisboa/IESS2009
References (in order of appearance)
• BLOME, J. A., GUPTA, S., FENG, S., and MAHLKE, S. Cost-efficient soft error protection for embedded microprocessors. In: INTERNATIONAL CONFERENCE ON COMPILERS, ARCHITECTURE AND SYNTHESIS FOR EMBEDDED SYSTEMS, CASES 2006, 2006, Proceedings… Los Alamitos, USA: IEEE Computer Society, 2006, p. 421-431.
• DODD, P. et al. Production and propagation of single-event transients in high-speed digital logic ICs. IEEE Transactions on Nuclear Science, Los Alamitos, USA: IEEE Computer Society, 2004, v. 51, n. 6 (part 2), p. 3278-3284.
• FERLET-CAVROIS, V. et al. Statistical analysis of the charge collected in SOI and bulk devices under heavy ion and proton irradiation: implications for digital SETs. IEEE Transactions on Nuclear Science, Los Alamitos, USA: IEEE Computer Society, 2006, v. 53, n. 6 (part 1), p. 3242-3252.
• ROSSI, D. et al. Multiple transient faults in logic: an issue for next generation ICs? In: IEEE INTERNATIONAL SYMPOSIUM ON DEFECT AND FAULT TOLERANCE IN VLSI SYSTEMS, 20., DFT 2005, 2005, Monterey, USA. Proceedings… Los Alamitos, USA: IEEE Computer Society, 2005, p. 352-360.
• ANGHEL, L.; NICOLAIDIS, M. Cost reduction and evaluation of a temporary faults detection technique. In: DESIGN, AUTOMATION AND TEST IN EUROPE CONFERENCE, 2000, DATE 2000, Paris, FRA. Proceedings… New York, USA: ACM Press, 2000, p. 591-598.
• NIEUWLAND, A.; JASAREVIC, S.; JERIN, G. Combinational logic soft error analysis and protection. In: IEEE INTERNATIONAL ON-LINE TEST SYMPOSIUM, 12., IOLTS 2006, Lake of Como, ITA. Proceedings… Los Alamitos, USA: IEEE Computer Society, 2006. p. 99-104.
• SORIN, D. J., Fault Tolerant Computer Architecture, Morgan & Claypool, USA : 2009
• PRADHAN, D. Fault-tolerant computer system design. Upper Saddle River, USA : Prentice-Hall, 1995.
• BAUMANN, R. Soft errors in advanced computer systems. IEEE Design and Test of Computers, New York, USA: IEEE Computer Society, 2005, v. 22, n. 3, p. 258-266.
• HAMMING, R. Error Detecting and Error Correcting Codes. The Bell System Technical Journal, 1950, v. 29, n. 2, p. 147-160.
• ALMUKHAIZIM, S. and MAKRIS, Y., “Fault Tolerant Design of Combinational and Sequential Logic based on a Parity Check Code”, in Proceedings of the 18th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT 2003), IEEE Computer Society, Los Alamitos, CA, October 2003, pp. 344-351.
• FREIVALDS, R. Fast probabilistic algorithms. In: FREIVALDS, R. Mathematical Formulations of CS. New York, USA: Springer-Verlag, 1979. p. 57-69. (Lecture Notes in Computer Science).
• LISBOA, C. A., ERIGSSON, M. I., and CARRO, L. System level approaches for mitigation of long duration transient faults in future technologies. In: IEEE EUROPEAN TEST SYMPOSIUM, 12., ETS 2007, Freiburg, DEU. Proceedings… Los Alamitos, USA: IEEE Computer Society, 2007, p. 165-170.
• LISBOA, C.; ARGYRIDES, C.; PRADHAN, D.; and CARRO, L. Algorithm level fault tolerance: a technique to cope with long duration transient faults in matrix multiplication algorithms. In: IEEE VLSI TEST SYMPOSIUM, 26., VTS 2008, San Diego, USA. Proceedings… [S.l.: s.n.], 2008.
• LISBOA, C. et al. Invariant checkers: an efficient low cost technique for run-time transient errors detection. In: IEEE INTERNATIONAL ON-LINE TESTING SYMPOSIUM, 15., IOLTS 2009, Sesimbra, POR. Proceedings… [S.l.: s.n.], 2009.
• REBAUNDENGO, M. et al. Soft-error detection through software fault-tolerance techniques. In: IEEE INTERNATIONAL SYMPOSIUM ON DEFECT AND FAULT TOLERANCE IN VLSI SYSTEMS, 14., DFT1999, 1999, Albuquerque, USA. Proceedings… New York, USA: IEEE Computer Society, 1999, p. 210-218.
• GOLOUBEVA, O. et al. Soft error detection using control flow assertions. In: INTERNATIONAL SYMPOSIUM ON DEFECT AND FAULT TOLERANCE, 18., 2003, Boston, USA. Proceedings… Los Alamitos, USA: IEEE Computer Society, 2003, p. 581-588.
• BENSO, A. et al. PROMON: a profile monitor of software applications. In: IEEE WORKSHOP ON DESIGN AND DIAGNOSTICS OF ELECTRONIC CIRCUITS AND SYSTEMS, 8., DDECS05, Sopron, HUN. Proceedings… New York, USA: IEEE Computer Society, 2005, p. 81-86.
• [DAIKON] ERNST, M.; COCKRELL, J.; GRISWOLD, W. Dynamically discovering likely program invariants to support program evolution. IEEE Transactions on Software Engineering. New York, USA: IEEE Computer Society, 2001, v. 27, n. 2, p.99–123.
• KASTENSMIDT, F.; CARRO, L.; REIS, R. Fault-Tolerance Techniques for SRAM-Based FPGA. New York, USA: Springer, 2006, 183 p.
• [ABFT] HUANG, K.; ABRAHAM, J. Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers. New York, USA : IEEE Computer Society, 1984, v. C-33, n. 6, p. 518-528.
• [EDDI] OH, N., SHIRVANI, P. P., McCLUSKEY, E.J. EDDI: Error Detection by Duplicated Instructions. IEEE Transactions on Reliability, IEEE Reliability Society ,2002, v. 51, n. 1, p. 63-75.
• [ED4I] OH, N.; MITRA, S.; McCLUSKEY, E. J. ED4I: error detection by diverse data and duplicated instructions. IEEE Transactions on Computers, IEEE Computer Society, 2002, v. 51, n. 2, p. 180-199.
• [ECCA] ALKHALIFA, Z. et al. Design and evaluation of system-level checks for on-line control flow error detection. IEEE Transactions on Parallel and Distributed Systems, New York, USA: IEEE Computer Society, 1999, v. 10, n. 6, p. 627-641.
• [YACCA] VIOLANTE, M. Dependability assurance by design. Internal report, Politecnico di Torino, Italy, available at http://www.cad.polito.it/~sonza/diistp03/lucidi/2007/03-assurance.pdf.
• [SWAT] LI, M.-L.; Ramachandran, P.; Sahoo, S. K.; Adve, S.; Adve, V.; and Zhou, Y. Understanding the propagation of hard errors to software and implications for resilient system design. In Proc. of the 13th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 2008.
• ALBRECHT, C. et al. Towards a Flexible Fault-Tolerant System-on-Chip. In: INTERNATIONAL CONFERENCE ON ARCHITECTURE OF COMPUTING SYSTEMS, 22., 2009, ARC 2009, Karlsruhe, GER. Proceedings… Berlin, GER: VDE Verlag GMBH, 2009, p. 83-90.