New challenges for designers of fault tolerant Embedded Systems based on future technologies
Carlos Arthur Lang Lisbôa, Luigi Carro
Instituto de Informática, Programa de Pós-Graduação em Computação, Universidade Federal do Rio Grande do Sul - Porto Alegre, RS, Brazil
IESS - Schloß Langenargen, Germany – September 15th, 2009
Outline
• Introduction: concepts and definitions
• Motivation: new challenges imposed by future technologies
• Radiation induced faults: the major challenges
• Existing mitigation techniques vs. the new scenario
• Desired properties of new radiation induced faults mitigation techniques
• Recent solutions working at different abstraction levels to deal with transient faults
• Conclusions
Luigi Carro IESS - Schloβ Langenargen, Germany, September 15th, 2009 2
Concepts and Definitions
• Faults
• Errors
• Failures
Concepts and Definitions
• Duration of errors and faults
o Permanent
o Transient
o Intermittent
Technology trends (1)
• Transistor size
Device sizes are decreasing
Node capacitances are decreasing
Technology trends (2)
• Transistor Vth
Power Supply
Threshold Voltage
Node voltages are decreasing
Single event upset
A transistor changes from OFF to ON state!
SEE and Technology trends (1)
• Consequences of C and V reduction
HIGH C + HIGH V → HIGH Q = C.V
SEE and Technology trends (2)
• Consequences of C and V reduction
LOW C + LOW V → LOW Q = C.V
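Read together, the two slides compare the charge stored on a node before and after scaling. One hedged way to write this out is shown below; the scaling factors s_C and s_V are symbols introduced here for illustration and do not appear in the slides.

```latex
Q = C \cdot V, \qquad
Q' = (s_C\,C)\,(s_V\,V) = s_C\,s_V\,Q, \qquad
0 < s_C,\ s_V < 1 \;\Rightarrow\; Q' < Q
```

With less charge held on each node, a smaller amount of deposited charge is enough to disturb its logic value.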
Concepts and Definitions
• Radiation Induced Faults
o Single Event Effects – SEEs
o Single Event Transients – SETs
o Single Event Upsets – SEUs
o Soft Error – SE
o Multiple Bit Upsets – MBUs
Concepts and Definitions
• Masking of faults and errors
o Logical
o Latching window
o Electrical
o Architectural
o Software
Example of Fault Masking in Microprocessors
• Logical: the faulty value does not affect the logical operation of the circuit
• Latching-Window: the fault pulse does not reach a state element within the latching window
(figure labels: CLK, tsetup, thold)
• Electrical: the fault pulse is electrically attenuated by subsequent gates in the circuit
• Architectural/Software: incorrect state is written before it is read
mov r5, 8
mov r2, 4
add r6, r2, r5
(figure: decoder and register file contents shown over four animation steps)
[Blome et al, CASES, 2006]
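The architectural masking case can be sketched in a few lines of C. This is only an illustration under assumptions: the `RegFile` type, the `inject_bit_flip` helper and the `fault_step` parameter are hypothetical names, and the bit flip stands in for an upset in register r2 of the three-instruction sequence above.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_REGS 8

/* Hypothetical register file; r[i] models register ri. */
typedef struct { int32_t r[NUM_REGS]; } RegFile;

/* Flip one bit of one register to model a radiation-induced upset. */
void inject_bit_flip(RegFile *rf, int reg, int bit) {
    rf->r[reg] ^= (int32_t)(1u << bit);
}

/*
 * Runs: mov r5, 8 ; mov r2, 4 ; add r6, r2, r5
 * fault_step selects where r2 is corrupted:
 *    0 = before "mov r2, 4" (the write overwrites the bad value)
 *    1 = after  "mov r2, 4" (the add reads the bad value)
 *   -1 = no fault
 * Returns the final value of r6.
 */
int32_t run_sequence(int fault_step, int bit) {
    RegFile rf = {{0}};
    rf.r[5] = 8;                                    /* mov r5, 8 */
    if (fault_step == 0) inject_bit_flip(&rf, 2, bit);
    rf.r[2] = 4;                                    /* mov r2, 4 */
    if (fault_step == 1) inject_bit_flip(&rf, 2, bit);
    rf.r[6] = rf.r[2] + rf.r[5];                    /* add r6, r2, r5 */
    return rf.r[6];
}
```

Calling `run_sequence(0, 3)` gives the same r6 as the fault-free run because the corrupted r2 is overwritten before it is read, while `run_sequence(1, 3)` lets the corrupted value reach the add.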
Motivation: Future Technologies
• The good news: ☺
o Smaller devices → Denser circuits, less area
o Faster devices → Higher performance
o Less power consumption → Longer battery life (portable systems)
Motivation: Future Technologies
• The bad news:
o Higher defect rates → Lower yield
o Higher sensitivity to radiation
→ Increased SER: combinational logic
→ Multiple simultaneous faults
→ Long duration transients
Major Challenges
• Long Duration Transients (LDTs)
Transient widths are scaling more slowly than device speed, so transient pulses will last longer than the cycle times of circuits.
Temporal redundancy techniques will not cope.
• Multiple Simultaneous Faults
Smaller distances between devices will allow a single particle to affect more than one device.
The single fault model will fail.
Transient width studies
(figures from [Dodd, 2004] and [Ferlet-Cavrois, 2006])
Propagation delay(*) vs. Technologies
Technology (nm)      180     130     90      32     180/32
10-inverter chain    508.4   157.8   120.2   79.6   6.39
(figure labels: in, out, clk; 32 nm, 90 nm, 130 nm, 180 nm)
(*) simulated using parameters from PTM web site and HSPICE tool
Transient widths vs. Propagation delays
(figure: cycle time and transient width scaling across technologies. x axis: Technology (180nm, 130nm, 100nm, 90nm, 70nm, 32nm); y axis: Cycle time (ps); series: Width 20MeV, Width 10MeV, Cycle 10 Inv, Cycle 8 Inv, Cycle 6 Inv, Cycle 4 Inv; annotations: "Transient width scaling: max. 1.37 x" and "6.39 x" (*))
(*) 180, 130, and 100nm from [DODD, 2004], 70 nm from [Ferlet-Cavrois 2006]
Single event, multiple effects [Rossi 2005 *]
[*] Multiple Transient Faults in Logic: An Issue for Next Generation ICs?, Daniele Rossi et al, DFT 2005
LDT Effects on Temporal Redundancy
• Time Redundancy [Anghel et al, 2000]
Increase delay? ⇒ Higher performance penalty !!!
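Why a long transient defeats time redundancy can be sketched with a small timing model. This is a minimal illustration of the general idea (sample twice, a fixed delay apart, and compare), not the circuit from [Anghel et al, 2000]; the `Transient` type, the function names and the time units are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* A transient modelled as the interval [start, start + width) during which
 * the observed signal is inverted. Times are in arbitrary units. */
typedef struct { int start; int width; } Transient;

/* Value seen at time t for a node whose fault-free value is `good`. */
static bool observe(bool good, Transient tr, int t) {
    bool hit = (t >= tr.start) && (t < tr.start + tr.width);
    return hit ? !good : good;
}

/* Time redundancy: sample at t and at t + delay, flag an error on mismatch. */
bool detects(bool good, Transient tr, int t, int delay) {
    return observe(good, tr, t) != observe(good, tr, t + delay);
}

/* True when both samples are wrong yet agree, so the error goes unnoticed. */
bool escapes(bool good, Transient tr, int t, int delay) {
    bool s1 = observe(good, tr, t);
    bool s2 = observe(good, tr, t + delay);
    return (s1 == s2) && (s1 != good);
}
```

A pulse shorter than the sampling delay corrupts at most one sample and is flagged. A pulse wider than the delay can corrupt both samples identically, and widening the delay to cover it is exactly the performance penalty the slide points to.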
LDT Effects on Space Redundancy
• Space Redundancy [Nieuwland et al, 2006]
Cannot cope with long duration transients !!!
LDT Effects on Space Redundancy
- DMR can cope with LDTs affecting one of the modules
- allows detection only, requires recomputation
- area and power overheads above 100% (too much for ES)
- weak point: comparator
LDT Effects on Space Redundancy
- TMR can cope with LDTs affecting one of the modules
- allows detection and correction
- area and power overheads above 200% (too much for ES)
- weak point: voter
• It is an interesting open problem.
• If forecasts of greatly increased fault rates come to pass, error detection schemes targeting single error scenarios may be insufficient.
• Most of current schemes assume a single error scenario.
• Some existing schemes may do well, but there are no results demonstrating that capability.
[*] Fault Tolerant Computer Architecture, Daniel J. Sorin, Morgan & Claypool, 2009
Multiple Effects vs. Space Redundancy
- DMR: what if a single particle affects two modules?
- different output bits affected (O1i, O2j) → OK
- same output bit affected (O1k, O2k) → PROBLEM! Comparator will not detect error
Multiple Effects vs. Space Redundancy
- TMR: what if a single particle affects two modules?
- different output bits affected (O1i, O2j) → no majority!
- same output bit affected (O1k, O2k) → EVEN WORSE → Voter will select erroneous output!
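The four cases above can be reproduced with a small model of the comparator and the voter. This is a sketch under assumptions: module outputs are 8-bit words, the voter works at word level (it needs two identical outputs), and the names `dmr_error_detected`, `tmr_vote` and `VoteStatus` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { VOTE_OK, VOTE_NO_MAJORITY } VoteStatus;

/* DMR: the comparator only signals whether the two module outputs differ. */
bool dmr_error_detected(uint8_t o1, uint8_t o2) {
    return o1 != o2;
}

/* TMR with a word-level voter: selects a value produced by at least two modules. */
VoteStatus tmr_vote(uint8_t o1, uint8_t o2, uint8_t o3, uint8_t *out) {
    if (o1 == o2 || o1 == o3) { *out = o1; return VOTE_OK; }
    if (o2 == o3)             { *out = o2; return VOTE_OK; }
    return VOTE_NO_MAJORITY;
}
```

When the same bit is flipped in two modules, the comparator sees two equal words and the voter sees a two-out-of-three majority, so in both schemes the wrong value passes as if it were correct.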
Analysis
• Currently known mitigation techniques based on temporal redundancy cannot cope with LDTs.
• Space redundancy based mitigation techniques:
- able to cope with LDTs;
- may fail when subject to multiple faults;
- impose very high area and power overheads;
- not suited for the Embedded Systems arena.
• The development of new low cost techniques to face those new challenges is mandatory.
Desired properties of new approaches
• Tolerance to LDTs and multiple simultaneous faults.
• Error detection area overhead << DMR
• Error correction area overhead << TMR
• Low performance overhead
• Additional concern for Embedded Systems: low power consumption
Suggested approach
Work at higher abstraction levels with low cost
System Level
Algorithm Level
Architecture Level
Circuit Level
Component Level
Technology Level
“Computer users do not notice if a transistor fails or a bit of SRAM is flipped by a cosmic ray; they notice when their programs crash” [Sorin, 2009]
Recently proposed solutions (1 of 6)
Working at circuit level with low cost to cope with increased SER in combinational logic
System Level
Algorithm Level
Architecture Level
Circuit Level (Combinational Hamming)
Component Level
Technology Level
SER evolution [*]
[*] Baumann, R., “Soft Errors in Advanced Computer Systems”, IEEE Design and Test of Computers, vol. 22, no. 3,
Vector r: random 0’s and 1’s
[*] Freivalds, R. 1979. Fast probabilistic algorithms. In Mathematical Foundations of Computer Science. Lecture Notes in Computer Science, vol. 74. Springer-Verlag, New York, pp. 57–69.
• The main difference w.r.t. Freivalds’ technique is that here the vector r has only 1’s.
• This means that to calculate Ar and Cr only additions are needed, no multiplications.
• The computational cost of verification is thereby significantly decreased.
[*] Lisbôa, C. A., Erigson, M. I., and Carro, L., “System level approaches for mitigation of long duration transient faults in future technologies”, in Proceedings of the 12th IEEE European Test Symposium - ETS 2007, pp. 165-170, IEEE Computer Society, Los Alamitos, CA, May 2007.
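The check with an all-ones vector r can be sketched as follows. This is a minimal illustration, not the authors' implementation: it verifies a claimed product C = A × B by comparing A × (B r) with C r, where multiplying by r reduces to summing each row. The matrix size `N` and the function names are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define N 3

/* v = M * r with r = (1, ..., 1): each entry is a row sum, additions only. */
static void row_sums(long m[N][N], long v[N]) {
    for (int i = 0; i < N; i++) {
        v[i] = 0;
        for (int j = 0; j < N; j++) v[i] += m[i][j];
    }
}

/* Checks the claimed product c = a * b by comparing a * (b * r) with c * r. */
bool product_looks_correct(long a[N][N], long b[N][N], long c[N][N]) {
    long br[N], cr[N];
    row_sums(b, br);
    row_sums(c, cr);
    for (int i = 0; i < N; i++) {
        long abr = 0;
        for (int j = 0; j < N; j++) abr += a[i][j] * br[j];  /* one matrix-vector product */
        if (abr != cr[i]) return false;
    }
    return true;
}
```

The verification costs one matrix-vector product plus row sums instead of a second full matrix product. With a fixed all-ones r, errors within the same row of C that cancel each other in the sum would pass this check.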
• compute vectors Br and BrT (only sums)
(figure: n × n matrix B11 ... Bnn; Σ over each row gives Br1, Br2, ..., Brn; Σ over each column gives BrT1, BrT2, ..., BrTn)
[*] Lisboa, C.; Argyrides, C.; Pradhan, D.; and Carro, L., “Algorithm Level Fault Tolerance: a Technique to Cope with Long Duration Transient Faults in Matrix Multiplication Algorithms”, in Proceedings of the 26th IEEE VLSI Test Symposium (VTS 2008), San Diego, CA, USA, April 2008.
• compute vectors Br and BrT (only sums)
• compute vectors ABr = A × Br and ABrT = A × BrT
• compute vectors Cr and CrT (only sums)
(figure: n × n matrix C11 ... Cnn; Σ over each row gives Cr1, Cr2, ..., Crn; Σ over each column gives CrT1, CrT2, ..., CrTn)
Recently proposed solutions (3 of 6)
Working at algorithm level with low cost for runtime error detection
System Level
Algorithm Level (Using Invariants for Runtime Error Detection)
Architecture Level
Circuit Level
Component Level
Technology Level
Goal
• Achieve tolerance to long duration transient pulses
• at algorithmic level
• with low performance overhead
• in an automatic fashion
• generalized to other algorithms
Alternative approaches
• Software based error detection techniques
• Duplication with Comparison: increases memory usage and execution time. [Rebaudengo et al, 1999]
• Self Checking Block Signatures: imposes coding and performance penalties. [Goloubeva et al, 2003]
• Use of object oriented languages and libraries in some approaches leads to increased memory footprint and requires source code modification. [Benso, 2005]
Alternative approaches
• An algorithm level technique is proposed in [Lisboa, 2007] for matrix multiplication hardening
• Far less computational cost than recompute and compare (32x32 matrix – only 4.97% time increase).
• Explores algorithm properties: conditions that hold after the execution of the algorithm - known as program invariants or post conditions - are checked.
IDEA
Use algorithm properties as a means of run-time error detection.
Subject technique
• Invariants
• Properties that always hold during program execution:
  • Pre-conditions
  • Post-conditions
  • Loop invariants
• Usually used in the software engineering arena, to check if a program performs its tasks as expected after maintenance.
Subject technique
• Daikon Tool [Ernst et al, 2001]
• Automatically detects potential invariants for a given program.
• Identification of a testable set of invariants is feasible for small programs.
• Linear relationships between up to 3 variables.
• Limited support for complex data structures.
Methodology
• Fault injection campaigns
• Main program is divided into smaller, less complex, pieces of code.
• Daikon is used to extract the invariants of each part.
• Verification code is appended after the algorithm code.
(figure: main(){ Program Body } → decompose → Program Slices → Invariant Detector → Invariants → Include Verification Code)
Methodology
• Fault coverage and performance evaluation
(figure: flowchart. Modified Code: main(){ Program Slice + Verification, repeated for each slice }. Fault Coverage Evaluation: Generate Reference, Random Fault Setup, Fault Injection, Check Detection, "F times?" (No: repeat, Yes: continue), Analysis Report. Performance Evaluation: Timing Report.)
Methodology
• Reference and execution results are compared.
• The outcome of that comparison is checked against the verification flag.
• Statistical analysis with report generation.
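A fault injection campaign of this kind can be sketched as a loop that compares each faulty run with a reference and cross-checks the verification flag. Everything in this example is an assumption made for illustration: the slice (a squaring routine), its invariant (a square is 0 or 1 modulo 4), and the exhaustive sweep over bit positions, which replaces the random fault setup of the original flow so that the result is reproducible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint32_t result; bool flagged; } Run;
typedef struct { int masked, detected, undetected; } Report;

/* Hypothetical slice under test: computes x * x, with one bit of the result
 * flipped when fault_bit >= 0. The appended verification code checks a
 * property of any fault-free result: a square is 0 or 1 modulo 4. */
Run square_slice(uint32_t x, int fault_bit) {
    Run r;
    r.result = x * x;
    if (fault_bit >= 0) r.result ^= (1u << fault_bit);   /* injected upset */
    r.flagged = (r.result % 4u) > 1u;                    /* verification flag */
    return r;
}

/* Inject one fault per bit position and classify each run against the reference. */
Report campaign(uint32_t x) {
    Report rep = {0, 0, 0};
    uint32_t reference = square_slice(x, -1).result;
    for (int bit = 0; bit < 32; bit++) {
        Run r = square_slice(x, bit);
        if (r.result == reference) rep.masked++;       /* fault had no effect */
        else if (r.flagged)        rep.detected++;     /* wrong result, flag raised */
        else                       rep.undetected++;   /* wrong result, flag silent */
    }
    return rep;
}
```

For x = 6 only the flip of bit 1 breaks the invariant, so the report shows very low coverage. That is the kind of outcome the evaluation step is meant to expose before an invariant is relied on.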
Experimental results and analysis
• The subject methodology was applied to a test program, split into 5 code pieces:
  • Evaluation of the Baskara formula ( domain ).
  • Iterative integer multiplication.
  • Conditional statement execution.
  • Arithmetic expression evaluation.
  • Square root calculation.
Experimental results and analysis
Test case program
/* mult() */
while(k2>0){
  if ((k2%2)==0 ){
    k2/=2;
    x2+=x2;
  }
  else{
/* baskara() */
x1=-1.1;
x2=-1.1;
if (a==0 && b!=0){
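How a loop invariant can act as a run-time detector is sketched below on an iterative multiplication in the style of the mult() fragment. This is a hypothetical reconstruction, not the test program from the slides: the odd branch, the invariant acc + x * n == a * k, and the `fault_iter` parameter used to inject an upset are all assumptions. The check uses the machine multiplier for clarity; a deployed check might rely on cheaper properties.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Multiplies a by k using doubling and halving.
 * Loop invariant:  acc + x * n == a * k (modulo 2^32).
 * Postcondition:   acc == a * k.
 * *ok is cleared as soon as either check fails.
 * fault_iter >= 0 flips bit 0 of acc in that iteration (demo fault model).
 */
uint32_t mult_checked(uint32_t a, uint32_t k, int fault_iter, bool *ok) {
    uint32_t x = a, n = k, acc = 0;
    *ok = true;
    for (int iter = 0; n > 0; iter++) {
        if (n % 2 == 0) { n /= 2; x += x; }
        else            { n -= 1; acc += x; }
        if (iter == fault_iter) acc ^= 1u;       /* injected upset */
        if (acc + x * n != a * k) *ok = false;   /* loop invariant check */
    }
    if (acc != a * k) *ok = false;               /* postcondition check */
    return acc;
}
```

A fault-free call such as `mult_checked(7, 6, -1, &ok)` leaves `ok` set, while injecting the upset in any iteration breaks the invariant immediately and clears it.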
Data-oriented Approaches
• Provide a solution for tolerating the effects of faults affecting the data the program manipulates
SWIFT
• Introduced by Rebaudengo, Politecnico di Torino, Italy
• Used for hardening any operation among variables
• Based on automatic algorithm-level modifications that introduce information (duplication code) and time redundancies
[Violante, M. Politecnico di Torino, 2006]
SWIFT
Basic principle:
• Each variable must be replicated two times
• Each operation among variables must be replicated two times
• After every usage of a variable, its two replicas must be checked for consistency
[Violante, M. Politecnico di Torino, 2006]
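The three rules can be applied by hand to a tiny computation to see what the transformed code looks like. This is an illustrative sketch only: the function `hardened_compute`, the `check` helper and the `corrupt_a0` switch that models an upset in one replica are hypothetical, and a real tool would generate such code automatically.

```c
#include <assert.h>
#include <stdbool.h>

/* Consistency check between the two replicas of a variable. */
static void check(int v0, int v1, bool *err) {
    if (v0 != v1) *err = true;
}

/* Hardened version of:  c = a + b;  return c * 2;
 * corrupt_a0 models an upset in replica a0 after it was written. */
int hardened_compute(int a, int b, bool corrupt_a0, bool *err) {
    *err = false;
    int a0 = a, a1 = a;              /* every variable is replicated */
    int b0 = b, b1 = b;
    if (corrupt_a0) a0 ^= 0x10;      /* injected upset (demo only) */

    int c0 = a0 + b0;                /* every operation is replicated */
    int c1 = a1 + b1;
    check(a0, a1, err);              /* replicas checked after each use */
    check(b0, b1, err);

    int r0 = c0 * 2;
    int r1 = c1 * 2;
    check(c0, c1, err);
    check(r0, r1, err);
    return r0;
}
```

The checks make the data and time overhead of this style visible: storage for every variable doubles, and each original statement gains a duplicate plus one comparison per operand.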
SWIFT
Success stories:
• Motorola 68040
• Intel 8051
• IBM PowerPC
• Gaisler LEON1/LEON2
Fault models:
• SEUs
• SETs
[Violante, M. Politecnico di Torino, 2006]
ED4I
• Introduced by McCluskey, Stanford University, USA
• Used for hardening any operation among variables
• Based on algorithm-level modifications that introduce time redundancies (replicated with shifted operands)
Basic principle:
• Compute one solution S=f(x)
• Compute a shifted solution S’=f(x.k)
• Verify whether S and S’ are consistent
[Violante, M. Politecnico di Torino, 2006]
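For a function that is linear in its input, "consistent" can be taken to mean S’ = k · S. The sketch below uses a summation as f and a stuck-at-0 output bit as the fault; both, along with the function names and the choice k = -2, are assumptions made for this example. Because the fault persists across both runs, it also stands in for a transient that outlasts the computation.

```c
#include <assert.h>
#include <stdbool.h>

/* Models an adder whose output bit `stuck_bit` is stuck at 0
 * (stuck_bit < 0 means fault-free). Hypothetical fault model. */
static int add(int a, int b, int stuck_bit) {
    int s = a + b;
    if (stuck_bit >= 0) s &= ~(1 << stuck_bit);
    return s;
}

/* f(k * x): sums the inputs, each scaled by k, on the (possibly faulty) adder. */
static int sum(const int *x, int n, int k, int stuck_bit) {
    int s = 0;
    for (int i = 0; i < n; i++) s = add(s, k * x[i], stuck_bit);
    return s;
}

/* Runs f(x) and f(k * x) on the same unit and checks S' == k * S. */
bool ed4i_consistent(const int *x, int n, int k, int stuck_bit, int *result) {
    int s  = sum(x, n, 1, stuck_bit);
    int sk = sum(x, n, k, stuck_bit);
    *result = s;
    return sk == k * s;
}
```

Plain re-execution would produce the same wrong sum twice. Scaling the operands makes the fault hit different bit patterns in the second run, so the two results stop agreeing.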
• Provide a solution for tolerating the effects of faults affecting the program’s execution flow
Control Flow Errors
[Violante, M. Politecnico di Torino, 2006]
ECCA
• Introduced by Abraham, University of Texas, USA
• Used for detecting control-flow errors
Based on:
• Modifications to the program source code
• Trigger of division-by-zero exception for error detection
Basic approach:
• Assign an odd signature to each program’s basic block
• Maintain run-time signature with the currently executed basic block
• While entering a basic block, set the run-time signature according to the current basic block and check the correctness of the flow
• While exiting a basic block, set the run-time signature according to the next basic block
[Violante, M. Politecnico di Torino, 2006]
CFCSS
• Introduced by McCluskey, Stanford University, USA
• Used for detecting control-flow errors
Based on:
• Modifications to the program source code
• Use logic operations to track control-flow execution
Basic approach:
• Assign a signature to each program’s basic block
• During program execution, a run-time signature is continuously updated
• While entering a basic block:
  • The run-time signature is updated
  • The consistency of the run-time signature with a pre-defined one is evaluated
[Violante, M. Politecnico di Torino, 2006]
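The signature update can be sketched for a straight-line graph of three blocks. This is a minimal illustration under assumptions: the signature values, the `diff` table (the XOR of each block's signature with its predecessor's) and the `run_trace` driver are hypothetical, and branch fan-in, which the full technique handles with an extra adjusting signature, is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 3

/* Hypothetical compile-time signatures for the graph B0 -> B1 -> B2. */
static const uint8_t sig[NUM_BLOCKS]  = {0x1, 0x2, 0x4};
/* diff[i] = sig[pred(i)] XOR sig[i]; diff[0] is unused (B0 is the entry block). */
static const uint8_t diff[NUM_BLOCKS] = {0x0, 0x1 ^ 0x2, 0x2 ^ 0x4};

/* Executes the blocks listed in `trace` (which starts at B0) and returns
 * false at the first signature mismatch. */
bool run_trace(const int *trace, int len) {
    uint8_t g = sig[0];                  /* run-time signature at program entry */
    for (int i = 1; i < len; i++) {
        int b = trace[i];
        g ^= diff[b];                    /* update on entering block b */
        if (g != sig[b]) return false;   /* illegal transition detected */
    }
    return true;
}
```

Each block entry costs one XOR and one comparison, which is why the technique is listed as low cost. The comparison, however, adds a conditional jump to the error handler in every block.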
CFCSS
• Low-cost technique:
  • Logic operations are not time consuming
  • Few operations are added, resulting in low code penalty
• Error detection is very critical: it changes the program’s graph by introducing a jump
[Violante, M. Politecnico di Torino, 2006]
YACCA
• Introduced by Massimo Violante, Politecnico di Torino, Italy
• Used for detecting control-flow errors
Based on:
• Modifications to the program source code
• Use logic operations to track control-flow execution
[Violante, M. Politecnico di Torino, 2006]
YACCA
Basic principle:
• Two signatures are assigned to each program’s basic block (enter and exit signatures, Bx1, Bx2)
• A run-time signature is constantly updated
• When entering a basic block:
  • Check the correctness of the execution
  • Set the run-time signature to the enter one
• When exiting a basic block:
  • Check the correctness of the execution
  • Set the run-time signature to the exit one
[Violante, M. Politecnico di Torino, 2006]
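The enter/exit discipline can be sketched with plain comparisons. This is a deliberately simplified model and not the published encoding, which uses logic operations rather than equality tests; the signature values, the `expected_prev` table and the `run_trace` driver are hypothetical, and every block is assumed to have a single legal predecessor.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_BLOCKS 3

/* Hypothetical enter (Bx1) and exit (Bx2) signatures for B0 -> B1 -> B2. */
static const int enter_sig[NUM_BLOCKS] = {11, 21, 31};
static const int exit_sig[NUM_BLOCKS]  = {12, 22, 32};
/* Exit signature that must be current when a block is entered;
 * -1 marks the entry block. */
static const int expected_prev[NUM_BLOCKS] = {-1, 12, 22};

/* Executes the blocks listed in `trace`; returns false on the first failed check. */
bool run_trace(const int *trace, int len) {
    int code = -1;                                   /* run-time signature */
    for (int i = 0; i < len; i++) {
        int b = trace[i];
        if (code != expected_prev[b]) return false;  /* check on entry */
        code = enter_sig[b];
        /* ... the body of block b would execute here ... */
        if (code != enter_sig[b]) return false;      /* check on exit */
        code = exit_sig[b];
    }
    return true;
}
```

In this simulation the exit check can never fail, because nothing can jump into the middle of a block. In real code it is the check that catches a faulty jump landing inside a block without passing through its entry.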
YACCA
• Low-cost technique:
  • Logic operations are not time consuming
  • Few operations are added, resulting in low code penalty
  • The program’s graph is not modified
[Violante, M. Politecnico di Torino, 2006]
Comparison
(figure: comparison of the techniques; content not recoverable)
[Violante, M. Politecnico di Torino, 2006]
Some figuresSome figures
• Experimental setup• Experimental setup• Matrix multiplication programat u t p cat o p og a• Intel 8051 processor• Hardware-accelerated fault injection in:
C d t• Code segment• Data segmentData segment• Processor’s registers
• SEU fault model
Luigi Carro IESS - Schloβ Langenargen, Germany, September 15th, 2009 125
[Violante, M. Politecnico di Torino, 2006]
Some FiguresSome Figures
• System failures due to SEUs in the• System failures due to SEUs in thecode segment:code seg e t• Un-hardened program: 1.0• ABFT: 4x better
Luigi Carro IESS - Schloβ Langenargen, Germany, September 15th, 2009 126
[Violante, M. Politecnico di Torino, 2006]
Some Figures
• System failures due to SEUs in the data segment:
  • Un-hardened program: 1.0
  • ABFT: 6x better
  • ED4I: 29x better
  • SWIFT+YACCA: ∞ better (0 system failures observed)
[Violante, M. Politecnico di Torino, 2006]
Some Figures
• System failures due to SEUs in the processor’s registers:
  • Un-hardened program: 1.0
  • ABFT: 9x better
  • ED4I: 13x better
  • SWIFT+YACCA: 15x better
[Violante, M. Politecnico di Torino, 2006]
Some Figures
• Time increase:
  • Un-hardened program: 1.0
  • ABFT: 3.8x
  • ED4I: 1.9x
  • SWIFT+YACCA: 3.5x
• Code increase:
  • Un-hardened program: 1.0
  • ABFT: 2.3x
  • ED4I: 1.6x
  • SWIFT+YACCA: 3.9x
[Violante, M. Politecnico di Torino, 2006]
Some Figures
• Data increase:
  • Un-hardened program: 1.0
  • ABFT: 2.0x
  • ED4I: 1.9x
  • SWIFT+YACCA: 2.2x
[Violante, M. Politecnico di Torino, 2006]
Hybrid SIFT
• Software-only SIFT may introduce an unacceptable time penalty
• Moving some tasks into hardware may reduce this overhead
• Masking, detection, location, and recovery implemented in software and in hardware
• Possible approaches:
  • Lockstep execution
  • Watchdogs
  • Lightweight watchdogs
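As an illustration of the first approach, the sketch below models lockstep execution in software. It is only a model under stated assumptions: in a hybrid scheme the comparator would be a hardware block, and the names `lockstep_run`, `accumulate`, and the `fault` parameter (used only to simulate a transient upset in one replica) are hypothetical.

```python
def lockstep_run(step, state_a, state_b, n_steps, fault=None):
    """Run two replicas of `step` in lockstep, comparing state after every step.

    `fault` is an optional (step_index, corrupt_fn) pair that corrupts
    replica B after the given step, to simulate a transient upset.
    Returns the index of the first step at which the replicas diverge,
    or None if they agree for all n_steps.
    """
    for i in range(n_steps):
        state_a = step(state_a)
        state_b = step(state_b)
        if fault is not None and fault[0] == i:
            state_b = fault[1](state_b)
        if state_a != state_b:
            return i  # mismatch detected by the comparator
    return None


def accumulate(state):
    """Toy workload: running sum of 1, 2, 3, ..."""
    total, k = state
    return (total + k, k + 1)
```

Without an injected fault, `lockstep_run(accumulate, (0, 1), (0, 1), 10)` reports no divergence. Flipping one bit of the running sum in replica B at step 4 is detected at that same step, which is the short detection latency that makes lockstep attractive despite its duplication cost.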
Recently proposed solutions (5 of 6)
Working at system (software and hardware) level
SWAT: SoftWare Anomaly Treatment
[Figure: stack of abstraction levels: System Level, Algorithm Level, Architecture Level, Circuit Level, Component Level, Technology Level]
Li, M.-L.; Ramachandran, P.; Sahoo, S. K.; Adve, S.; Adve, V.; and Zhou, Y. Understanding the propagation of hard errors to software and implications for resilient system design. In Proc. of the 13th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 2008.
Main concepts
• Detection of errors when they affect software behavior is preferable to detection at hardware level
• SWAT exploits this concept to achieve low cost error detection for cores at software level, by checking:
  o Fatal exceptions
  o Program crashes or hangs
  o Unusually high amount of operating system activity
• Some hardware errors that do not manifest themselves in software behaviors are not detected by SWAT
• SWAT suffers from the drawbacks of high level error detection mechanisms that will be discussed later
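The three symptoms listed above can be illustrated with a toy monitor. This is a sketch under stated assumptions and not SWAT's implementation: the function `monitor`, the event name `"syscall"`, and both thresholds are hypothetical, and a task is modelled as a Python generator that yields one event per executed step.

```python
def monitor(task, max_steps=1000, max_os_calls=50):
    """Classify one run of `task` by software-visible symptoms.

    `task` is a generator function; each yielded value is an event name.
    Events named "syscall" are counted as operating system activity.
    """
    steps = 0
    os_calls = 0
    try:
        for event in task():
            steps += 1
            if event == "syscall":
                os_calls += 1
                if os_calls > max_os_calls:
                    return "high-os-activity"
            if steps > max_steps:
                return "hang"  # step budget exhausted without termination
    except Exception:
        return "fatal-exception"
    return "ok"


def healthy():
    for _ in range(10):
        yield "compute"


def crashing():
    yield "compute"
    raise ZeroDivisionError("simulated fatal trap")


def hanging():
    while True:
        yield "compute"


def thrashing():
    while True:
        yield "syscall"
```

Note that a run whose data is silently corrupted but which terminates normally is still classified as `"ok"`, which mirrors the limitation stated above: errors that never surface as anomalous software behavior go undetected.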
Recently proposed solutions (6 of 6)
Working at lower levels to detect errors and at higher system levels to correct them.
[Figure: layered SoC model, from top to bottom: Application Layer, Middleware/Architectural Layer, Configurable/Programming Layer, Register/Logic Layer, Technology Layer; arrows between layers labeled "Error Reports" and "Reconfiguration"]
Albrecht, C.; Koch, R.; Pionteck, T.; and Glösekötter, P. Towards a Flexible Fault-Tolerant System-on-Chip. 22nd International Conference on Architecture of Computing Systems
• SoC is divided into several layers
• Each layer has specific fault tolerance mechanisms:
  o Detection is cheaper at lower layers
  o Correction is better performed at higher layers
• Lower layers notify upper layers when an error is detected
• Upper layers send reconfiguration information to lower layers according to application requirements
• Key issue: interfaces between layers to report errors and inform about the needed level of reliability according to the application
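The two interfaces between layers (error reports going up, reconfiguration going down) can be sketched as follows. This is a hypothetical model, not the interface defined by Albrecht et al.: the `Layer` class, its method names, and the choice of the Middleware/Architectural layer as the one that performs correction are all assumptions made for illustration.

```python
class Layer:
    def __init__(self, name, can_correct=False):
        self.name = name
        self.can_correct = can_correct  # assumption: some layers correct, others only detect
        self.upper = None
        self.lower = None
        self.detection_enabled = True
        self.log = []

    def stack_on(self, lower):
        """Place this layer directly above `lower`."""
        self.lower = lower
        lower.upper = self
        return self

    def report_error(self, description, origin=None):
        """Pass an error report upward until a layer able to correct it is reached.

        Returns the name of the correcting layer, or None if no layer handled it.
        """
        origin = origin or self.name
        self.log.append(("error", origin, description))
        if self.can_correct:
            return self.name
        if self.upper is None:
            return None
        return self.upper.report_error(description, origin)

    def reconfigure(self, detection_enabled):
        """Push a reconfiguration request from this layer down to all lower layers."""
        self.detection_enabled = detection_enabled
        self.log.append(("reconfig", detection_enabled))
        if self.lower is not None:
            self.lower.reconfigure(detection_enabled)


def build_stack():
    """Build the five-layer stack, bottom layer first."""
    names = [
        "Technology",
        "Register/Logic",
        "Configurable/Programming",
        "Middleware/Architectural",
        "Application",
    ]
    layers = [Layer(n, can_correct=(n == "Middleware/Architectural")) for n in names]
    for lower, upper in zip(layers, layers[1:]):
        upper.stack_on(lower)
    return layers
```

In this model an upset detected in the Technology layer travels up three layers before it is handled, and a request issued by the Application layer (for example, disabling detection for a non-critical task) reaches every layer below it.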
Sample roles of layers
• Technology layer
  o Built-in current sensors detect transient upsets
  o Upper layer can configure detection capabilities
• Register/Logic layer
  o EDAC used to harden memories
  o TMR used to harden logic
  o Upper layer can enable/disable detection mechanisms
• Configuration/Programming layer (in reconfigurable platforms)
  o Reconfiguration can be used to disable faulty modules
  o Periodical relocation of active modules reduces degradation
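As a concrete instance of EDAC for memory words, the sketch below implements the classic Hamming(7,4) single-error-correcting code. The slides do not say which code is meant, so this choice is an assumption; the function names are hypothetical. Bits are stored as lists of 0/1 values, with codeword positions numbered 1 to 7 and parity bits at positions 1, 2, and 4.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(word):
    """Correct up to one flipped bit.

    Returns (data_bits, error_position), where error_position is the
    1-based position of the corrected bit, or 0 if no error was found.
    """
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s4 = w[3] ^ w[4] ^ w[5] ^ w[6]
    position = s1 + 2 * s2 + 4 * s4  # the syndrome is the index of the faulty bit
    if position:
        w[position - 1] ^= 1
    return [w[2], w[4], w[5], w[6]], position
```

A single upset in any of the seven stored bits is located and corrected; two simultaneous upsets in the same word would be miscorrected by this code, which is why multiple-bit upsets are a concern for memory protection schemes of this kind.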
Sample roles of layers
• Middleware/Architectural layer
  o Applies well-known redundancy techniques such as TMR at component level
  o Redundant modules designed independently to allow detection of SEUs and design errors
  o Test mechanisms can be used to check modules at run time
  o Checkpoints can be used to allow error recovery
• Application layer
  o Almost everything can be used to improve reliability at this level
  o Software implemented TMR, EDAC and other techniques can be used
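Software-implemented TMR, mentioned for the application layer, reduces to running a computation three times and voting. A minimal sketch, assuming integer results and at most one faulty replica; the names `vote` and `tmr` and the `upset` parameter (which only simulates an SEU in one replica's result) are hypothetical.

```python
def vote(a, b, c):
    """Bitwise majority of three integer results."""
    return (a & b) | (a & c) | (b & c)


def tmr(fn, x, upset=None):
    """Run `fn(x)` three times and return (voted_result, disagreement_seen).

    `upset` is an optional (replica_index, xor_mask) pair used to flip
    bits in one replica's result, simulating a single event upset.
    """
    results = [fn(x) for _ in range(3)]
    if upset is not None:
        index, mask = upset
        results[index] ^= mask
    voted = vote(*results)
    disagreement = any(r != voted for r in results)
    return voted, disagreement
```

With `fn = lambda v: v * v` and `x = 12`, the voted result is 144 whether or not one replica is corrupted; the second return value tells the caller that a replica disagreed, which an upper layer could use as an error report.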
Outline
• Introduction: concepts and definitions
• Motivation: new challenges imposed by future technologies
• Radiation induced faults: the major challenges
• Existing mitigation techniques vs. the new scenario
• Desired properties of new radiation induced faults mitigation techniques
• Recent solutions working at different abstraction levels to deal with transient faults
• Conclusions
Conclusions
• New low-cost mitigation techniques, providing error detection and error correction, must be developed
• Circuit level approaches can be better than TMR, but still impose significant area and power overheads
• Algorithm level mitigation is a better approach, but it is hard to generalize and automate
High level error detection: pros and cons
[Sorin, 2009]
• Checking at a higher level:
  • reduces hardware costs
  • reduces the number of false positives
  • is necessary anyway for certain types of errors
• However:
  • provides little diagnostic information (type and location)
  • longer and potentially unbounded error detection latency
  • recovery process may be more complex
Final Remark
• There is NO silver bullet!
• Combine hardware and software based techniques at different levels
• Leverage the specific strengths of each technique at each level.
Copy of slides available at http://www.inf.ufrgs.br/~calisboa/IESS2009
References (in order of appearance)
• BLOME, J. A., GUPTA, S., FENG, S., and MAHLKE, S. Cost-efficient soft error protection for embedded microprocessors. In: INTERNATIONAL CONFERENCE ON COMPILERS, ARCHITECTURE AND SYNTHESIS FOR EMBEDDED SYSTEMS, CASES 2006, 2006, Proceedings… Los Alamitos, USA: IEEE Computer Society, 2006, p. 421-431.
• DODD, P. et al. Production and propagation of single-event transients in high-speed digital logic ICs. IEEE Transactions on Nuclear Science, Los Alamitos, USA: IEEE Computer Society, 2004, v. 51, n. 6 (part 2), p. 3278-3284.
• FERLET-CAVROIS, V. et al. Statistical analysis of the charge collected in SOI and bulk devices under heavy ion and proton irradiation: implications for digital SETs. IEEE Transactions on Nuclear Science, Los Alamitos, USA: IEEE Computer Society, 2006, v. 53, n. 6 (part 1), p. 3242-3252.
• ROSSI, D. et al. Multiple transient faults in logic: an issue for next generation ICs? In: IEEE INTERNATIONAL SYMPOSIUM ON DEFECT AND FAULT TOLERANCE IN VLSI SYSTEMS, 20., DFT 2005, 2005, Monterey, USA. Proceedings… Los Alamitos, USA: IEEE Computer Society, 2005, p. 352-360.
• ANGHEL, L.; NICOLAIDIS, M. Cost reduction and evaluation of a temporary faults detection technique. In: DESIGN, AUTOMATION AND TEST IN EUROPE CONFERENCE, 2000, DATE 2000, Paris, FRA. Proceedings… New York, USA: ACM Press, 2000, p. 591-598.
• NIEUWLAND, A.; JASAREVIC, S.; JERIN, G. Combinational logic soft error analysis and protection. In: IEEE INTERNATIONAL ON-LINE TEST SYMPOSIUM, 12., IOLTS 2006, Lake of Como, ITA. Proceedings… Los Alamitos, USA: IEEE Computer Society, 2006. p. 99-104.
• SORIN, D. J. Fault Tolerant Computer Architecture. USA: Morgan & Claypool, 2009.
• PRADHAN, D. Fault-tolerant computer system design. Upper Saddle River, USA : Prentice-Hall, 1995.
• BAUMANN, R. Soft errors in advanced computer systems. IEEE Design and Test of Computers, New York, USA: IEEE Computer Society, 2005, v. 22, n. 3, p. 258-266.
• HAMMING, R. Error Detecting and Error Correcting Codes. The Bell System Technical Journal, 1950, v. 29, n. 2, p. 147-160.
• ALMUKHAIZIM, S. and MAKRIS, Y. Fault Tolerant Design of Combinational and Sequential Logic based on a Parity Check Code. In: Proceedings of the 18th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT 2003), IEEE Computer Society, Los Alamitos, CA, October 2003, p. 344-351.
• FREIVALDS, R. Fast probabilistic algorithms. In: FREIVALDS, R. Mathematical Formulations of CS. New York, USA: Springer-Verlag, 1979. p. 57-69. (Lecture Notes in Computer Science).
• LISBOA, C. A., ERIGSSON, M. I., and CARRO, L. System level approaches for mitigation of long duration transient faults in future technologies. In: IEEE EUROPEAN TEST SYMPOSIUM, 12., ETS 2007, Freiburg, DEU. Proceedings… Los Alamitos, USA: IEEE Computer Society, 2007, p. 165-170.
• LISBOA, C.; ARGYRIDES, C.; PRADHAN, D.; and CARRO, L. Algorithm level fault tolerance: a technique to cope with long duration transient faults in matrix multiplication algorithms. In: IEEE VLSI TEST SYMPOSIUM, 26., VTS 2008, San Diego, USA. Proceedings… [S.l.: s.n.], 2008.
• LISBOA, C. et al. Invariant checkers: an efficient low cost technique for run-time transient errors detection. In: IEEE INTERNATIONAL ON-LINE TESTING SYMPOSIUM, 15., IOLTS 2009, Sesimbra, POR. Proceedings… [S.l.: s.n.], 2009.
• REBAUDENGO, M. et al. Soft-error detection through software fault-tolerance techniques. In: IEEE INTERNATIONAL SYMPOSIUM ON DEFECT AND FAULT TOLERANCE IN VLSI SYSTEMS, 14., DFT1999, 1999, Albuquerque, USA. Proceedings… New York, USA: IEEE Computer Society, 1999, p. 210-218.
• GOLOUBEVA, O. et al. Soft error detection using control flow assertions. In: INTERNATIONAL SYMPOSIUM ON DEFECT AND FAULT TOLERANCE, 18., 2003, Boston, USA. Proceedings… Los Alamitos, USA: IEEE Computer Society, 2003, p. 581-588.
• BENSO, A. et al. PROMON: a profile monitor of software applications. In: IEEE WORKSHOP ON DESIGN AND DIAGNOSTICS OF ELECTRONIC CIRCUITS AND SYSTEMS, 8., DDECS05, Sopron, HUN. Proceedings… New York, USA: IEEE Computer Society, 2005, p. 81-86.
• [DAIKON] ERNST, M.; COCKRELL, J.; GRISWOLD, W. Dynamically discovering likely program invariants to support program evolution. IEEE Transactions on Software Engineering. New York, USA: IEEE Computer Society, 2001, v. 27, n. 2, p. 99-123.
• KASTENSMIDT, F.; CARRO, L.; REIS, R. Fault-Tolerance Techniques for SRAM-Based FPGA. New York, USA: Springer, 2006, 183 p.
• [ABFT] HUANG, K.; ABRAHAM, J. Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers. New York, USA: IEEE Computer Society, 1984, v. C-33, n. 6, p. 518-528.
• [EDDI] OH, N., SHIRVANI, P. P., McCLUSKEY, E. J. EDDI: Error Detection by Duplicated Instructions. IEEE Transactions on Reliability, IEEE Reliability Society, 2002, v. 51, n. 1, p. 63-75.
• [ED4I] OH, N.; MITRA, S.; McCLUSKEY, E. J. ED4I: error detection by diverse data and duplicated instructions. IEEE Transactions on Computers, IEEE Computer Society, 2002, v. 51, n. 2, p. 180-199.
• [ECCA] ALKHALIFA, Z. et al. Design and evaluation of system-level checks for on-line control flow error detection. IEEE Transactions on Parallel and Distributed Systems, New York, USA: IEEE Computer Society, 1999, v. 10, n. 6, p. 627-641.
• [EDDI] OH, N., SHIRVANI, P. P., McCLUSKEY, E.J. EDDI: Error Detection by Duplicated Instructions. IEEE Transactions on Reliability, IEEE Reliability Society, 2002, v. 51, n. 1, p. 111-122.
• [YACCA] VIOLANTE, M. Dependability assurance by design. Internal report, Politecnico di Torino, Italy, available at http://www.cad.polito.it/~sonza/diistp03/lucidi/2007/03-assurance.pdf.
• [SWAT] LI, M.-L.; Ramachandran, P.; Sahoo, S. K.; Adve, S.; Adve, V.; and Zhou, Y. Understanding the propagation of hard errors to software and implications for resilient system design. In Proc. of the 13th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 2008.
• ALBRECHT, C. et al. Towards a Flexible Fault-Tolerant System-on-Chip. In: INTERNATIONAL CONFERENCE ON ARCHITECTURE OF COMPUTING SYSTEMS, 22., 2009, ARC 2009, Karlsruhe, GER. Proceedings… Berlin, GER: VDE Verlag GMBH, 2009, p. 83-90.