A Study of the Impact of Temperature on FPGA-based TMR Designs
Amr Ahmadain
Dept. of Electrical and Computer Engineering and Computer Science
University of Cincinnati, Cincinnati, OH
Karen Tomko
Dept. of Electrical and Computer Engineering and Computer Science
University of Cincinnati, Cincinnati, OH
MAPLD 2005 (Ahmadain, P217/MAPLD2005)
Presentation Outline
• Overview: The Big Picture
• Study Motivations
• Study Objectives
• Solution Approach
• Study Assumptions
• Derivation of the Reliability Model
• Implementation of the Reliability Model
• Conclusions and Future Work
• References
Overview: The Big Picture
• In current high-performance sub-90 nm technology, leakage power
  • is rising dramatically over time with ever-shrinking feature sizes [1]
  • increases exponentially with junction temperature
  • is also electro-thermally coupled with junction temperature [2]
• Rising junction temperature, in turn, leads to
  • an exponential increase in leakage power
  • an exponential reduction in the Mean Time to Failure (MTTF) [3]
Overview: The Big Picture (cont’d.)
[Figure: Static and dynamic power dissipation versus physical gate length (nm). Data: International Technology Roadmap for Semiconductors (ITRS) 2001, 2002. Courtesy of: “Leakage Current: Moore’s Law Meets Static Power,” Computer, December 2003, IEEE Computer Society.]
Study Motivations
• FPGA-based Triple Modular Redundancy (TMR) designs result in
  • a considerable increase in total design area
  • a reduction in circuit performance
  • a tripling of the total power dissipation [4]
• Tripling the static power dissipation feeds back into junction temperature
• Junction temperature, in turn, feeds back into reliability
Study Motivations (cont’d.)
• Recent work has been done to alleviate the cost of a full TMR design
  • Selective TMR (STMR), which applies TMR only to sensitive sub-circuits [5]
  • Partial TMR, which applies TMR only to sensitive design components [6]
• The above work indicates an increased awareness that a full TMR-based design is not always the “perfect solution”
Study Motivations (cont’d.)
• Reliability prediction methods for electronic systems, such as [7] and [8], have traditionally considered
  • the effect of ONLY steady-state temperature
  • a constant failure rate during the device’s useful life
  • varying failure rates during the infant mortality and wear-out phases
[Figure: The Bathtub Curve. Failure rate versus lifetime, showing the infant mortality phase, the normal (useful) life with a “constant” failure rate, and the wear-out phase.]
Study Motivations (cont’d.)
• The assumption of a constant failure rate and steady-state temperature may cause errors in reliability prediction [9]
  • The failure rate might change even during the useful life of a device
  • Temperature itself might also vary with time
• These errors could lead to a pessimistic prediction instead of a realistic one
• There is a need for a reliability prediction model that accurately captures the evolution of the system with time and temperature at each phase of its lifetime, to avoid such errors
Study Objectives
• Phase I (Current Phase)
  • Develop a time- and temperature-dependent reliability model for an FPGA-based TMR design as the foundation of the prediction framework
• Phase II
  • Build a reliability prediction framework where the time- and temperature-dependent reliability function is predicted using real data
  • Evaluate the impact of FPGA-based TMR designs on junction temperature in leakage-dominant technologies and the overall influence on system reliability
Solution Approach
• Build a non-stationary (non-homogeneous) Markov chain to model the TMR system states and transitions, so that the assumption of a constant failure rate and steady-state temperature can be relaxed
• A non-homogeneous Markov chain is a chain whose transition probabilities are functions of time [10]
• Calculate the TMR system reliability as a function of the failure rate
Notation
• n: discrete time step
• t: continuous time unit
• α_m: Weibull distribution shape parameter of a TMR module
• α_v: Weibull distribution shape parameter of the majority voter
• R(n, s(n)): reliability function that depends on both time and temperature, where temperature is itself a function of time
• z(n, s(n)): hazard rate function that depends on both time and temperature, where temperature is itself a function of time
• s(n): stress (temperature) as a function of time
• e_A: activation energy
• B (= e_A/K_B): parameter of the Arrhenius relationship associated with the activation energy
• C: parameter of the Arrhenius relationship that depends on product geometry, fabrication methods, and other factors
• p_n^U: one-step transition probability matrix that depends on the time n
• U/D: set of Up/Down states
$MTTF = MTTF_{ref}\, e^{e_A / (K_B T)}$
Study Assumptions
• The times to failure of the system modules are statistically independent
• The majority voter has a different hazard rate (z_v) than that assumed for the TMR modules and hence a different shape parameter (α_v)
• Module failure rates are time- and temperature-dependent
[Figure: Assumed TMR configuration, with Module M1, Module M2, and Module M3 feeding a Majority Voter.]
Study Assumptions (contd.)
• We use the Arrhenius relationship to model the relationship between life and temperature
  $MTTF = C\, e^{e_A / (K_B T)}$
• The Arrhenius-Weibull distribution is assumed to be the life distribution of the TMR system modules [10], where the probability density function (PDF) is
  $f(t, s(t)) = \frac{\alpha}{C e^{B/s(t)}} \left( \frac{t}{C e^{B/s(t)}} \right)^{\alpha - 1} \exp\left[ -\left( \frac{t}{C e^{B/s(t)}} \right)^{\alpha} \right]$
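The density above can be sketched in a few lines of Python for the special case of a constant temperature. This is only an illustration, not the authors' implementation: the function names are hypothetical, C and B are taken from the Model Constants slide, and the trapezoid rule is used purely as a sanity check that the density integrates to 1 − R(t).

```python
import math

C = 2.4e-9      # Arrhenius parameter C (Model Constants slide)
B = 8117.82     # B = eA / KB in kelvin (Model Constants slide)

def scale(temp_k):
    """Arrhenius life relationship C * exp(B / T) for T in kelvin."""
    return C * math.exp(B / temp_k)

def pdf(t, temp_k, alpha):
    """Arrhenius-Weibull density f(t, s) at constant temperature s = temp_k."""
    eta = scale(temp_k)
    return (alpha / eta) * (t / eta) ** (alpha - 1) * math.exp(-((t / eta) ** alpha))

def reliability(t, temp_k, alpha):
    """Survival probability R(t, s) = exp(-(t / eta)^alpha)."""
    return math.exp(-((t / scale(temp_k)) ** alpha))

def integrate_pdf(t_end, temp_k, alpha, steps=20000):
    """Trapezoid-rule integral of the density on [0, t_end].

    Sanity check only; it assumes alpha >= 1 so that the density is finite at t = 0.
    """
    h = t_end / steps
    total = 0.5 * (pdf(0.0, temp_k, alpha) + pdf(t_end, temp_k, alpha))
    total += sum(pdf(i * h, temp_k, alpha) for i in range(1, steps))
    return total * h

# Compare the integrated density with 1 - R(t) at T = 423 K, alpha = 2.0.
print(integrate_pdf(1.0, 423.0, 2.0), 1.0 - reliability(1.0, 423.0, 2.0))
```

For a time-varying stress s(t), the same functions would be called with temp_k replaced by s(t) at each evaluation point.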
Derivation of the Reliability Model
Step 1: Definition of system states
The system has the following three states:
• State ‘0’: System (all modules) functional
• State ‘1’: One module failed and two modules functional
• State ‘2’: System failure; two modules failed or the voter failed
• States ‘0’ and ‘1’ are the Up states and state ‘2’ is the Down state
Derivation of the Reliability Model (contd.)
Step 2: Determining state transition probabilities
[Figure: Continuous-time State Transition Diagram over states 0, 1, and 2; the surviving edge labels include A(n), C(n), and z_v(t, s(t)).]
[Figure: Discrete-time State Transition Diagram over states 0, 1, and 2, with transitions 0→1: A(n), 0→2: B(n), 1→2: C(n), and self-loops 0→0: 1 - [A(n) + B(n)], 1→1: 1 - C(n), 2→2: 1.0.]
The hazard function of the Arrhenius-Weibull distribution [11] is
$z(t, s(t)) = \frac{\alpha}{C e^{B/s(t)}} \left( \frac{t}{C e^{B/s(t)}} \right)^{\alpha - 1}$
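As a small sketch of how this hazard function behaves when the stress changes with time, the Python below evaluates z(t, s(t)) along the progressive stress profile s(n) = an + 273 listed on the stress-test slide. The helper names are hypothetical, and C and B are the values from the Model Constants slide.

```python
import math

C = 2.4e-9      # Arrhenius parameter C (Model Constants slide)
B = 8117.82     # B = eA / KB in kelvin (Model Constants slide)

def progressive_stress(n, slope):
    """Temperature in kelvin for the progressive stress test: a * n + 273."""
    return slope * n + 273.0

def hazard(t, temp_k, alpha):
    """Arrhenius-Weibull hazard z(t, s) evaluated at temperature s = temp_k."""
    eta = C * math.exp(B / temp_k)
    return (alpha / eta) * (t / eta) ** (alpha - 1)

def hazard_profile(times, slope, alpha):
    """Hazard at each time point, with the temperature following the ramp."""
    return [hazard(t, progressive_stress(t, slope), alpha) for t in times]

# Hazard along a ramp with slope 1.0 and shape parameter 2.0.
for t, z in zip([10.0, 20.0, 40.0], hazard_profile([10.0, 20.0, 40.0], 1.0, 2.0)):
    print(t, z)
```

With α = 1 the time factor drops out and the hazard depends on temperature alone, which is the constant-failure-rate case that the study sets out to relax.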
Derivation of the Reliability Model (contd.)
Step 3: Calculation of the reliability function
The reliability of a non-homogeneous discrete-time Markov chain (NHDTMC) at time n, as given in [13], is expressed as
$R(n) = \alpha_1 \left( \prod_{k=0}^{n-1} p_k^{U} \right) \mathbf{1}_v$
where
$\alpha_1 = [\, 1 \;\; 0 \,]$ and $\mathbf{1}_v = [\, 1 \;\; 1 \,]^{T}$
Here α_1 is the initial probability vector over the Up states (not a Weibull shape parameter). Substituting the state transition probability matrices into the above equation and carrying out the necessary matrix multiplications yields the reliability of the system, R(n), at time n.
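The matrix product in Step 3 can be sketched as follows. This is an illustration under stated assumptions: A, B, and C are passed in as arbitrary callables of the step index, because their expressions in terms of the hazard function are not reproduced here, and the numeric values in the usage lines are made-up placeholders rather than results from the study.

```python
def up_block(a, b, c):
    """2x2 block of one-step transition probabilities among Up states 0 and 1."""
    return [[1.0 - (a + b), a],
            [0.0, 1.0 - c]]

def reliability(n, a_of, b_of, c_of):
    """R(n) = alpha_1 * (product of p_k^U for k = 0..n-1) * column of ones.

    a_of, b_of, c_of are callables returning A(k), B(k), C(k) at step k.
    """
    row = [1.0, 0.0]  # alpha_1: the system starts in state 0
    for k in range(n):
        p = up_block(a_of(k), b_of(k), c_of(k))
        # Row vector times the 2x2 Up block.
        row = [row[0] * p[0][0] + row[1] * p[1][0],
               row[0] * p[0][1] + row[1] * p[1][1]]
    return row[0] + row[1]  # multiplying by the column of ones sums the entries

# Hypothetical, slowly growing transition probabilities (placeholders only).
for n in (0, 10, 50):
    print(n, reliability(n,
                         lambda k: 0.010 + 0.0002 * k,
                         lambda k: 0.001,
                         lambda k: 0.007 + 0.0002 * k))
```

Because each step uses its own matrix p_k^U, the product cannot be collapsed into a single matrix power, which is the practical difference from a homogeneous chain.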
Implementation of the Reliability Model
• The reliability model has been implemented for three types of temperature stress
  • Steady-state temperature stress
  • Cyclic stress
  • Progressive stress
• The reliability and failure rate functions have been implemented using numerical integration techniques; the Gauss-Kronrod quadrature method has been used [14]
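To make the quadrature step concrete, here is a minimal, non-adaptive sketch of a 15-point Gauss-Kronrod rule (with the embedded 7-point Gauss rule used as an error estimate), applied to the cumulative hazard at a constant temperature. It is not the authors' code. The node and weight values are the ones tabulated in QUADPACK, the example parameters (373 K, α = 2.0) are arbitrary choices, and the relation R(t) = exp(−∫ z) is the standard link between hazard and reliability rather than something stated on the slides.

```python
import math

# Positive abscissae and weights of the 15-point Kronrod rule on [-1, 1],
# plus the weights of the embedded 7-point Gauss rule (QUADPACK tables).
XGK = [0.991455371120812639, 0.949107912342758525, 0.864864423359769073,
       0.741531185599394440, 0.586087235467691130, 0.405845151377397167,
       0.207784955007898468, 0.0]
WGK = [0.022935322010529225, 0.063092092629978553, 0.104790010322250184,
       0.140653259715525919, 0.169004726639267903, 0.190350578064785410,
       0.204432940075298892, 0.209482141084727828]
WG = [0.129484966168869693, 0.279705391489276668, 0.381830050505118945,
      0.417959183673469388]  # Gauss weights for XGK[1], XGK[3], XGK[5], XGK[7]

def gauss_kronrod_15(f, a, b):
    """Return (Kronrod estimate, |Kronrod - Gauss| error estimate) on [a, b]."""
    half = 0.5 * (b - a)
    mid = 0.5 * (a + b)
    fc = f(mid)
    kronrod = WGK[7] * fc
    gauss = WG[3] * fc
    for i in range(7):
        dx = half * XGK[i]
        pair = f(mid - dx) + f(mid + dx)
        kronrod += WGK[i] * pair
        if i % 2 == 1:  # odd indices are also Gauss nodes
            gauss += WG[i // 2] * pair
    return half * kronrod, abs(half * (kronrod - gauss))

# Example: cumulative hazard of the Arrhenius-Weibull model at constant 373 K.
C, B, ALPHA, TEMP_K = 2.4e-9, 8117.82, 2.0, 373.0
ETA = C * math.exp(B / TEMP_K)

def hazard(t):
    return (ALPHA / ETA) * (t / ETA) ** (ALPHA - 1)

cum_hazard, err = gauss_kronrod_15(hazard, 0.0, 5.0)
print(cum_hazard, err, math.exp(-cum_hazard))
```

A production implementation would normally apply the rule adaptively, subdividing the interval wherever the error estimate is too large; that matters for α < 1, where the hazard is singular at t = 0.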
Implementation of the Reliability Model (contd.)
• Experimental Setup
  • Experiments have been designed based on changing the values of two sets of parameters
    • Stress test-related parameters
    • Probability distribution-related parameters; in this study, the parameter is the Weibull distribution shape parameter, α
  • Values of the stress test-related parameters have been chosen to span minimum, typical, and maximum operational stress levels
  • α takes the following values: 0.8, 1.0, 1.4, 2.0
Implementation of the Reliability Model (contd.)
Parameters and Mathematical Functions of Stress Tests

Stress Test          | Parameter   | Parameter Values | Mathematical Function
Steady-State Stress  | Temperature | 328, 373, 423 K  | T = constant
Cyclical Stress      | Period      | π, 2π, 4π        | 328 × sin(kn)
Progressive Stress   | Slope       | 0.25, 0.5, 1.0   | an + 273

Model Constants

Constant               | Value
Activation energy (eA) | 0.7 eV
C                      | 2.4 × 10^-9
B = eA/KB              | 8117.82
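The constants above can be checked with a little arithmetic. The sketch below (hypothetical helper names, not part of the original deck) recovers the Boltzmann constant implied by B = eA/KB and evaluates the Arrhenius life term C·e^(B/T) at the three steady-state temperatures. The listed B corresponds to KB ≈ 8.623 × 10^-5 eV/K, slightly above the CODATA value of about 8.617 × 10^-5 eV/K, which would give B ≈ 8123.

```python
import math

E_A = 0.7           # activation energy in eV (Model Constants)
B_SLIDE = 8117.82   # B = eA / KB as listed, in kelvin
C = 2.4e-9          # Arrhenius parameter C as listed

def implied_boltzmann(e_a=E_A, b=B_SLIDE):
    """Boltzmann constant in eV/K implied by the listed eA and B."""
    return e_a / b

def arrhenius_life(temp_k, c=C, b=B_SLIDE):
    """Arrhenius life term C * exp(B / T) for T in kelvin."""
    return c * math.exp(b / temp_k)

def acceleration_factor(t_use, t_stress, b=B_SLIDE):
    """Ratio of the Arrhenius life at t_use to the life at t_stress."""
    return math.exp(b / t_use - b / t_stress)

print("implied KB:", implied_boltzmann())
for temp in (328.0, 373.0, 423.0):
    print(temp, arrhenius_life(temp))
print("life ratio 328 K vs 423 K:", acceleration_factor(328.0, 423.0))
```

The ratio printed on the last line shows how strongly the exponential term separates the minimum and maximum steady-state stress levels.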
Results: Steady-State Temperature Stress
[Figure, four panels: reliability versus time [hours] under steady-state temperature stress. Panels: α = 0.8 with curves for T = 328, 373, and 423 K; α = 2.0 with curves for T = 328, 373, and 423 K; Temperature = 328 K with curves for α = 0.8, 1.0, 1.4, and 2.0; Temperature = 423 K with curves for α = 0.8, 1.0, 1.4, and 2.0.]
Results: Cyclic Stress
[Figure, four panels: reliability versus time [hours] under cyclic stress. Panels: α = 0.8 with curves for Period = π, 2π, and 4π; α = 2.0 with curves for Period = π, 2π, and 4π; Period = 4π with curves for α = 0.8, 1.0, 1.4, and 2.0; Period = π with curves for α = 0.8, 1.0, 1.4, and 2.0.]
Results: Progressive Stress
[Figure, four panels: reliability versus time [hours] under progressive stress. Panels: α = 0.8 with curves for Slope = 0.25, 0.5, and 1.0; α = 2.0 with curves for Slope = 0.25, 0.5, and 1.0; Slope = 0.25 with curves for α = 0.8, 1.0, 1.4, and 2.0; Slope = 1.0 with curves for α = 0.8, 1.0, 1.4, and 2.0.]
Results: Reliability vs. Type of Stress Test
[Figure, four panels: reliability versus time [hours], comparing the three stress tests. Panels: α = 0.8 with Slope 0.25, Period 4π, and T = 328 K; α = 1.0 with Slope 1.0, Period π, and T = 423 K; α = 1.4 with Slope 0.5, Period 2π, and T = 373 K; α = 2.0 with Slope 1.0, Period π, and T = 423 K.]
Conclusions
• Reliability varies greatly as a result of applying different types of stress tests
• The assumption of a constant failure rate and steady-state temperature may lead to errors in reliability prediction and to overly conservative decisions
• The value of the Weibull distribution shape parameter (α) has a negligible effect on system reliability for all types of stress tests
• The stress test-related parameters (temperature, period, and slope) have a visible impact on system reliability for all types of stress tests
Future Work
• Validate the proposed reliability model by performing detailed simulations of the physics of specific failure mechanisms and evaluate the overall impact on the system reliability
• Model the FPGA partial reconfiguration process using a Markov chain with repair
• Evaluate the combined effects of temperature overstress and radiation at both the system and transistor level
Backup Slides: Derivation of the Reliability Model
• Approximate the continuous-time Markov chain (CTMC) by a discrete-time Markov chain (DTMC), using the two-step technique given in [12]
  • Convert the CTMC to a DTMC
  • Approximate the continuous-time hazard function by a discrete-time hazard function
Backup Slides: Derivation of the Reliability Model
[Figure: Discrete-time State Transition Diagram over states 0, 1, and 2, with edge labels A(n), B(n), and C(n) and a self-loop labeled 1 - C(n).]
where A(n), B(n), and C(n) are given in terms of the hazard function as follows
Backup Slides: Derivation of the Reliability Model
• To approximate the continuous-time hazard function by a discrete-time hazard function, we use the probability mass function (PMF) of the discrete Weibull distribution, which is given in [12] as
  $f(n, s(n)) = q^{n^{\alpha}} - q^{(n+1)^{\alpha}}$
The PMF can equivalently be given as
  $f(n, s(n)) = R(n) - R(n+1)$
where R(n) is the reliability function given as
  $R(n, s(n)) = \exp\left[ -\left( \frac{n}{C e^{B/s(n)}} \right)^{\alpha} \right]$
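Backup Slides: Derivation of the Reliability Model
By substituting the reliability function into the PMF, we get
  $f(n, s(n)) = \exp\left[ -\left( \frac{n}{C e^{B/s(n)}} \right)^{\alpha} \right] - \exp\left[ -\left( \frac{n+1}{C e^{B/s(n)}} \right)^{\alpha} \right]$
where now the q in
  $f(n, s(n)) = q^{n^{\alpha}} - q^{(n+1)^{\alpha}}$
is given by
  $q = \exp\left[ -\left( \frac{1}{C e^{B/s(n)}} \right)^{\alpha} \right]$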
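The discrete Weibull approximation above can be checked numerically. The sketch below assumes a constant temperature so that q does not change between steps; the function names are hypothetical, and C and B are the values from the Model Constants slide. It confirms that q^(n^α) reproduces R(n) and that the PMF telescopes to 1 − R(N).

```python
import math

C = 2.4e-9      # Arrhenius parameter C (Model Constants slide)
B = 8117.82     # B = eA / KB in kelvin (Model Constants slide)

def q(temp_k, alpha):
    """Discrete Weibull parameter q = exp(-(1 / (C * exp(B / s)))^alpha)."""
    eta = C * math.exp(B / temp_k)
    return math.exp(-((1.0 / eta) ** alpha))

def reliability(n, temp_k, alpha):
    """R(n, s) written through q: q^(n^alpha)."""
    return q(temp_k, alpha) ** (n ** alpha)

def pmf(n, temp_k, alpha):
    """f(n, s) = q^(n^alpha) - q^((n+1)^alpha) = R(n) - R(n+1)."""
    qq = q(temp_k, alpha)
    return qq ** (n ** alpha) - qq ** ((n + 1) ** alpha)

# The probability of failing within the first 50 steps equals 1 - R(50).
steps = 50
print(sum(pmf(n, 373.0, 1.4) for n in range(steps)),
      1.0 - reliability(steps, 373.0, 1.4))
```

With a time-varying stress, q would be re-evaluated at s(n) for each step, so the telescoping identity holds only approximately.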
Backup Slides: Derivation of the Reliability Model
• Build the state transition matrix using the approximated discrete-time state transition probabilities
The transition probability matrix of the NHDTMC, as given in [13], is
$p_n = \begin{pmatrix} p_n^{U} & p_n^{UD} \\ p_n^{DU} & p_n^{D} \end{pmatrix}$
with rows and columns ordered as the Up states (U) followed by the Down states (D).
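For the three-state TMR chain, this partition can be illustrated directly from the discrete-time state transition diagram. The sketch below builds the full one-step matrix for a single time step and splits it into the four blocks; the function names and the numeric values of A, B, and C are hypothetical placeholders.

```python
def transition_matrix(a, b, c):
    """Full 3x3 one-step matrix over states 0, 1, 2 for given A(n), B(n), C(n)."""
    return [[1.0 - (a + b), a, b],
            [0.0, 1.0 - c, c],
            [0.0, 0.0, 1.0]]   # state 2 (system failure) is absorbing

def partition(p, up=(0, 1), down=(2,)):
    """Split p into the U, UD, DU, and D blocks of the partitioned matrix."""
    def block(rows, cols):
        return [[p[i][j] for j in cols] for i in rows]
    return {"U": block(up, up), "UD": block(up, down),
            "DU": block(down, up), "D": block(down, down)}

p_n = transition_matrix(0.03, 0.001, 0.02)   # placeholder probabilities
for name, blk in partition(p_n).items():
    print(name, blk)
```

Only the U block enters the reliability product, because the chain cannot return from the Down state without repair.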
References
1. N.S. Kim, T. Austin, D. Blaauw, T. Mudge, K. Flautner, J. Hu, M.J. Irwin, M. Kandemir, and V. Narayanan, “Leakage Current: Moore’s Law Meets Static Power”, Computer, Vol. 36(12), Dec. 2003, pp. 68-75.
2. K. Banerjee, S. Lin, A. Keshavarzi, S. Narendra, and V. De, “A Self-Consistent Junction Temperature Estimation Methodology for Nanometer Scale ICs with Implications for Performance and Thermal Management”, Technical Digest of the IEEE International Electron Devices Meeting (IEDM’03), 2003, pp. 36.7.1-36.7.4.
3. P. Lall, M.G. Pecht, and E.B. Hakim, “Influence of Temperature on Microelectronics and System Reliability”, CRC Press LLC, 1997.
4. N. Rollins, M.J. Wirthlin, and P.S. Graham, “Evaluation of Power Costs in Applying TMR to FPGA Designs”, Proceedings of the 7th Annual Military and Aerospace Programmable Logic Devices International Conference (MAPLD), Sept. 2004.
5. P.K. Samudrala, J. Ramos, and S. Katkoori, “Selective Triple Modular Redundancy for SEU Mitigation on FPGAs”, Proceedings of the 6th Annual Military and Aerospace Programmable Logic Devices International Conference (MAPLD), Sept. 2003.
6. B. Pratt, D.E. Johnson, M.J. Wirthlin, M. Caffrey, K. Morgan, and P. Graham, “Improving FPGA Design Robustness with Partial TMR”, Proceedings of the 8th Annual Military and Aerospace Programmable Logic Devices International Conference (MAPLD), Sept. 2005. To Be Published.
7. U.S. Department of Defense, Reliability Prediction of Electronic Equipment, MIL-HDBK 217F, Washington, D.C., 1991.
8. Siemens, SN29500 Reliability and Quality Specification Failure Rates of Components, 1986.
9. A. Mettas and P. Vassiliou, “Modeling and Analysis of Time-Dependent Stress Accelerated Life Data”, Proceedings of the 2002 Annual Reliability and Maintainability Symposium (RAMS), Jan. 2002, pp. 343-348.
10. W. Nelson, “Accelerated Testing: Statistical Models, Test Plans and Data Analyses”, John Wiley & Sons, 1990.
11. ReliaSoft, “Accelerated Life Testing Reference”, [Online book], 2001, Available at HTTP: http://www.weibull.com/acceltestwebcontents.htm
12. D.P. Siewiorek and R.S. Swarz, “Reliable Computer Systems”, Digital Press, 1992.
13. A. Platis, N. Limnios, and M.L. Du, “Hitting Time in a Finite Non-Homogeneous Markov Chain with Applications”, Journal of Applied Stochastic Models and Data Analysis, Vol. 14(3), 1998, pp. 241-253.
14. Wolfram Research, “Gauss-Kronrod Quadrature”, [Online document], Available at HTTP: http://mathworld.wolfram.com/Gauss-KronrodQuadrature.html