ELEC 7770: Advanced VLSI Design (Agrawal), Spring 2014, Apr 11

ELEC 7770 Advanced VLSI Design, Spring 2014
Soft Errors and Fault-Tolerant Design

Vishwani D. Agrawal
James J. Danaher Professor
ECE Department, Auburn University
Auburn, AL 36849
[email protected]
http://www.eng.auburn.edu/~vagrawal/COURSE/E7770_Spr14

Soft Errors

Soft errors are the errors caused by the operating environment.
They are not due to a permanent hardware fault.
Soft errors are intermittent or random, which makes their testing unreliable.
One way to deal with soft errors is to make hardware robust:
    Capable of detecting soft errors
    Capable of correcting soft errors
Both measures are probabilistic.

Some Early References

J. von Neumann, “Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components,” pp. 329-378, 1959, in A. H. Taub, editor, John von Neumann: Collected Works, Volume V: Design of Computers, Theory of Automata and Numerical Analysis, Oxford University Press, 1963.
M. A. Breuer, “Testing for Intermittent Faults in Digital Circuits,” IEEE Trans. Computers, vol. C-22, no. 3, pp. 241-246, March 1973.
T. C. May and M. H. Woods, “Alpha-Particle-Induced Soft Errors in Dynamic Memories,” IEEE Trans. Electron Devices, vol. ED-26, no. 1, pp. 2-9, 1979.

Interconnect coupling (crosstalk).
Power supply noise: IR-drop, power droop, ground bounce.
Ignition noise.
Electromagnetic pulse (EMP).
Effects generally attributed to alpha-particles:

Sources of Alpha-Particles

Radioactive contamination in VLSI packaging material.
Ionosphere, magnetosphere and solar radiation.
Other electromagnetic radiation.

Helium nucleus: two protons and two neutrons, mass = $6.65 \times 10^{-27}$ kg, charge = +2e ($e = 1.6 \times 10^{-19}$ C).

Failures in time (FIT): One FIT is 1 error per billion hours of operation.
Alternative unit is mean time between failures (MTBF) or mean time to failure (MTTF).
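
The two units above can be converted into each other when the failure rate is constant over time. The following is a minimal sketch under that assumption; the function names are illustrative and are not taken from the slides.

```python
HOURS_PER_BILLION = 1e9  # one FIT = one error per 1e9 device-hours


def fit_to_mtbf_hours(fit):
    """MTBF in hours for a constant failure rate given in FIT (assumed model)."""
    if fit <= 0:
        raise ValueError("FIT rate must be positive")
    return HOURS_PER_BILLION / fit


def mtbf_hours_to_fit(mtbf_hours):
    """Failure rate in FIT for a given MTBF in hours (assumed model)."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return HOURS_PER_BILLION / mtbf_hours


# Illustrative values only: a part rated at 1000 FIT.
print(fit_to_mtbf_hours(1000), "hours MTBF")
```

Under this model a lower FIT rating always means a proportionally longer MTBF, so either number can be quoted.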

Error correcting effects
    Transient pulse is filtered by gate inertia
    Transient is blocked by an unsensitized path
    Transient is blocked by an inactive clock
Error enhancing effects
    Large number of gates can produce multiple pulses
    Fanouts can multiply error pulses

Typical Soft Error Distribution
S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust System Design with Built-In Soft-Error Resilience,” Computer, vol. 38, no. 2, pp. 43-52, February 2005.

Soft Error Simulation

F. Wang and V. D. Agrawal, “Soft Error Rate with Inertial and Logical Masking,” Proc. 22nd International Conference on Quality VLSI Design, January 2009, pp. 459-464.
F. Wang and V. D. Agrawal, “Soft Error Rate Determination for Nanoscale Sequential Logic,” Proc. 11th International Symposium on Quality Electronic Design (ISQED), March 2010, pp. 225-230.

Parts that can be affected
    Look-up table (LUT)
    Configuration memory cell
    Flip-flop
    Block RAM
F. L. Kastensmidt, L. Carro and R. Reis, Fault-Tolerant Techniques for SRAM-Based FPGAs, Springer, 2006.

Most soft errors in combinational logic are eliminated by inertial or logic masking.
Soft error pulse generated in flip-flop is much shorter than clock period.
Probability of either a master or slave latch being struck by soft error exactly at clock edge is small.

Flip-flop is duplicated and outputs fed to C-element.
Twenty times reduction of soft error observed.
Ref.: S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust System Design with Built-In Soft-Error Resilience,” Computer, vol. 38, no. 2, pp. 43-52, February 2005.
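
To illustrate why feeding two copies of a flip-flop into a C-element suppresses an upset, here is a small behavioral sketch. It assumes an idealized, non-inverting C-element that follows its inputs when they agree and holds its previous output when they disagree. The class and function names are hypothetical, and circuit details from the cited paper are not modeled.

```python
class CElement:
    """Idealized two-input C-element: follow equal inputs, otherwise hold."""

    def __init__(self, initial=0):
        self.output = initial

    def update(self, a, b):
        if a == b:
            self.output = a  # inputs agree: output takes their value
        return self.output   # inputs disagree: previous value is held


def protected_output(copy_a, copy_b, initial=0):
    """Output sequence seen after the C-element for two flip-flop copies."""
    c = CElement(initial)
    return [c.update(a, b) for a, b in zip(copy_a, copy_b)]


# Both copies should store 1, 1, 0, 0; copy_b suffers an upset in cycle 1.
copy_a = [1, 1, 0, 0]
copy_b = [1, 0, 0, 0]
print(protected_output(copy_a, copy_b))  # [1, 1, 0, 0]
```

An upset in only one copy makes the inputs disagree, so the held output masks it. An error gets through only if both copies are upset in the same way.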

R(t) is the probability of no error in the interval [0, t].
Divide the interval into a large number (n) of subintervals of duration t/n. Let x be the probability of error in one subinterval.
Assume that the duration t/n is so small that at most one error can occur in a subinterval. Then, with λ denoting the average error rate, the average number of errors in a subinterval is $0 \cdot (1 - x) + 1 \cdot x = x = \lambda t / n$.
Probability of no error in the interval [0, t] is
$$R(t) = \lim_{n \to \infty} (1 - x)^n = \lim_{n \to \infty} \left(1 - \frac{\lambda t}{n}\right)^n = e^{-\lambda t}$$
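
The subinterval argument gives a no-error probability of (1 - λt/n)^n over n independent subintervals, which approaches exp(-λt) as n grows. The sketch below checks that convergence numerically. The error rate and time are arbitrary illustrative assumptions, not figures from the slides.

```python
import math


def reliability_discrete(rate, t, n):
    """No-error probability with [0, t] split into n independent subintervals."""
    return (1.0 - rate * t / n) ** n


def reliability_exponential(rate, t):
    """Limiting form as n grows without bound."""
    return math.exp(-rate * t)


rate = 5e-5   # assumed errors per hour (illustrative)
t = 1000.0    # assumed observation time in hours (illustrative)
for n in (10, 1_000, 100_000):
    print(n, reliability_discrete(rate, t, n))
print("limit", reliability_exponential(rate, t))
```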

Example: First Generation Computer

10,000 electron tubes.
Average burn out rate: 5 tubes per 100,000 hours.
MTBF = 100,000/5 = 20,000 hours = 2.3 years, i.e., 37% chance of survival beyond 2.3 years.
Time for 95% chance of survival: R(t) = exp(-t/MTBF) = 0.95, or t = 1.4 months.
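
The numbers in this example can be reproduced with a few lines of arithmetic. This sketch assumes the exponential model R(t) = exp(-t/MTBF) used in the slide, a 365-day year and twelve equal months. The helper names are illustrative.

```python
import math

HOURS_PER_YEAR = 24 * 365              # assumed 365-day year
HOURS_PER_MONTH = HOURS_PER_YEAR / 12  # assumed equal-length months


def survival_probability(t_hours, mtbf_hours):
    """R(t) = exp(-t / MTBF)."""
    return math.exp(-t_hours / mtbf_hours)


def time_for_survival(target, mtbf_hours):
    """Solve exp(-t / MTBF) = target for t, in hours."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target probability must be in (0, 1]")
    return -mtbf_hours * math.log(target)


mtbf = 100_000 / 5  # 5 burn-outs per 100,000 hours
print("MTBF in years:", mtbf / HOURS_PER_YEAR)
print("R(MTBF):", survival_probability(mtbf, mtbf))
print("months for 95%:", time_for_survival(0.95, mtbf) / HOURS_PER_MONTH)
```

The 37% figure is simply exp(-1), the survival probability at t = MTBF for any exponential reliability model.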

Error Detection Code

Errors: Bits can flip due to noise in circuits and in communication.
Extra bits used for error detection.
Example: a parity bit in ASCII code
    Even parity code for A: 01000001 (even number of 1s)
    Odd parity code for A: 11000001 (odd number of 1s)
    In each code the leading bit is the parity bit and the remaining seven bits are the 7-bit ASCII code.
Single-bit error in the 7-bit code of “A” (1000001), e.g., 1000101, will change the symbol to “E”, or 1000000 to “@”. But the error will be detected in the 8-bit code because the error changes the specified parity.
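
The parity example can be sketched as follows, with the parity bit placed in front of the 7-bit ASCII code as in the codes for "A" above. The function names are illustrative assumptions.

```python
def encode(ch, even=True):
    """Return the 8-bit string: parity bit followed by the 7-bit ASCII code."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not a 7-bit ASCII character")
    bits7 = format(code, "07b")
    parity = bits7.count("1") % 2  # makes the total number of 1s even
    if not even:
        parity ^= 1                # odd parity: make the total odd instead
    return str(parity) + bits7


def parity_ok(code8, even=True):
    """Check whether an 8-bit string satisfies the chosen parity."""
    ones = code8.count("1")
    return ones % 2 == (0 if even else 1)


word = encode("A")                 # '01000001'
corrupted = word[:5] + "1" + word[6:]  # flip one bit: payload now reads as "E"
print(word, parity_ok(word))
print(corrupted, parity_ok(corrupted))
```

The check reports that something is wrong but not which bit flipped, so a single parity bit detects the error without correcting it.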

Richard W. Hamming
Error-correcting codes (ECC).
Also known for Hamming distance: HD = number of bit positions in which two codes differ.
Original code: Symbol “0” with a single-bit error will be interpreted as “1”, “2”, “4” or “8”.
Reason: Hamming distance between codes is 1. A code with any bit error will map onto another valid code.
Remedy: Design codes with HD ≥ 2. Example: Parity code. Single bit error detected but not correctable.
Remedy: Design codes with HD ≥ 3. For single bit error correction, decode as the valid code at HD = 1.
For more error bit detection or correction, design code with HD ≥ 4.
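
The effect of the minimum Hamming distance can be tried out on small codes. In this sketch the 4-bit binary codes for the symbols 0, 1, 2, 4 and 8 stand in for the original code, and a trailing even-parity bit gives HD = 2. The 3-bit repetition code is an assumed example of HD = 3 that does not come from the slides. All names are illustrative.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of bit positions in which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("code words must have equal length")
    return sum(x != y for x, y in zip(a, b))


def minimum_distance(codewords):
    """Smallest Hamming distance between any two distinct code words."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))


def decode_nearest(word, codewords):
    """Return the unique closest code word, or None if the closest is tied."""
    ranked = sorted(codewords, key=lambda c: hamming_distance(word, c))
    if len(ranked) > 1 and (hamming_distance(word, ranked[0])
                            == hamming_distance(word, ranked[1])):
        return None  # error detected but not correctable
    return ranked[0]


binary_code = ["0000", "0001", "0010", "0100", "1000"]             # HD = 1
parity_code = [c + str(c.count("1") % 2) for c in binary_code]     # HD = 2
repetition_code = ["000", "111"]                                   # HD = 3

for name, code in [("binary", binary_code), ("parity", parity_code),
                   ("repetition", repetition_code)]:
    print(name, minimum_distance(code))
print(decode_nearest("00001", parity_code))   # None: tie between neighbors
print(decode_nearest("010", repetition_code)) # '000': single error corrected
```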

A Book on Coding Theory

R. W. Hamming, Coding and Information Theory, Englewood Cliffs, New Jersey: Prentice-Hall, 1980.

Byzantine General’s Problem

In a war a general needs to communicate an attack (a) or retreat (r) order to subordinates in the field.
For success a perfect agreement is necessary.
Byzantine Fault:
    Subordinates can be unreliable or malicious.
    Communication (messengers) can be unreliable or ...

Byzantine Resilient System

A system that can correctly function in presence of Byzantine faults.
Byzantine protocol for n node system:
    Any node can initiate a message broadcast.
    Each node rebroadcasts the received message to all nodes it has not heard from.
    After communications end, nodes take majority decision.
Ref.: L. Lamport, R. Shostak and M. Pease, “The Byzantine Generals Problem,” ACM Trans. Prog. Lang. Syst., vol. 4, no. 3, pp. 382-401, July 1982.
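
The broadcast, rebroadcast and majority steps can be illustrated with a heavily simplified toy model. It assumes a loyal commander, exactly one relay round, and traitors that always relay the opposite order. This is a sketch of the idea, not the full algorithm from the cited paper, and all names are hypothetical.

```python
from collections import Counter


def flip(order):
    """Opposite of an attack ('a') or retreat ('r') order."""
    return "r" if order == "a" else "a"


def one_round_majority(order, lieutenants, traitors):
    """Decisions of the loyal lieutenants after one relay round.

    Assumptions: the commander is loyal and sends `order` to everyone;
    each lieutenant relays what it received; traitors relay the flipped order.
    """
    received = {name: order for name in lieutenants}
    decisions = {}
    for name in lieutenants:
        if name in traitors:
            continue  # traitors' own decisions are irrelevant
        votes = [received[name]]
        for other in lieutenants:
            if other == name:
                continue
            relayed = received[other]
            votes.append(flip(relayed) if other in traitors else relayed)
        decisions[name] = Counter(votes).most_common(1)[0][0]
    return decisions


lieutenants = ["L1", "L2", "L3"]
print(one_round_majority("a", lieutenants, {"L3"}))        # loyal nodes agree on 'a'
print(one_round_majority("a", lieutenants, {"L2", "L3"}))  # too many traitors
```

With four nodes in total, one traitor is outvoted, while two traitors can push the remaining loyal lieutenant to the wrong decision.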

In order to tolerate t failures:
    The system must have at least 3t + 1 nodes.
    There must be at least 2t + 1 disjoint communication paths between nodes.
    A node must exchange messages for at least t + 1 rounds.
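
These bounds are simple to tabulate. The sketch below treats the t + 1 figure as a count of message-exchange rounds, which is an assumption about how that bullet is meant to be read. The function names are illustrative.

```python
def byzantine_requirements(t):
    """Minimum resources to tolerate t Byzantine failures, per the bounds above."""
    if t < 0:
        raise ValueError("t must be non-negative")
    return {
        "min_nodes": 3 * t + 1,
        "min_disjoint_paths": 2 * t + 1,
        "min_rounds": t + 1,  # assumed to mean rounds of message exchange
    }


def max_tolerable_failures(n):
    """Largest t satisfying n >= 3t + 1 for an n-node system."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1) // 3


for t in (1, 2, 3):
    print(t, byzantine_requirements(t))
print("7 nodes tolerate", max_tolerable_failures(7), "failures")
```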

D. K. Pradhan, Fault-Tolerant Computer System Design, Upper Saddle River, New Jersey: Prentice Hall PTR, 1996.
P. K. Lala, Self-Checking and Fault-Tolerant Digital Design, San Francisco: Morgan-Kaufmann, 2001.