Oct. 2007 Terminology, Models, and Measures Slide 1 Fault-Tolerant Computing Basic Concepts and Tools
About This Presentation
Edition: First. Released: Oct. 2006. Revised: Oct. 2007.
This presentation has been prepared for the graduate course ECE 257A (Fault-Tolerant Computing) by Behrooz Parhami, Professor of Electrical and Computer Engineering at University of California, Santa Barbara. The material contained herein can be used freely in classroom teaching or any other educational setting. Unauthorized uses are prohibited. © Behrooz Parhami
Terminology, Models, and Measures for Dependability
Impairments to Dependability
Terms in use: error, malfunction, degradation, failure, fault, intrusion, hazard, defect, flaw, bug, crash
The Fault-Error-Failure Cycle
Schematic diagram of the Newcastle hierarchical model and the impairments within one level.
Aspect      Impairment
Structure   Fault
   ⇓           ⇓
State       Error
   ⇓           ⇓
Behavior    Failure
Note on Structure: includes both components and design.
[Figure: logic-circuit example marking a fault, with signal values 0, 0, 0 and the annotations "Correct signal" and "Replaced with NAND?"]
The Four-Universe Model
Cause-effect diagram for Avižienis’ four-universe model of impairments to dependability.
Universe        Impairment
Physical        Failure
   ⇓               ⇓
Logical         Fault
   ⇓               ⇓
Informational   Error
   ⇓               ⇓
External        Crash
Unrolling the Fault-Error-Failure Cycle
Cause-effect diagram for an extended six-level view of impairments to dependability.
Abstraction   Impairment    Grouping
Component     Defect        Low-level
   ⇓             ⇓
Logic         Fault         Low-level
   ⇓             ⇓
Information   Error         Mid-level
   ⇓             ⇓
System        Malfunction   Mid-level
   ⇓             ⇓
Service       Degradation   High-level
   ⇓             ⇓
Result        Failure       High-level
The diagram marks a first and a second pass through the fault-error-failure cycle (Structure/Fault ⇒ State/Error ⇒ Behavior/Failure).
Multilevel Model
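[Figure: state diagram of the multilevel model.]
Level         State
(none)        Ideal
Component     Defective        (low-level impaired)
Logic         Faulty           (low-level impaired)
Information   Erroneous        (mid-level impaired)
System        Malfunctioning   (mid-level impaired)
Service       Degraded         (high-level impaired)
Result        Failed           (high-level impaired)
Legend (transition types): initial entry, deviation, remedy, tolerance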
Analogy for the Multilevel Model
An analogy for our multilevel model of dependable computing. Defects, faults, errors, malfunctions, degradations, and failures are represented by pouring water from above. Valves represent avoidance and tolerance techniques. The goal is to avoid overflow.
Concentric reservoirs are analogs of the six model levels, with defect being innermost
Wall heights represent inter-level latencies
Inlet valves represent avoidance techniques
Drain valves represent tolerance techniques
Why Our Concern with Dependability?
Reliability of an n-transistor system, with each transistor having failure rate λ:
$R(t) = e^{-n\lambda t}$
There are only 3 ways of making systems more reliable:
Reduce λ
Reduce n
Reduce t
[Plot: $e^{-n\lambda t}$ (vertical axis, 0.0 to 1.0) versus nt (ticks at $10^{4}$, $10^{6}$, $10^{8}$, $10^{10}$); marked values: .9999, .9990, .9900, .9048, .3679.]
Alternative: Change the reliability formula by introducing redundancy in the system
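As a rough numerical sketch of the formula R(t) = e^(-nλt), the Python snippet below evaluates the reliability for a range of transistor counts. The function name `system_reliability` and the sample values of n, λ, and t are illustrative assumptions, not figures taken from the slides.

```python
import math

def system_reliability(n: int, lam: float, t: float) -> float:
    """R(t) = exp(-n * lam * t) for n components, each with failure rate lam."""
    return math.exp(-n * lam * t)

# Hypothetical values: lam is an assumed per-transistor failure rate (per hour).
lam = 1e-10
t = 1e4  # assumed mission time in hours
for n in (10**2, 10**3, 10**4, 10**5, 10**6):
    r = system_reliability(n, lam, t)
    print(f"n = {n:>7}: n*lam*t = {n * lam * t:.0e}, R = {r:.4f}")
```

With these assumed numbers the product nλt runs from 10^-4 to 1, and R falls from about 0.9999 to about 0.3679. Because only the product matters, cutting any one of n, λ, or t by a factor of 10 has the same effect on R.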
Highly Dependable Computer Systems
Long-life systems: Fail-slow, Rugged, High-reliability
Spacecraft with multiyear missions, systems in inaccessible locations
Methods: Replication (spares), error coding, monitoring, shielding
Safety-critical systems: Fail-safe, Sound, High-integrity
Flight control computers, nuclear-plant shutdown, medical monitoring
Methods: Replication with voting, time redundancy, design diversity
Non-stop systems: Fail-soft, Robust, High-availability
Telephone switching centers, transaction processing, e-commerce
Methods: HW/info redundancy, backup schemes, hot-swap, recovery
Just as performance enhancement techniques gradually migrate from supercomputers to desktops, so too dependability enhancement methods find their way from exotic systems into personal computers.
Aspects of Dependability
Aspects: reliability, maintainability, availability, performability, security, integrity, serviceability, testability, safety, robustness, resilience
Associated measures:
Reliability: reliability, MTTF = MTFF
Availability: pointwise availability, interval availability, MTBF, MTTR
Performability: performability, MCBF
Safety: risk, consequence
Testability: controllability, observability
Concepts from Probability Theory
Cumulative distribution function (CDF): $F(t) = \mathrm{prob}[x \le t] = \int_{0}^{t} f(x)\,dx$
Probability density function (pdf): $f(t) = \mathrm{prob}[t \le x \le t + dt] / dt = dF(t)/dt$
[Figure: lifetimes of 20 identical systems, with the corresponding CDF F(t) (0.0 to 1.0) and pdf f(t) (0.00 to 0.05) plotted against time (0 to 50).]
Expected value of x: $E_x = \int_{-\infty}^{+\infty} x\,f(x)\,dx = \sum_k x_k f(x_k)$
Variance of x: $\sigma_x^2 = \int_{-\infty}^{+\infty} (x - E_x)^2 f(x)\,dx = \sum_k (x_k - E_x)^2 f(x_k)$
Covariance of x and y: $\psi_{x,y} = E[(x - E_x)(y - E_y)] = E[xy] - E_x E_y$
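A small sketch of how these definitions apply to observed lifetimes: the snippet below builds an empirical CDF and computes the sample mean and variance. The lifetime values are made up for illustration (they are not the 20 lifetimes plotted on the slide), and the helper name `empirical_cdf` is hypothetical.

```python
from statistics import fmean, pvariance

# Made-up lifetimes (arbitrary time units) of 10 hypothetical systems.
lifetimes = [3.0, 7.5, 12.0, 15.5, 18.0, 22.0, 26.5, 31.0, 38.0, 46.5]

def empirical_cdf(samples, t):
    """Fraction of samples with lifetime <= t, an estimate of F(t)."""
    return sum(1 for x in samples if x <= t) / len(samples)

mean = fmean(lifetimes)                   # estimate of E_x
variance = pvariance(lifetimes, mu=mean)  # estimate of sigma_x^2

print(f"F(20) ~ {empirical_cdf(lifetimes, 20):.2f}")
print(f"E_x ~ {mean:.2f}, variance ~ {variance:.2f}")
```

The population variance is used here because it mirrors the slide's definition directly; for a small sample one might prefer the unbiased estimator `statistics.variance`.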
Some Simple Probability Distributions
[Figure: density f(x) and CDF F(x) sketched for the uniform, exponential, normal, and binomial distributions.]
Reliability and MTTF
Reliability: R(t)
Probability that system remains in the “Good” state through the interval [0, t]
Two-state nonrepairable system: Up (start state) goes to Down on failure
$R(t + dt) = R(t)\,[1 - z(t)\,dt]$, where z(t) is the hazard function
$R(t) = 1 - F(t)$, where F(t) is the CDF of the system lifetime, or its unreliability
Constant hazard function: $z(t) = \lambda \Rightarrow R(t) = e^{-\lambda t}$ (exponential reliability law; system failure rate is independent of its age)
Mean time to failure: MTTF
$\mathrm{MTTF} = \int_{0}^{+\infty} t\,f(t)\,dt = \int_{0}^{+\infty} R(t)\,dt$
The first integral is the expected value of lifetime; the second is the area under the reliability curve (easily provable)
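The equality of the two MTTF integrals follows from integration by parts, using f(t) = -dR(t)/dt and assuming t·R(t) tends to 0 as t grows. As a numerical sanity check, the sketch below integrates R(t) for the exponential law and compares the result with 1/λ. The function name, the value of λ, and the truncation point of the integral are illustrative assumptions.

```python
import math

def mttf_by_integration(reliability, t_max, steps=100_000):
    """Approximate the area under R(t) on [0, t_max] with the trapezoidal rule."""
    dt = t_max / steps
    total = 0.5 * (reliability(0.0) + reliability(t_max))
    for i in range(1, steps):
        total += reliability(i * dt)
    return total * dt

lam = 0.002  # assumed constant failure rate (failures per hour)
# Truncate the infinite integral at 20/lam, where R(t) = e^-20 is negligible.
mttf = mttf_by_integration(lambda t: math.exp(-lam * t), t_max=20 / lam)
print(f"numerical MTTF ~ {mttf:.2f}, 1/lambda = {1 / lam:.2f}")
```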
Failure Distributions of Interest
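Exponential: $z(t) = \lambda$, $R(t) = e^{-\lambda t}$, $\mathrm{MTTF} = 1/\lambda$
Weibull: $z(t) = \alpha\lambda(\lambda t)^{\alpha-1}$, $R(t) = e^{-(\lambda t)^{\alpha}}$, $\mathrm{MTTF} = (1/\lambda)\,\Gamma(1 + 1/\alpha)$
Erlang: $\mathrm{MTTF} = k/\lambda$
Gamma: Erlang and exponential are special cases
Normal: Reliability and MTTF formulas are complicated
Rayleigh: $z(t) = 2\lambda(\lambda t)$, $R(t) = e^{-(\lambda t)^{2}}$, $\mathrm{MTTF} = (1/\lambda)\,\sqrt{\pi}/2$
Discrete versions:
Geometric: $R(k) = q^{k}$
Binomial
Discrete Weibull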
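The Weibull formulas reduce to the exponential case for α = 1 and to the Rayleigh case for α = 2. A minimal sketch that checks this numerically is given below; the function names and the value of λ are assumptions made for illustration.

```python
import math

def weibull_reliability(t: float, lam: float, alpha: float) -> float:
    """R(t) = exp(-(lam * t) ** alpha)."""
    return math.exp(-((lam * t) ** alpha))

def weibull_mttf(lam: float, alpha: float) -> float:
    """MTTF = (1 / lam) * Gamma(1 + 1 / alpha)."""
    return math.gamma(1.0 + 1.0 / alpha) / lam

lam = 0.01  # assumed scale parameter (per hour)
print(weibull_mttf(lam, 1.0))            # alpha = 1: exponential case, 1/lam
print(weibull_mttf(lam, 2.0))            # alpha = 2: Rayleigh case
print(math.sqrt(math.pi) / (2.0 * lam))  # Rayleigh closed form, for comparison
```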
Comparing Reliabilities
Reliability gain: $R_2 / R_1$
Reliability difference: $R_2 - R_1$
Reliability improvement factor: $\mathrm{RIF}_{2/1} = [1 - R_1(t_M)] / [1 - R_2(t_M)]$
Example: [1 - 0.9] / [1 - 0.99] = 10
Reliability improvement index: $\mathrm{RII} = \log R_1(t_M) / \log R_2(t_M)$
Mission time extension: $\mathrm{MTE}_{2/1}(r_G) = T_2(r_G) - T_1(r_G)$
Mission time improvement factor: $\mathrm{MTIF}_{2/1}(r_G) = T_2(r_G) / T_1(r_G)$
[Figure: reliability functions $R_1(t)$ and $R_2(t)$ for Systems 1 and 2, plotted as system reliability (0.0 to 1.0) versus time; marked points: $t_M$, $R_1(t_M)$, $R_2(t_M)$, $r_G$, $T_1(r_G)$, $T_2(r_G)$, $\mathrm{MTTF}_1$, $\mathrm{MTTF}_2$.]
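The comparison measures can be sketched in a few lines of Python. This reading assumes that t_M denotes the mission time and that T_i(r) is the time at which R_i(t) drops to r, which is an interpretation of the plot labels. The two exponential systems, their failure rates, and all function names are hypothetical.

```python
import math

def rif(r1: float, r2: float) -> float:
    """Reliability improvement factor of System 2 over System 1."""
    return (1.0 - r1) / (1.0 - r2)

def rii(r1: float, r2: float) -> float:
    """Reliability improvement index."""
    return math.log(r1) / math.log(r2)

def time_to_reach(r_goal: float, lam: float) -> float:
    """T(r_goal) for R(t) = exp(-lam * t): solve exp(-lam * T) = r_goal."""
    return -math.log(r_goal) / lam

print(rif(0.9, 0.99))  # the slide's example, approximately 10
print(rii(0.9, 0.99))

# Hypothetical exponential systems with assumed failure rates (per hour).
lam1, lam2, r_goal = 1e-3, 2.5e-4, 0.95
t1, t2 = time_to_reach(r_goal, lam1), time_to_reach(r_goal, lam2)
print(f"MTE = {t2 - t1:.1f} h, MTIF = {t2 / t1:.2f}")
```

For two exponential systems, MTIF equals the ratio of the failure rates regardless of the reliability goal, which is why it comes out as 4 here.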
Availability, MTTR, and MTBF
(Interval) Availability: A(t)
Fraction of time that system is in the “Up” state during the interval [0, t]
Two-state repairable system: Up (start state) goes to Down on failure and returns to Up on repair
Availability = Reliability, when there is no repair
Availability is a function not only of how rarely a system fails (reliability) but also of how quickly it can be repaired (time to repair)
Pointwise availability: a(t)
Probability that the system is available at time t
$A(t) = (1/t) \int_{0}^{t} a(x)\,dx$
Steady-state availability: $A = \lim_{t \to \infty} A(t)$
$A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}} = \frac{\mathrm{MTTF}}{\mathrm{MTBF}} = \frac{\mu}{\lambda + \mu}$
where μ is the repair rate and 1/μ = MTTR (will justify this equation later)
In general, μ >> λ, leading to A ≅ 1
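A minimal sketch of the steady-state availability formula, computed both from mean times and from rates. It assumes constant failure and repair rates, so that λ = 1/MTTF and μ = 1/MTTR. The numeric values and function names are illustrative assumptions.

```python
def steady_state_availability(mttf: float, mttr: float) -> float:
    """A = MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)

def availability_from_rates(lam: float, mu: float) -> float:
    """A = mu / (lam + mu), with lam = 1/MTTF and mu = 1/MTTR."""
    return mu / (lam + mu)

mttf, mttr = 1000.0, 2.0  # assumed values, in hours
a1 = steady_state_availability(mttf, mttr)
a2 = availability_from_rates(1.0 / mttf, 1.0 / mttr)
print(f"A = {a1:.6f} (from times), {a2:.6f} (from rates)")
print(f"expected downtime per year ~ {(1 - a1) * 8760:.1f} hours")
```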
System Up and Down Times
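[Figure: timeline of a system alternating between Up and Down over the interval [0, t], with event times t1, t'1, t2, t'2; labeled intervals: time to first failure, time between failures, repair time. Inset: the two-state repairable system (Up as start state; Failure and Repair transitions).]
Short repair time implies good maintainability (serviceability)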
Performability and MCBF
Performability: P
Composite measure, incorporating both performance and reliability
Three-state degradable system
[State diagram: states Up 2 (start state), Up 1, and Down; transitions labeled partial failure, failure, partial repair, and repair.]
Simple example: worth of “Up2” is twice that of “Up1”
$p_{\mathrm{Up}i}$ = probability that the system is in state Up i
$P = 2p_{\mathrm{Up2}} + p_{\mathrm{Up1}}$
$p_{\mathrm{Up2}} = 0.92$, $p_{\mathrm{Up1}} = 0.06$, $p_{\mathrm{Down}} = 0.02$, $P = 1.90$ (system performance equivalent to that of 1.9 processors on average)
Performability improvement factor of this system (akin to RIF) relative to a fail-hard system that goes down when either processor fails:
$\mathrm{PIF} = (2 - 2 \times 0.92) / (2 - 1.90) = 1.6$
Question: What is system availability here?
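The performability arithmetic of this example can be reproduced with a short sketch. The function name is hypothetical; the probabilities are the ones given on the slide. The fail-hard comparison follows the slide's own expression, in which the fail-hard system delivers two units of work only while both processors are up.

```python
def performability(p_up2: float, p_up1: float) -> float:
    """P = 2*pUp2 + pUp1: state Up2 is worth two processors, Up1 one."""
    return 2.0 * p_up2 + p_up1

p_up2, p_up1, p_down = 0.92, 0.06, 0.02
p_degradable = performability(p_up2, p_up1)

# Fail-hard system: down whenever either processor has failed,
# so it earns nothing from the single-processor state.
p_fail_hard = performability(p_up2, 0.0)

pif = (2.0 - p_fail_hard) / (2.0 - p_degradable)
print(f"P = {p_degradable:.2f}, PIF = {pif:.2f}")
```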
System Up, Partially Up, and Down Times
[Figure: timeline of a system moving among Up, Partially Up, and Down over the interval [0, t], with event times t1, t2, t'2, t'1, t3, t'3, events labeled partial failure, total failure, and partial repair, and a label MCBF. Inset: the three-state degradable system (Up 2 as start state, Up 1, Down).]
Important to prevent direct transitions to the “Down” state (coverage)
Integrity and Safety
Risk: Prob. of being in “Unsafe Failed” state
There may be multiple unsafe states, each with a different consequence (cost)
Three-state fail-safe system
[State diagram: Good (start state), with Failure transitions to Safe Failed and to Unsafe Failed.]
Simple analysis: Lump “Safe Failed” state with “Good” state; proceed as in reliability analysis
More detailed analysis: Even though “Safe Failed” state is more desirable than “Unsafe Failed”, it is still not as desirable as the “Good” state; so keeping it separate makes sense
For example, if a repair transition is introduced between “Safe Failed” and “Good” states, we can tackle questions such as the expected outage of the system in safe mode, and thus its availability