Basic Concepts and Tools

Oct. 2007 Terminology, Models, and Measures Slide 1

Fault-Tolerant Computing: Basic Concepts and Tools


About This Presentation

Edition   Released    Revised
First     Oct. 2006   Oct. 2007

This presentation has been prepared for the graduate course ECE 257A (Fault-Tolerant Computing) by Behrooz Parhami, Professor of Electrical and Computer Engineering at University of California, Santa Barbara. The material contained herein can be used freely in classroom teaching or any other educational setting. Unauthorized uses are prohibited. © Behrooz Parhami


Terminology, Models, and Measures for Dependability



Impairments to Dependability

ERROR

Malfunction

Degradation

Failure

Fault

Intrusion

Hazard

Defect

Flaw

Bug

Crash


The Fault-Error-Failure Cycle

Schematic diagram of the Newcastle hierarchical model and the impairments within one level.

Aspect      Impairment
Structure   Fault
   ⇓           ⇓
State       Error
   ⇓           ⇓
Behavior    Failure

Includes both components and design

[Figure: gate-level example with signal values 0, 0, 0 and the labels "Fault", "Correct signal", and "Replaced with NAND?"]


The Four-Universe Model

Cause-effect diagram for Avižienis’ four-universe model of impairments to dependability.

Universe        Impairment
Physical        Failure
   ⇓               ⇓
Logical         Fault
   ⇓               ⇓
Informational   Error
   ⇓               ⇓
External        Crash


Unrolling the Fault-Error-Failure Cycle

Cause-effect diagram for an extended six-level view of impairments to dependability.

Level        Abstraction   Impairment
Low-level    Component     Defect
                ⇓             ⇓
Low-level    Logic         Fault
                ⇓             ⇓
Mid-level    Information   Error
                ⇓             ⇓
Mid-level    System        Malfunction
                ⇓             ⇓
High-level   Service       Degradation
                ⇓             ⇓
High-level   Result        Failure

The figure also marks a first cycle and a second cycle of the fault-error-failure model along this chain:

Aspect      Impairment
Structure   Fault
   ⇓           ⇓
State       Error
   ⇓           ⇓
Behavior    Failure


Multilevel Model

[Figure: state diagram of the multilevel model.
States: Ideal; Defective and Faulty (low-level impaired); Erroneous and Malfunctioning (mid-level impaired); Degraded and Failed (high-level impaired).
Abstraction levels: Component, Logic, Information, System, Service, Result.
Legend: Initial Entry, Entry, Deviation, Remedy, Tolerance.]


Analogy for the Multilevel Model

An analogy for our multilevel model of dependable computing. Defects, faults, errors, malfunctions, degradations, and failures are represented by pouring water from above. Valves represent avoidance and tolerance techniques. The goal is to avoid overflow.

Wall heights represent inter-level latencies

Drain valves represent tolerance techniques

Concentric reservoirs are analogs of the six model levels, with defect being innermost


Inlet valves represent avoidance techniques


Why Our Concern with Dependability?

Reliability of an n-transistor system, with each transistor having failure rate λ:

$R(t) = e^{-n\lambda t}$

There are only 3 ways of making systems more reliable:

Reduce λ
Reduce n
Reduce t

[Figure: plot of $e^{-n\lambda t}$ (vertical axis, 0.0 to 1.0) versus nt (axis ticks at $10^{4}$, $10^{6}$, $10^{8}$, $10^{10}$), with marked values .9999, .9990, .9900, .9048, and .3679.]

Alternative: Change the reliability formula by introducing redundancy in the system
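The formula $R(t) = e^{-n\lambda t}$ can be evaluated directly. The following is a minimal sketch; the function name `series_reliability` and the sample values of n, λ, and t are illustrative assumptions, not figures taken from the slide.

```python
import math

def series_reliability(n, lam, t):
    """R(t) = exp(-n * lam * t) for n components, each with constant failure rate lam."""
    return math.exp(-n * lam * t)

# The marked values on the plot correspond to n*lam*t = 1e-4, 1e-3, 1e-2, 1e-1, 1.
for exponent in (-4, -3, -2, -1, 0):
    product = 10.0 ** exponent
    print(f"n*lam*t = {product:g}: R = {math.exp(-product):.4f}")

# Hypothetical numbers: a million transistors, lam = 1e-10 per hour, 10,000 hours.
r_example = series_reliability(n=10**6, lam=1e-10, t=1e4)
print(f"R = {r_example:.4f}")
```

Only the product nλt matters, so halving any one of n, λ, or t has the same effect on reliability.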


Highly Dependable Computer Systems

Long-life systems: Fail-slow, Rugged, High-reliability
Spacecraft with multiyear missions, systems in inaccessible locations
Methods: Replication (spares), error coding, monitoring, shielding

Safety-critical systems: Fail-safe, Sound, High-integrity
Flight control computers, nuclear-plant shutdown, medical monitoring
Methods: Replication with voting, time redundancy, design diversity

Non-stop systems: Fail-soft, Robust, High-availability
Telephone switching centers, transaction processing, e-commerce
Methods: HW/info redundancy, backup schemes, hot-swap, recovery

Just as performance enhancement techniques gradually migrate from supercomputers to desktops, so too dependability enhancement methods find their way from exotic systems into personal computers


Aspects of Dependability

Aspects: Reliability, Maintainability, Availability, Performability, Security, Integrity, Serviceability, Testability, Safety, Robustness, Resilience

Associated measures shown in the figure:
Reliability, MTTF = MTFF
Risk, consequence
Controllability, observability
Performability, MCBF
Pointwise av., interval av., MTBF, MTTR


Concepts from Probability Theory

Cumulative distribution function (CDF):
$F(t) = \mathrm{prob}[x \le t] = \int_0^t f(x)\,dx$

Probability density function (pdf):
$f(t) = \mathrm{prob}[t \le x \le t + dt] / dt = dF(t)/dt$

[Figure: three plots over time 0 to 50: the lifetimes of 20 identical systems, the CDF F(t) (0.0 to 1.0), and the pdf f(t) (0.00 to 0.05).]

Expected value of x:
$E_x = \int_{-\infty}^{+\infty} x f(x)\,dx$ (continuous) or $E_x = \sum_k x_k f(x_k)$ (discrete)

Covariance of x and y:
$\psi_{x,y} = E[(x - E_x)(y - E_y)] = E[xy] - E_x E_y$

Variance of x:
$\sigma_x^2 = \int_{-\infty}^{+\infty} (x - E_x)^2 f(x)\,dx$ (continuous) or $\sigma_x^2 = \sum_k (x_k - E_x)^2 f(x_k)$ (discrete)
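These definitions can be tried on a sample of observed lifetimes. The sketch below estimates F(t), the expected value, the variance, and the covariance from data, treating each observation as equally likely; the lifetimes are made-up numbers and the function names are illustrative assumptions.

```python
from bisect import bisect_right

def empirical_cdf(samples, t):
    """Estimate F(t) = prob[x <= t] as the fraction of samples not exceeding t."""
    ordered = sorted(samples)
    return bisect_right(ordered, t) / len(ordered)

def mean(samples):
    """Discrete expected value with f(x_k) = 1/N for each observation."""
    return sum(samples) / len(samples)

def covariance(xs, ys):
    """E[(x - Ex)(y - Ey)] over paired observations."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def variance(samples):
    """Variance is the covariance of a variable with itself."""
    return covariance(samples, samples)

# Hypothetical lifetimes (arbitrary time units), not data from the slide.
lifetimes = [2, 5, 7, 11, 13, 17, 19, 23, 29, 31]
print("F(13)    =", empirical_cdf(lifetimes, 13))
print("mean     =", mean(lifetimes))
print("variance =", variance(lifetimes))
```

The identity $\psi_{x,y} = E[xy] - E_x E_y$ can be checked on the same data by computing both sides.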


Some Simple Probability Distributions

[Figure: CDF F(x) and pdf f(x) plots for four distributions: Uniform, Exponential, Normal, Binomial.]

Reliability and MTTF

Reliability: R(t)
Probability that the system remains in the “Good” state through the interval [0, t]

[Figure: two-state nonrepairable system. Start state: Up; a Failure transition leads to Down.]

$R(t + dt) = R(t)\,[1 - z(t)\,dt]$, where z(t) is the hazard function

Constant hazard function: $z(t) = \lambda \Rightarrow R(t) = e^{-\lambda t}$
(exponential reliability law; the system failure rate is independent of its age)

$R(t) = 1 - F(t)$, where F(t) is the CDF of the system lifetime, or its unreliability

Mean time to failure: MTTF
$\mathrm{MTTF} = \int_0^{+\infty} t f(t)\,dt = \int_0^{+\infty} R(t)\,dt$
The first integral is the expected value of the lifetime; the second is the area under the reliability curve (easily provable).
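Both results on this slide follow in a few lines. The derivation below is a sketch under two stated assumptions that the slide leaves implicit: $R(0) = 1$, and $t\,R(t) \to 0$ as $t \to \infty$.

```latex
\begin{align*}
R(t + dt) = R(t)\,[1 - z(t)\,dt]
  &\;\Rightarrow\; \frac{dR(t)}{dt} = -z(t)\,R(t)
   \;\Rightarrow\; R(t) = \exp\!\left(-\int_0^t z(x)\,dx\right), \\
z(t) = \lambda
  &\;\Rightarrow\; R(t) = e^{-\lambda t}. \\[4pt]
\text{Since } f(t) = \frac{dF(t)}{dt} = -\frac{dR(t)}{dt}: \quad
\mathrm{MTTF} = \int_0^{+\infty} t\,f(t)\,dt
  &= \bigl[-t\,R(t)\bigr]_0^{+\infty} + \int_0^{+\infty} R(t)\,dt
   = \int_0^{+\infty} R(t)\,dt .
\end{align*}
```

The second line uses integration by parts; the boundary term vanishes under the second assumption.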


Failure Distributions of Interest

Exponential: $z(t) = \lambda$, $R(t) = e^{-\lambda t}$, MTTF $= 1/\lambda$

Weibull: $z(t) = \alpha\lambda(\lambda t)^{\alpha - 1}$, $R(t) = e^{-(\lambda t)^{\alpha}}$, MTTF $= (1/\lambda)\,\Gamma(1 + 1/\alpha)$

Rayleigh: $z(t) = 2\lambda(\lambda t)$, $R(t) = e^{-(\lambda t)^{2}}$, MTTF $= (1/\lambda)\,\sqrt{\pi}/2$

Erlang: MTTF $= k/\lambda$

Gamma: Erlang and exponential are special cases

Normal: Reliability and MTTF formulas are complicated

Discrete versions:
Geometric: $R(k) = q^{k}$
Binomial
Discrete Weibull
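The Weibull formulas cover two of the other entries: α = 1 gives the exponential case and α = 2 the Rayleigh case. The sketch below checks this numerically and compares the closed-form MTTF with the area under the reliability curve; the function names, the trapezoidal integrator, and the parameter values are illustrative assumptions.

```python
import math

def weibull_hazard(t, lam, alpha):
    """z(t) = alpha * lam * (lam * t)^(alpha - 1), for t > 0."""
    return alpha * lam * (lam * t) ** (alpha - 1)

def weibull_reliability(t, lam, alpha):
    """R(t) = exp(-(lam * t)^alpha)."""
    return math.exp(-((lam * t) ** alpha))

def weibull_mttf(lam, alpha):
    """MTTF = (1/lam) * Gamma(1 + 1/alpha)."""
    return math.gamma(1.0 + 1.0 / alpha) / lam

def area_under_reliability(lam, alpha, t_max, steps):
    """Trapezoidal estimate of the integral of R(t) from 0 to t_max."""
    h = t_max / steps
    total = 0.5 * (weibull_reliability(0.0, lam, alpha)
                   + weibull_reliability(t_max, lam, alpha))
    for i in range(1, steps):
        total += weibull_reliability(i * h, lam, alpha)
    return total * h

lam = 0.5  # hypothetical rate parameter
print("exponential MTTF:", weibull_mttf(lam, 1.0))
print("Rayleigh MTTF:   ", weibull_mttf(lam, 2.0))
print("numeric area:    ", area_under_reliability(lam, 2.0, 20.0, 20000))
```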


Comparing Reliabilities

Reliability gain: $R_2 / R_1$

Reliability difference: $R_2 - R_1$

Reliability improvement index: $\mathrm{RII} = \log R_1(t_M) / \log R_2(t_M)$

Reliability improvement factor: $\mathrm{RIF}_{2/1} = [1 - R_1(t_M)] / [1 - R_2(t_M)]$
Example: [1 - 0.9] / [1 - 0.99] = 10

Mission time extension: $\mathrm{MTE}_{2/1}(r_G) = T_2(r_G) - T_1(r_G)$

Mission time improvement factor: $\mathrm{MTIF}_{2/1}(r_G) = T_2(r_G) / T_1(r_G)$

[Figure: reliability functions $R_1(t)$ and $R_2(t)$ for Systems 1 and 2, plotted as system reliability (R, from 0.0 to 1.0) versus time (t). Marked on the plot: $t_M$ with $R_1(t_M)$ and $R_2(t_M)$; $r_G$ with $T_1(r_G)$ and $T_2(r_G)$; $\mathrm{MTTF}_1$ and $\mathrm{MTTF}_2$.]
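The comparison measures are simple to compute once $R_1$ and $R_2$ are known. The sketch below makes two assumptions that go beyond the slide: $T_i(r)$ is read as the time at which system i's reliability drops to r, and both systems follow the exponential law so that $T_i(r) = -\ln r / \lambda_i$. Function names and rates are illustrative.

```python
import math

def reliability_gain(r1, r2):
    return r2 / r1

def reliability_difference(r1, r2):
    return r2 - r1

def rif(r1, r2):
    """Reliability improvement factor: ratio of unreliabilities."""
    return (1.0 - r1) / (1.0 - r2)

def rii(r1, r2):
    """Reliability improvement index: log R1 / log R2."""
    return math.log(r1) / math.log(r2)

def time_to_reach(r_goal, lam):
    """Assumed exponential law: solve exp(-lam * T) = r_goal for T."""
    return -math.log(r_goal) / lam

def mte(r_goal, lam1, lam2):
    """Mission time extension T2(r) - T1(r)."""
    return time_to_reach(r_goal, lam2) - time_to_reach(r_goal, lam1)

def mtif(r_goal, lam1, lam2):
    """Mission time improvement factor T2(r) / T1(r)."""
    return time_to_reach(r_goal, lam2) / time_to_reach(r_goal, lam1)

print("RIF :", rif(0.9, 0.99))   # the slide's example
print("RII :", rii(0.9, 0.99))
print("MTIF:", mtif(0.9, lam1=2e-3, lam2=1e-3))  # hypothetical rates
```

For two exponential systems, MTIF reduces to $\lambda_1 / \lambda_2$ regardless of the chosen $r_G$.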


Availability, MTTR, and MTBF

(Interval) availability: A(t)
Fraction of time that the system is in the “Up” state during the interval [0, t]

[Figure: two-state repairable system. Start state: Up; a Failure transition leads to Down, and a Repair transition leads back to Up.]

Availability = Reliability, when there is no repair

Availability is a function not only of how rarely a system fails (reliability) but also of how quickly it can be repaired (time to repair)

Pointwise availability: a(t)
Probability that the system is available at time t
$A(t) = (1/t) \int_0^t a(x)\,dx$

Steady-state availability: $A = \lim_{t \to \infty} A(t)$

$A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}} = \frac{\mathrm{MTTF}}{\mathrm{MTBF}} = \frac{\mu}{\lambda + \mu}$
where μ is the repair rate, with 1/μ = MTTR (will justify this equation later)

In general, μ >> λ, leading to A ≅ 1
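The steady-state availability can be computed from either the mean times or the rates, and the two forms agree when MTTF = 1/λ and MTTR = 1/μ. A minimal sketch follows, with hypothetical numbers and illustrative function names.

```python
def availability_from_times(mttf, mttr):
    """A = MTTF / (MTTF + MTTR) = MTTF / MTBF."""
    return mttf / (mttf + mttr)

def availability_from_rates(lam, mu):
    """A = mu / (lam + mu), with failure rate lam and repair rate mu."""
    return mu / (lam + mu)

# Hypothetical system: fails every 1000 hours on average, takes 2 hours to repair.
mttf_hours = 1000.0
mttr_hours = 2.0

a_times = availability_from_times(mttf_hours, mttr_hours)
a_rates = availability_from_rates(1.0 / mttf_hours, 1.0 / mttr_hours)

print(f"A from times: {a_times:.6f}")
print(f"A from rates: {a_rates:.6f}")
```

Here μ is 500 times λ, which illustrates why A is close to 1 whenever repair is much faster than failure.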


System Up and Down Times

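[Figure: timeline of a system alternating between Up and Down over the interval [0, t], with instants t1, t2, t'1, t'2 marked and the time to first failure, the time between failures, and the repair time labeled. The accompanying two-state diagram shows Up (start state) and Down, linked by Failure and Repair transitions.]

Short repair time implies good maintainability (serviceability)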


Performability and MCBF

Performability: P
Composite measure, incorporating both performance and reliability

[Figure: three-state degradable system with states Up 2, Up 1, and Down, a marked start state, and transitions labeled Partial failure, Failure, Partial repair, and Repair.]

Simple example: the worth of “Up2” is twice that of “Up1”

$P = 2\,p_{Up2} + p_{Up1}$, where $p_{Upi}$ = probability that the system is in state Upi

$p_{Up2} = 0.92$, $p_{Up1} = 0.06$, $p_{Down} = 0.02$, $P = 1.90$ (system performance equivalent to that of 1.9 processors on average)

Performability improvement factor of this system (akin to RIF) relative to a fail-hard system that goes down when either processor fails:
PIF = (2 - 2 × 0.92) / (2 - 1.90) = 1.6

Question: What is system availability here?
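The arithmetic of this example can be reproduced as a weighted sum over states. In the sketch below, the dictionaries and function name are illustrative assumptions; the fail-hard system is modeled as delivering 2 units of worth only in state Up2, which matches the slide's 2 × 0.92 term.

```python
def performability(state_probs, state_worth):
    """Sum of (probability of state) x (worth of state) over all states."""
    return sum(state_probs[s] * state_worth[s] for s in state_probs)

# State probabilities from the slide's example.
probs = {"Up2": 0.92, "Up1": 0.06, "Down": 0.02}

# Degradable system: Up2 is worth twice Up1; Down is worth nothing.
degradable_worth = {"Up2": 2.0, "Up1": 1.0, "Down": 0.0}

# Fail-hard system: any processor failure takes it down, so only Up2 counts.
fail_hard_worth = {"Up2": 2.0, "Up1": 0.0, "Down": 0.0}

p_degradable = performability(probs, degradable_worth)
p_fail_hard = performability(probs, fail_hard_worth)

# Ratio of performability shortfalls relative to the ideal value of 2.
pif = (2.0 - p_fail_hard) / (2.0 - p_degradable)

print(f"P (degradable) = {p_degradable:.2f}")
print(f"P (fail-hard)  = {p_fail_hard:.2f}")
print(f"PIF            = {pif:.1f}")
```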


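System Up, Partially Up, and Down Times

[Figure: timeline over the interval [0, t] of a system moving among the Up, Partially Up, and Down levels, with instants t1, t2, t'2, t'1, t3, t'3 marked, events labeled Partial Failure, Partial Repair, and Total Failure, and MCBF indicated. The accompanying three-state diagram shows Up 2, Up 1, and Down with Partial failure, Failure, Partial repair, and Repair transitions.]

Important to prevent direct transitions to the “Down” state (coverage)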


Integrity and Safety

Risk: Prob. of being in the “Unsafe Failed” state
There may be multiple unsafe states, each with a different consequence (cost)

[Figure: three-state fail-safe system. Start state: Good; Failure transitions lead to Safe Failed and to Unsafe Failed.]

Simple analysis
Lump the “Safe Failed” state with the “Good” state; proceed as in reliability analysis

More detailed analysis
Even though the “Safe Failed” state is more desirable than “Unsafe Failed”, it is still not as desirable as the “Good” state; so keeping it separate makes sense

For example, if a repair transition is introduced between the “Safe Failed” and “Good” states, we can tackle questions such as the expected outage of the system in safe mode, and thus its availability
