Reliability
Dec 20, 2015
Transient Faults
Faults that persist for a “short” duration
  Cause: cosmic rays, energetic particles originating from outer space
  Effect: knock off electrons, discharge capacitors
  Solution: no practical absorbent for cosmic rays
  Estimated fault rate: 1 fault per 1000 computers per year
Future is worse
  smaller feature size, higher transistor count, reduced noise margin
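To put the estimated rate in perspective, the arithmetic can be sketched for a fleet of machines. This is only an illustration: the function names are hypothetical, and the Poisson arrival model is an assumption that the slides do not state.

```python
import math

# Estimated rate from the slide: 1 fault per 1000 computers per year.
FAULTS_PER_COMPUTER_YEAR = 1 / 1000


def expected_faults(num_computers: int, years: float = 1.0) -> float:
    """Expected number of transient faults across a fleet of computers."""
    return num_computers * years * FAULTS_PER_COMPUTER_YEAR


def prob_at_least_one_fault(num_computers: int, years: float = 1.0) -> float:
    """P(at least one fault), ASSUMING faults arrive as a Poisson process."""
    return 1.0 - math.exp(-expected_faults(num_computers, years))
```

Under these assumptions, a fleet of 1000 computers expects about one fault per year, and the chance of seeing at least one fault in that year is 1 - 1/e.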
Background
Fault-tolerant systems use redundancy to improve reliability:
  Time redundancy: separate executions
  Space redundancy: separate physical copies of resources
    DMR/TMR
  Data redundancy
    ECC: Automatic repeat request (ARQ), Forward error correction (FEC)
    Parity: odd/even
Examples:
  IBM: duplicated pipelines, spare processors, ECC in memories...
  HP: DMR/TMR processors, Parity/ECC in buses, memories...
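Two of the schemes above are small enough to sketch in software: a parity bit (data redundancy) and a TMR majority vote (space redundancy). This is a minimal illustration, not how the IBM or HP hardware is built; the function names are hypothetical.

```python
from collections import Counter


def parity_bit(word: int, even: bool = True) -> int:
    """Return the bit that makes the total number of 1s even (or odd)."""
    ones = bin(word).count("1")
    bit = ones % 2  # even parity: 1 only when the data has an odd number of 1s
    return bit if even else bit ^ 1


def parity_ok(word: int, stored_bit: int, even: bool = True) -> bool:
    """Detect a single-bit flip by recomputing parity over the data word."""
    return parity_bit(word, even) == stored_bit


def tmr_vote(a, b, c):
    """Majority vote over three redundant results (TMR)."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: fault cannot be masked")
    return value
```

Parity only detects an odd number of flipped bits and cannot correct them, whereas the TMR vote masks a fault in any single copy.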
Multiprocessors: Fault Detection
Chip-level Redundantly Threaded processor
  Replicates register values but not memory values
  The leading thread commits stores only after checking
    Memory is guaranteed to be correct
    Other instructions commit without checking
  The leading thread sends committed values for:
    branch outcomes
    load/store values
    store addresses
Sphere of Replication (SoR)
Logical boundary of redundant execution within a system
  Components within are protected via redundant execution
  Components outside must be protected via other means
Its size matters:
  Error detection latency
  Stored-state size
Example Spheres of Replication
Compaq Himalaya
ORH-Dual: On-Chip Replicated Hardware (similar to IBM G5)
Fault Detection via Simultaneous Multithreading (SMT)
Replicated Microprocessors + Cycle-by-Cycle Lockstepping
Concept
SMT improves the performance of a processor by allowing independent threads to execute simultaneously in different functional units
Redundant Multithreading (RMT):
  leverages SMT’s properties to allow fault detection for microprocessors
  runs two copies of the same program as independent threads
  compares their outputs and initiates recovery in case of mismatch
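The RMT idea of running two copies, comparing outputs, and recovering on mismatch can be sketched in software. This is only an analogy: the two copies run one after the other rather than as simultaneous hardware threads, and the retry-based recovery and the name run_redundantly are assumptions made for illustration.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_redundantly(program: Callable[[], T], max_retries: int = 3) -> T:
    """Run two copies of `program`, compare outputs, retry on mismatch."""
    for _ in range(max_retries):
        leading = program()   # first copy
        trailing = program()  # redundant copy
        if leading == trailing:
            return leading    # outputs agree: accept the result
        # Outputs differ: a fault hit one copy, so re-execute (assumed recovery).
    raise RuntimeError("outputs still disagree after retries")
```

A transient fault that corrupts one execution shows up as a mismatch; because the fault is short-lived, re-execution is expected to produce agreeing outputs.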
Input Replication
Load Value Queue (LVQ)
  Keep threads on same path despite I/O or MP writes
  Out-of-order load issue possible
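A rough software model of the LVQ: the leading thread performs the real load and records the address and value, and the trailing thread reads the recorded value instead of memory, so both threads see the same input even if memory changes in between. The class and method names are hypothetical, and this sketch assumes the trailing thread consumes loads in program order (a plain FIFO), which is a simplification.

```python
from collections import deque


class LoadValueQueue:
    """Hypothetical in-order model of a load value queue."""

    def __init__(self):
        self._fifo = deque()

    def leading_load(self, memory: dict, addr: int) -> int:
        """Leading thread: read memory and record (address, value)."""
        value = memory[addr]
        self._fifo.append((addr, value))
        return value

    def trailing_load(self, addr: int) -> int:
        """Trailing thread: take the recorded value, never touch memory."""
        lead_addr, value = self._fifo.popleft()
        if lead_addr != addr:
            raise RuntimeError("load address mismatch: fault detected")
        return value
```

For example, if another processor overwrites the location after the leading load, the trailing load still returns the original value and the two threads stay on the same path.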
Store Queue Comparator (STQ)
Store Queue Comparator
  Compares outputs headed to the data cache
  Catches faults before they propagate to the rest of the system
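The comparator's behavior can be modeled as follows: a leading-thread store is held back until the trailing thread produces the matching store, and only a verified store reaches the cache. This is a sketch with hypothetical names; a dict stands in for the data cache, and stores are assumed to arrive in program order from both threads.

```python
from collections import deque


class StoreComparator:
    """Hypothetical model: release a store only after both threads agree."""

    def __init__(self, cache: dict):
        self.cache = cache
        self._pending = deque()  # leading-thread stores awaiting verification

    def leading_store(self, addr: int, value: int) -> None:
        """Buffer the leading store; it must not reach the cache yet."""
        self._pending.append((addr, value))

    def trailing_store(self, addr: int, value: int) -> None:
        """Compare against the oldest leading store, then write the cache."""
        if not self._pending:
            raise RuntimeError("trailing store has no matching leading store")
        if self._pending.popleft() != (addr, value):
            raise RuntimeError("store mismatch: fault detected")
        self.cache[addr] = value
```

Because nothing is written until the comparison succeeds, a corrupted address or value is caught before it leaves the processor.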
Store Queue Comparator (cont’d)
Extends residence time of leading-thread stores
  Size constrained by cycle time goal
  Base CPU statically partitions single queue among threads
  Potential solution: per-thread store queues
Deadlock if matching trailing store cannot commit
  Several small but crucial changes to avoid this
Branch Outcome Queue (BOQ)
Branch Outcome Queue
  Forward leading-thread branch targets to trailing fetch
  100% prediction accuracy in absence of faults
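The BOQ can be pictured as a FIFO of resolved branch targets that the trailing thread uses in place of a branch predictor. The sketch below uses hypothetical names and assumes branches are forwarded and consumed in program order; the verify step reflects the point that a "misprediction" in the trailing thread can only mean a fault.

```python
from collections import deque


class BranchOutcomeQueue:
    """Hypothetical model of forwarding branch targets between threads."""

    def __init__(self):
        self._fifo = deque()

    def push(self, branch_pc: int, target: int) -> None:
        """Leading thread: record the resolved target of a branch."""
        self._fifo.append((branch_pc, target))

    def predict(self, branch_pc: int) -> int:
        """Trailing fetch: use the leading thread's target as the prediction."""
        lead_pc, target = self._fifo.popleft()
        if lead_pc != branch_pc:
            raise RuntimeError("branch PC mismatch: fault detected")
        return target

    @staticmethod
    def verify(predicted: int, computed: int) -> None:
        """Trailing execute: any disagreement indicates a fault."""
        if predicted != computed:
            raise RuntimeError("branch target mismatch: fault detected")
```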
Simultaneous & Redundantly Threaded Processor (SRT)
SRT = SMT + Fault Detection
Less hardware compared to replicated microprocessors
  SMT needs ~5% more hardware over uniprocessor
  SRT adds very little hardware overhead to existing SMT
Better performance than complete replication
  Better use of resources
Lower cost
Issues
Cycle-by-cycle output comparison and input replication:
  Equivalent instructions from different threads may execute in different cycles
  Equivalent instructions from different threads might execute in different order
Precise scheduling of the threads is crucial for optimal performance
  Branch misprediction
  Cache miss