
Reliability. Threads for Fault Tolerance. Multiprocessors: Transient fault detection.

Dec 20, 2015


Reliability


Threads for Fault Tolerance

Multiprocessors: Transient fault detection


Transient Faults

Faults that persist for a “short” duration
Cause: cosmic rays, energetic particles originating from outer space
Effect: knock off electrons, discharge capacitors
Solution: no practical absorbent for cosmic rays
1 fault per 1000 computers per year (estimated fault rate)
Future is worse: smaller feature size, higher transistor count, reduced noise margin


Background

Fault-tolerant systems use redundancy to improve reliability:
  Time redundancy: separate executions
  Space redundancy: separate physical copies of resources (DMR/TMR)
  Data redundancy:
    ECC: automatic repeat request (ARQ), forward error correction (FEC)
    Parity: odd/even

Examples:
  IBM: duplicated pipelines, spare processors, ECC in memories...
  HP: DMR/TMR processors, parity/ECC in buses, memories...
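The space and data redundancy listed above can be illustrated with a small software sketch. It is only an analogy for what the hardware does: the function names (`tmr_vote`, `dmr_check`, `even_parity_bit`, `parity_ok`) are hypothetical and do not come from the slides.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """TMR: return the value that at least two of the three copies agree on."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: all three copies disagree")
    return value


def dmr_check(a, b):
    """DMR: two copies can detect a mismatch but cannot tell which one is wrong."""
    return a == b


def even_parity_bit(word):
    """Parity bit that makes the total number of 1 bits (word + parity) even."""
    return bin(word).count("1") % 2


def parity_ok(word, parity):
    """True if the stored parity bit still matches the word."""
    return even_parity_bit(word) == parity
```

With three copies a single corrupted value is outvoted, so `tmr_vote(7, 7, 9)` yields 7, while two copies only report that something is wrong. A single flipped bit changes the parity of a word, which is why `parity_ok` rejects it; two flipped bits would go unnoticed.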


Multiprocessors: Fault Detection

Chip-level Redundantly Threaded processor
  Replicates register values but not memory values
  The leading thread commits stores only after checking
    Memory is guaranteed to be correct
    Other instructions commit without checking
  The leading thread sends committed values for:
    branch outcomes
    load/store values
    store addresses


Sphere of Replication (SoR)

Logical boundary of redundant execution within a system
  Components within are protected via redundant execution
  Components outside must be protected via other means

Its size matters:
  Error detection latency
  Stored-state size


Example Spheres of Replication

Compaq Himalaya
ORH-Dual: On-Chip Replicated Hardware (similar to IBM G5)


Fault Detection in Compaq Himalaya System

Replicated Microprocessors + Cycle-by-Cycle Lockstepping


Fault Detection via Simultaneous Multithreading (SMT)

Replicated Microprocessors + Cycle-by-Cycle Lockstepping


Concept

SMT improves the performance of a processor by:
  allowing independent threads to execute simultaneously
  doing so in different functional units

Redundant Multithreading (RMT):
  leverages SMT’s properties to allow fault detection for microprocessors
  runs two copies of the same program as independent threads
  compares their outputs and initiates recovery in case of mismatch
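The RMT idea (two copies of the same program, compare outputs, recover on mismatch) can be sketched in software. This is a loose analogy, not how the hardware works: the copies run back to back here rather than as simultaneous threads, recovery is assumed to be plain re-execution, and `run_redundantly` and `corrupt_first_call` are hypothetical names.

```python
def run_redundantly(program, inputs, max_attempts=3):
    """Run `program` twice per attempt; accept the result only if both copies agree.

    Returns (result, attempts_used). Re-execution stands in for recovery.
    """
    for attempt in range(1, max_attempts + 1):
        leading = program(inputs)    # first copy
        trailing = program(inputs)   # redundant copy of the same computation
        if leading == trailing:
            return leading, attempt
        # Mismatch: a transient fault hit one copy, so try again.
    raise RuntimeError("outputs still disagree after all attempts")


def corrupt_first_call(program):
    """Hypothetical fault injector: flips the low bit of the first result only."""
    calls = {"n": 0}

    def wrapped(inputs):
        calls["n"] += 1
        result = program(inputs)
        return result ^ 1 if calls["n"] == 1 else result

    return wrapped
```

Wrapping `sum` with `corrupt_first_call` and passing `[1, 2, 3]` makes the first pair of results disagree (7 versus 6); the second attempt agrees and the call returns `(6, 2)`. Because the injected fault is transient, re-execution is enough to recover in this sketch.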


Input Replication

Load Value Queue (LVQ)
  Keeps threads on the same path despite I/O or multiprocessor (MP) writes
  Out-of-order load issue possible
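A minimal sketch of the input-replication idea behind the LVQ: only the leading thread reads memory, and the trailing thread receives the same value from a queue, so a write from outside arriving between the two loads cannot make the threads diverge. The class and method names are hypothetical, and the sketch assumes the trailing thread consumes entries in program order.

```python
from collections import deque


class LoadValueQueue:
    """FIFO of (address, value) pairs produced by the leading thread's loads."""

    def __init__(self):
        self._entries = deque()

    def leading_load(self, memory, address):
        """The leading thread reads memory and records what it saw."""
        value = memory[address]
        self._entries.append((address, value))
        return value

    def trailing_load(self, address):
        """The trailing thread never touches memory; it replays the recorded value."""
        recorded_address, value = self._entries.popleft()
        if recorded_address != address:
            # The two threads computed different load addresses.
            raise RuntimeError("fault detected: load address mismatch")
        return value
```

If `memory[0x10]` is changed by another processor after `leading_load` but before `trailing_load`, the trailing thread still receives the value the leading thread saw. The address comparison doubles as a check on the address computation itself.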


Output Comparison

Compare & validate output before sending it outside the SoR


Store Queue Comparator (STQ)

Compares outputs to the data cache
Catches faults before they propagate to the rest of the system


Store Queue Comparator (cont’d)

Extends residence time of leading-thread stores
  Size constrained by cycle time goal
  Base CPU statically partitions single queue among threads
  Potential solution: per-thread store queues

Deadlock if matching trailing store cannot commit
  Several small but crucial changes to avoid this
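The store comparison described on the two slides above can be sketched as follows: a leading-thread store waits in a queue and reaches memory only once the trailing thread produces a store with the same address and value. `StoreComparator` and its methods are hypothetical names; queue sizing and the deadlock-avoidance changes mentioned above are not modelled.

```python
from collections import deque


class StoreComparator:
    """Holds leading-thread stores until the trailing thread confirms them."""

    def __init__(self, memory):
        self.memory = memory
        self._pending = deque()  # leading stores awaiting verification

    def leading_store(self, address, value):
        """Buffer the store; memory is not updated yet."""
        self._pending.append((address, value))

    def trailing_store(self, address, value):
        """Compare against the oldest leading store and commit only on a match."""
        if not self._pending:
            raise RuntimeError("trailing store has no matching leading store")
        leading = self._pending.popleft()
        if leading != (address, value):
            raise RuntimeError("fault detected: store mismatch")
        self.memory[address] = value  # verified, safe to leave the sphere of replication
```

Holding each leading store until its trailing twin arrives is what extends the residence time of leading-thread stores: the further ahead the leading thread runs, the more entries `_pending` must hold.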


Branch Outcome Queue (BOQ)

Forwards leading-thread branch targets to trailing-thread fetch
100% prediction accuracy in absence of faults
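A toy illustration of the branch outcome queue: the leading thread pushes the resolved target of every branch, and the trailing thread uses those targets as its predictions. Without faults the predictions are always right, so a disagreement signals a fault. The countdown loop, the `"loop"`/`"exit"` targets, and all names are assumptions made for this sketch.

```python
from collections import deque


class BranchOutcomeQueue:
    """FIFO of resolved branch targets, filled by the leading thread."""

    def __init__(self):
        self._targets = deque()

    def push(self, target):
        self._targets.append(target)

    def next_target(self):
        return self._targets.popleft()


def leading_run(n, boq):
    """Count down from n, pushing the resolved target of each loop branch."""
    while True:
        target = "loop" if n > 0 else "exit"
        boq.push(target)
        if target == "exit":
            return
        n -= 1


def trailing_run(n, boq):
    """Repeat the countdown, fetching along the forwarded targets.

    Returns the number of branches checked; raises if a forwarded target
    differs from what this thread resolves on its own.
    """
    checked = 0
    while True:
        predicted = boq.next_target()
        actual = "loop" if n > 0 else "exit"
        checked += 1
        if predicted != actual:
            raise RuntimeError("fault detected: branch outcome mismatch")
        if actual == "exit":
            return checked
        n -= 1
```

Running `leading_run(3, boq)` followed by `trailing_run(3, boq)` checks four branches (three loop-back branches and the exit) with no mismatch. If the trailing thread's counter were corrupted to 2, its third branch would resolve to `"exit"` while the queue says `"loop"`, and the mismatch would be reported.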


Simultaneous & Redundantly Threaded Processor (SRT)

SRT = SMT + Fault Detection

Less hardware compared to replicated microprocessors
  SMT needs ~5% more hardware over uniprocessor
  SRT adds very little hardware overhead to existing SMT

Better performance than complete replication
  Better use of resources

Lower cost


Issues

Cycle-by-cycle output comparison and input replication:
  Equivalent insts from different threads may execute in different cycles
  Equivalent insts from different threads might execute in different order

Precise scheduling of the threads crucial for optimal performance
  Branch misprediction
  Cache miss