Hardware Fault Tolerance Through Simultaneous Multithreading (part 3)

CS717

Jonathan Winter

3 SMT + Fault Tolerance Papers

• Eric Rotenberg, "AR-SMT - A Microarchitectural Approach to Fault Tolerance in Microprocessors", Symposium on Fault-Tolerant Computing, 1999.

• Steven K. Reinhardt and Shubhendu S. Mukherjee, "Transient Fault Detection via Simultaneous Multithreading", ISCA 2000.

• Shubhendu S. Mukherjee, Michael Kontz and Steven K. Reinhardt, "Detailed Design and Evaluation of Redundant Multithreading Alternatives", ISCA 2002.

Outline

1. Background
• SMT
• Hardware fault tolerance

2. AR-SMT
• Basic mechanisms
• Implementation issues
• Simulation and Results

3. Transient Fault Detection via SMT
• Sphere of replication
• Basic mechanisms
• Comparison to AR-SMT
• Simulation and Results

4. Redundant Multithreading Alternatives
• Realistic processor implementation
• CRT
• Simulation and Results

5. Fault Recovery

6. Future Lectures ?

Sphere of Replication

• Size of sphere of replication
– Two alternatives: with and without register file
– Instruction and data caches kept outside

Redundant Multithreading Alternatives

• Discusses real-world fault-tolerant processors

• Evaluates SRT on a more realistic and detailed processor than the previous paper

• Proposes Chip-level Redundant Threading (CRT)

• Detailed simulation results with new metric
– Relative SMT-Efficiency

Real World SMT, CMP, and FT

• Simulated processor based on Compaq Alpha Araña (a.k.a. 21464 or EV8)

• IBM Power4 and HP Mako are 2-way CMPs

• Compaq Himalaya uses multi-chip lockstepping

• IBM S/390 G5 uses on-chip lockstepping

Detailed Processor Description

• 8 way SMT with 4 hardware contexts

• IBOX fetches chunks of 8 instructions and forwards them to the PBOX

• Complex branch prediction mechanism
– Line predictor
– Branch predictor, jump target predictor, and return address stack

Detailed Processor Description (part 2)

• PBOX performs initial processing
– Register renaming and partial decoding
– Maintains tables for recovery from mispredictions

• QBOX issues instructions out-of-order to the EBOX, FBOX, or MBOX
– Retires instructions and commits architectural state in program order
– Consists of instruction queue, in-flight table, and completion unit

• MBOX conducts loads and stores
– Load and store queues divided between threads
– Available queue space is very small per thread

SRT on Detailed Processor

• Input replication uses LVQ variant that allows out-of-order load issue from trailing thread

• Output comparison is the same as in the original SRT design
– Suggested improvement: a per-thread store queue (SQ)

• PBOX storage structures made per-thread to avoid deadlock situations

• Branch outcome queue converted to line prediction queue

• Preferential space redundancy (PSR) implemented to better cover permanent faults
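The input replication bullet above can be illustrated with a small sketch of a load value queue (LVQ). This is a simplified model, not the paper's hardware design: the class name, the `record`/`replay` methods, and the use of a per-load tag for lookup are assumptions made for illustration. The idea shown is that the leading thread deposits each load's address and value, and the trailing thread reads the value from the queue instead of the cache, in any order, while the address is checked for agreement.

```python
class FaultDetected(Exception):
    """Raised when the leading and trailing threads disagree."""


class LoadValueQueue:
    """Hypothetical tag-indexed LVQ; a tag identifies one dynamic load."""

    def __init__(self):
        # tag -> (address, value) deposited by the leading thread
        self._entries = {}

    def record(self, tag, address, value):
        """Leading thread: deposit the address and value of a load."""
        self._entries[tag] = (address, value)

    def replay(self, tag, address):
        """Trailing thread: fetch the value for its copy of the load.

        Lookup is by tag, so trailing loads may issue out of order.
        An address mismatch indicates a fault in one of the threads.
        """
        lead_address, value = self._entries.pop(tag)
        if lead_address != address:
            raise FaultDetected(f"load {tag}: address mismatch")
        return value


lvq = LoadValueQueue()
lvq.record(tag=0, address=0x100, value=7)
lvq.record(tag=1, address=0x108, value=9)

# The trailing thread issues its loads in the opposite order.
out_of_order = [lvq.replay(tag=1, address=0x108),
                lvq.replay(tag=0, address=0x100)]
```

Because the trailing thread never touches the data cache, both threads are guaranteed to see the same load values even if memory changes between the two executions.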

Chip-level Redundant Threading

• Each core executes a lead and trailing thread from different programs

• LVQ and line prediction queue must forward data to other processor’s trailing thread

• Store buffer must receive retired stores from other processor for comparison
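The store comparison described above can be sketched as follows. This is a minimal model under stated assumptions: the `StoreComparator` class and its method names are hypothetical, and it assumes both threads retire stores in program order so that a simple FIFO is enough to pair them up. A leading-thread store is held until the matching trailing-thread store arrives (in CRT, from the other core), and it is released to memory only if address and data agree.

```python
from collections import deque


class FaultDetected(Exception):
    """Raised when a leading and trailing store disagree."""


class StoreComparator:
    """Hypothetical output-comparison stage in front of memory."""

    def __init__(self):
        self.pending = deque()  # leading stores awaiting verification
        self.committed = []     # verified stores released to memory

    def leading_store(self, address, data):
        """Hold a retired leading-thread store until it is checked."""
        self.pending.append((address, data))

    def trailing_store(self, address, data):
        """Compare a retired trailing-thread store with the oldest
        pending leading store; release it only if both fields match."""
        lead = self.pending.popleft()
        if lead != (address, data):
            raise FaultDetected(f"store mismatch: {lead} vs {(address, data)}")
        self.committed.append(lead)


cmp_unit = StoreComparator()
cmp_unit.leading_store(0x200, 42)
cmp_unit.leading_store(0x208, 43)
cmp_unit.trailing_store(0x200, 42)  # matches, so the store is released
```

In this model the second leading store stays pending until its trailing copy arrives, which is the effect behind the longer store lifetimes and store queue pressure reported in the results.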

CRT Advantages

• CRT checks much less information to detect faults than lockstepped processors

• Lockstep fault detection circuitry is on critical path for cache misses

• CRT executes threads more efficiently because of SMT dynamic scheduling on each processor

Simulation Environment

• Asim performance model framework used
– Simulates processor like Alpha 21464

• All 18 SPEC CPU95 benchmarks used

• Combinations of SPEC used for multi-program simulations

• Lockstepped processor simulated with zero fault detection delay (Lock0) and with 8-cycle delay (Lock8)

• SRT architecture simulated with delays for forwarding line predictions and load values
– Extra delays for CRT architecture

SMT-Efficiency

• SMT-Efficiency (SMT-E) used instead of IPC
– SMT-E of an individual thread is the IPC of the thread in SMT mode divided by its IPC in single-thread mode on the same SMT processor
– Overall SMT-E is the arithmetic mean of the individual SMT-Efficiencies
– A. Snavely and D. M. Tullsen, “Symbiotic Job Scheduling for a Simultaneous Multithreading Processor”, ASPLOS 2000
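The SMT-Efficiency definition above translates directly into a few lines of code. The function names and the IPC figures below are made up for illustration; only the formula (per-thread IPC ratio, then arithmetic mean) comes from the slide.

```python
from statistics import mean


def smt_efficiency(ipc_smt, ipc_single):
    """Per-thread SMT-E: IPC in SMT mode over IPC when run alone."""
    return ipc_smt / ipc_single


def overall_smt_efficiency(threads):
    """Arithmetic mean of per-thread SMT-E values.

    `threads` is a list of (ipc_smt, ipc_single) pairs.
    """
    return mean(smt_efficiency(smt, single) for smt, single in threads)


# Hypothetical two-thread workload: each thread keeps 60% of the
# IPC it would achieve with the processor to itself.
workload = [(1.2, 2.0), (0.9, 1.5)]
overall = overall_smt_efficiency(workload)
```

Normalizing each thread by its own single-thread IPC keeps a high-IPC benchmark from dominating the average, which is the stated reason for preferring SMT-E over raw IPC.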

SMT-Speedup (= SMT-Efficiency)

– Y. Sazeides and T. Juan, “How to Compare the Performance of Two SMT Microarchitectures”, ISPASS 2001

Preferential Space Redundancy

• Without PSR, 65% of instructions execute on same functional unit

• With PSR, only 0.06% of instructions run on the same unit

• No performance degradation is experienced
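A rough sketch of the steering decision behind preferential space redundancy follows. It assumes a processor whose functional units are split into two clusters (called `UPPER` and `LOWER` here); the names, the two-cluster split, and the fallback rule are illustrative assumptions, not details taken from the slides. The leading instruction's cluster is remembered, and the trailing copy prefers the other cluster, falling back to the same one only when the preferred cluster has no room.

```python
UPPER, LOWER = 0, 1  # hypothetical functional-unit clusters


def steer_trailing(leading_cluster, other_cluster_has_room=True):
    """Pick the cluster for the trailing copy of an instruction.

    Prefer the cluster the leading copy did NOT use; fall back to the
    same cluster only if the preferred one cannot accept the instruction.
    """
    preferred = LOWER if leading_cluster == UPPER else UPPER
    return preferred if other_cluster_has_room else leading_cluster


def same_unit_fraction(leading, trailing):
    """Fraction of instruction pairs that ran on the same cluster."""
    same = sum(lead == trail for lead, trail in zip(leading, trailing))
    return same / len(leading)


leading = [UPPER, LOWER, LOWER, UPPER]
trailing = [steer_trailing(cluster) for cluster in leading]
overlap = same_unit_fraction(leading, trailing)
```

Because the preference is not a hard constraint, a small residue of pairs can still share a unit, which is consistent with the non-zero figure reported with PSR enabled.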

SRT – One Logical Thread

• SRT 32% slower than single thread on SMT

• SRT 11% faster than running two redundant copies

• Degradation 30% with per-thread store queue

• Best-case 26% degradation with oracle store queue

SRT – Two Logical Threads

• Degradation of SRT is 40%

• Per-thread store queue gives 32% degradation

• Store lifetime drops from 44 cycles to vs. 39 for single thread

• Oracle store queue gives 5% better efficiency

Chip-level Redundant Threading

• With one logical thread, CRT performs similarly to lockstepping

• With two logical threads, CRT beats Lock0 and Lock8 by 10% and 2% respectively

• Adding the per-thread store queue causes CRT to beat Lock8 by 13% on average (22% maximum)

• Using an oracle store queue improves performance by 6% more

CRT with Four Logical Threads

• Initial CRT configuration is no better than Lock8

• Adding per-thread store queue gives CRT 13% better performance than Lock8

• Using an oracle store queue improves performance by only another 2%

Conclusions

• The benefits of SRT are not as great as in the original paper when using a detailed model
– 30% and 32% degradation seen on single-thread and multithreaded workloads

• SRT methods can be used to detect permanent faults

• Chip-level redundant threading gives improved performance over lockstepped processors
– Overall CRT provided a 13% improvement

Transient Fault Recovery

• AR-SMT suggests that the R-stream could be used as a checkpoint for recovery

• SRT suggests checkpoint/restart or failover
– Argues that since faults are infrequent, they will have a minor impact on performance

Future Lectures ?

• Hardware Transient Fault Recovery
– T.N. Vijaykumar, Irith Pomeranz, and Karl Cheng, “Transient-Fault Recovery Using Simultaneous Multithreading”, ISCA 2002
– Mohamed Gomaa, Chad Scarbrough, T.N. Vijaykumar, and Irith Pomeranz, “Transient-Fault Recovery for Chip Multiprocessors”, ISCA 2003

• Slipstream Processors (an AR-SMT extension)
– Karthik Sundaramoorthy, Zach Purser, and Eric Rotenberg, “Slipstream Processors: Improving both Performance and Fault Tolerance”, ASPLOS 2000
– Khaled Z. Ibrahim, Gregory T. Byrd, and Eric Rotenberg, “Slipstream Execution Mode for CMP-Based Multiprocessors”, HPCA 2003
