
CS717 1 Hardware Fault Tolerance Through Simultaneous Multithreading (part 3) Jonathan Winter.

Jan 21, 2016


CS717

Hardware Fault Tolerance Through Simultaneous Multithreading (part 3)

Jonathan Winter


3 SMT + Fault Tolerance Papers

• Eric Rotenberg, "AR-SMT - A Microarchitectural Approach to Fault Tolerance in Microprocessors", Symposium on Fault-Tolerant Computing, 1999.

• Steven K. Reinhardt and Shubhendu S. Mukherjee, "Transient Fault Detection via Simultaneous Multithreading", ISCA 2000.

• Shubhendu S. Mukherjee, Michael Kontz and Steven K. Reinhardt, "Detailed Design and Evaluation of Redundant Multithreading Alternatives", ISCA 2002.


Outline

1. Background
   • SMT
   • Hardware fault tolerance
2. AR-SMT
   • Basic mechanisms
   • Implementation issues
   • Simulation and Results
3. Transient Fault Detection via SMT
   • Sphere of replication
   • Basic mechanisms
   • Comparison to AR-SMT
   • Simulation and Results
4. Redundant Multithreading Alternatives
   • Realistic processor implementation
   • CRT
   • Simulation and Results
5. Fault Recovery
6. Future Lectures ?


Sphere of Replication

• Size of sphere of replication
  – Two alternatives: with and without register file
  – Instruction and data caches kept outside


Redundant Multithreading Alternatives

• Discusses real-world fault-tolerant processors
• Evaluates SRT on a more realistic and detailed processor than the previous paper
• Proposes Chip-level Redundant Threading (CRT)
• Detailed simulation results with new metric
  – Relative SMT-Efficiency


Real World SMT, CMP, and FT

• Simulated processor based on Compaq Alpha Araña (a.k.a. 21464 or EV8)

• IBM Power4 and HP Mako are 2-way CMPs

• Compaq Himalaya uses multi-chip lockstepping

• IBM S/390 G5 uses on-chip lockstepping


Detailed Processor Description

• 8 way SMT with 4 hardware contexts

• IBOX fetches chunks of 8 instructions and forwards them to the PBOX

• Complex branch prediction mechanism
  – Line predictor
  – Branch predictor, jump target predictor, and return address stack


Detailed Processor Description (part 2)

• PBOX performs initial processing
  – Register renaming and partial decoding
  – Maintains tables for recovery from mispredictions

• QBOX issues instructions out of order to the EBOX, FBOX, or MBOX
  – Retires instructions and commits architectural state in program order
  – Consists of instruction queue, in-flight table, and completion unit

• MBOX conducts loads and stores
  – Load and store queues divided between threads
  – Available queue space is very small per thread


SRT on Detailed Processor

• Input replication uses LVQ variant that allows out-of-order load issue from trailing thread

• Output comparison is the same as in the original SRT design
  – A suggested improvement is a per-thread store queue (SQ)

• PBOX storage structures made per-thread to avoid deadlock situations

• Branch outcome queue converted to line prediction queue

• Preferential space redundancy (PSR) implemented to better cover permanent faults
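
The input replication and output comparison listed above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware described in the paper: the class names (`LoadValueQueue`, `StoreComparator`, `FaultDetected`) are hypothetical, an integer tag stands in for whatever the hardware uses to pair a trailing instruction with its leading copy, and memory is a plain dictionary.

```python
class FaultDetected(Exception):
    """Raised when the leading and trailing copies disagree."""


class LoadValueQueue:
    """Leading loads deposit their value; trailing loads read it back by tag.

    Lookup by tag (rather than strict FIFO order) models the variant that
    lets the trailing thread issue its loads out of order.
    """

    def __init__(self):
        self._entries = {}  # tag -> (address, value)

    def push(self, tag, address, value):
        self._entries[tag] = (address, value)

    def pop(self, tag, address):
        lead_address, value = self._entries.pop(tag)
        if lead_address != address:  # assumed address check
            raise FaultDetected(f"load {tag}: address mismatch")
        return value


class StoreComparator:
    """Holds leading stores until the trailing copy confirms them."""

    def __init__(self, memory):
        self.memory = memory
        self._pending = {}  # tag -> (address, value)

    def leading_store(self, tag, address, value):
        self._pending[tag] = (address, value)

    def trailing_store(self, tag, address, value):
        if self._pending.pop(tag) != (address, value):
            raise FaultDetected(f"store {tag}: mismatch")
        self.memory[address] = value  # only verified stores leave the sphere


memory = {0x10: 7, 0x20: 5}
lvq = LoadValueQueue()
stores = StoreComparator(memory)

# Leading thread: two loads from memory, then a store of their sum.
a = memory[0x10]
lvq.push(0, 0x10, a)
b = memory[0x20]
lvq.push(1, 0x20, b)
stores.leading_store(0, 0x30, a + b)

# Trailing thread: loads issue out of order (tag 1 before tag 0)
# and never touch memory; they read the LVQ instead.
tb = lvq.pop(1, 0x20)
ta = lvq.pop(0, 0x10)
stores.trailing_store(0, 0x30, ta + tb)
print(memory[0x30])  # 12
```

Because the trailing thread reads the queue instead of memory, both copies see identical load values even if another agent writes memory between the two executions.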


Chip-level Redundant Threading

• Each core executes a leading thread and a trailing thread from different programs

• LVQ and line prediction queue must forward data to other processor’s trailing thread

• Store buffer must receive retired stores from other processor for comparison
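
The cross-coupled thread placement can be sketched as follows. The function name `crt_placement` and the round-robin assignment are illustrative assumptions; the slides only state that each core runs a leading and a trailing thread from different programs.

```python
def crt_placement(programs, num_cores=2):
    """Place each program's leading and trailing copies on different cores."""
    placement = {core: [] for core in range(num_cores)}
    for i, program in enumerate(programs):
        lead_core = i % num_cores
        trail_core = (lead_core + 1) % num_cores  # always a different core
        placement[lead_core].append((program, "leading"))
        placement[trail_core].append((program, "trailing"))
    return placement


for core, threads in crt_placement(["A", "B"]).items():
    print(core, threads)
# 0 [('A', 'leading'), ('B', 'trailing')]
# 1 [('A', 'trailing'), ('B', 'leading')]
```

With this placement, every load value, line prediction, and retired store has to cross between cores, which is why the queues and store buffer above need inter-processor forwarding.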


CRT Advantages

• CRT checks much less information to detect faults than lockstepped processors

• Lockstep fault detection circuitry is on critical path for cache misses

• CRT executes threads more efficiently because of SMT dynamic scheduling on each processor


Simulation Environment

• Asim performance model framework used
  – Simulates processor like Alpha 21464

• All 18 SPEC CPU95 benchmarks used

• Combinations of SPEC benchmarks used for multi-program simulations

• Lockstepped processor simulated with zero fault detection delay (Lock0) and with 8-cycle delay (Lock8)

• SRT architecture simulated with delays for forwarding line predictions and load values
  – Extra delays for CRT architecture


SMT-Efficiency

• SMT-Efficiency (SMT-E) used instead of IPC
  – SMT-E of an individual thread is the IPC of the thread in SMT mode divided by its IPC in single-thread mode on the SMT processor
  – Overall SMT-E is the arithmetic mean of the individual SMT-Efficiencies
  – A. Snavely and D. M. Tullsen, “Symbiotic Job Scheduling for a Simultaneous Multithreading Processor”, ASPLOS 2000
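
The definition above translates directly into a few lines of code. The function name `smt_efficiency` and the IPC values in the example are made up for illustration; they are not results from the paper.

```python
def smt_efficiency(ipc_smt, ipc_single):
    """Return per-thread SMT-Efficiency and its arithmetic mean.

    ipc_smt[i]    -- IPC of thread i when co-scheduled in SMT mode
    ipc_single[i] -- IPC of thread i running alone on the same SMT machine
    """
    per_thread = [smt / alone for smt, alone in zip(ipc_smt, ipc_single)]
    return per_thread, sum(per_thread) / len(per_thread)


# Hypothetical IPCs for two co-scheduled threads.
per_thread, overall = smt_efficiency([1.5, 0.5], [2.0, 2.0])
print(per_thread, overall)  # [0.75, 0.25] 0.5
```

Normalizing each thread by its own single-thread IPC keeps a high-IPC thread from dominating the average, which is the motivation for using this metric rather than raw IPC.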


SMT-Speedup (= SMT-Efficiency)

– Y. Sazeides and T. Juan, “How to Compare the Performance of Two SMT Microarchitectures”, ISPASS 2001


Preferential Space Redundancy

• Without PSR, 65% of instructions execute on the same functional unit as their redundant copy

• With PSR, only 0.06% of instructions run on the same unit

• No performance degradation is experienced
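
The idea behind PSR can be sketched as a steering rule for the trailing copy of an instruction. This is a simplified illustration, not the paper's mechanism: the function `choose_unit`, the unit numbering, and the fallback to the same unit when no alternative is free are all assumptions (the fallback is one plausible reading of "preferential").

```python
def choose_unit(leading_unit, free_units):
    """Pick a functional unit for the trailing copy of an instruction.

    Prefer any free unit other than the one the leading copy used, so a
    permanent fault in one unit cannot corrupt both copies identically.
    """
    for unit in free_units:
        if unit != leading_unit:
            return unit
    # Assumed fallback: reuse the same unit rather than stall.
    return leading_unit if leading_unit in free_units else None


print(choose_unit(0, [0, 1]))  # 1
print(choose_unit(0, [0]))     # 0
```

Falling back instead of stalling would be consistent with the two observations above: a small residue of instructions still shares a unit, and performance does not degrade.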


SRT – One Logical Thread

• SRT is 32% slower than a single thread on SMT

• SRT is 11% faster than running two redundant copies

• Degradation is 30% with per-thread store queue

• Best-case degradation is 26% with oracle store queue


SRT – Two Logical Threads

• Degradation of SRT is 40%

• Per-thread store queue gives 32% degradation

• Store lifetime drops from 44 cycles to vs. 39 for single thread

• Oracle store queue gives 5% better efficiency


Chip-level Redundant Threading

• With one logical thread, CRT performs similarly to lockstepping

• With two logical threads, CRT beats Lock0 and Lock8 by 10% and 2% respectively

• Adding the per-thread store queue causes CRT to beat Lock8 by 13% on average (22% maximum)

• Using an oracle store queue improves performance by 6% more


CRT with Four Logical Threads

• Initial CRT configuration is no better than Lock8

• Adding per-thread store queue gives CRT 13% better performance than Lock8

• Using an oracle store queue improves performance by only another 2%


Conclusions

• The benefits of SRT are not as great as in the original paper when using a detailed model
  – 30% and 32% degradation seen on single-thread and multithreaded workloads

• SRT methods can be used to detect permanent faults

• Chip-level redundant threading gives improved performance over lockstepped processors
  – Overall CRT provided a 13% improvement


Transient Fault Recovery

• AR-SMT suggests that the R-stream could be used as a checkpoint for recovery

• SRT suggests checkpoint/restart or failover
  – Argues that since faults are infrequent, they will have a minor impact on performance
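
A generic checkpoint/restart loop helps show why infrequent faults cost little: only the step in which a mismatch is detected gets re-executed. The sketch below is a software analogy under stated assumptions, not the recovery mechanism of either paper. The function `run_with_rollback`, the injected fault, and the example steps `inc` and `dbl` are all hypothetical.

```python
import copy


def run_with_rollback(initial, steps, inject_fault_at=None):
    """Run each step redundantly; commit only when both copies agree."""
    checkpoint = copy.deepcopy(initial)  # last verified state
    rollbacks = 0
    i = 0
    while i < len(steps):
        lead = steps[i](copy.deepcopy(checkpoint))
        trail = steps[i](copy.deepcopy(checkpoint))
        if i == inject_fault_at and rollbacks == 0:
            lead["x"] += 1000  # simulated transient fault, injected once
        if lead != trail:
            rollbacks += 1     # discard both copies, retry from checkpoint
            continue
        checkpoint = lead      # verified: becomes the new checkpoint
        i += 1
    return checkpoint, rollbacks


def inc(state):
    state["x"] += 1
    return state


def dbl(state):
    state["x"] *= 2
    return state


print(run_with_rollback({"x": 1}, [inc, dbl, inc], inject_fault_at=1))
# ({'x': 5}, 1)
```

In the fault-free case the loop never rolls back, so the only overhead is the redundant execution itself.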


Future Lectures ?

• Hardware Transient Fault Recovery
  – T.N. Vijaykumar, Irith Pomeranz, and Karl Cheng, “Transient-Fault Recovery Using Simultaneous Multithreading”, ISCA 2002
  – Mohamed Gomaa, Chad Scarbrough, T.N. Vijaykumar, and Irith Pomeranz, “Transient-Fault Recovery for Chip Multiprocessors”, ISCA 2003

• Slipstream Processors (an AR-SMT extension)
  – Karthik Sundaramoorthy, Zach Purser, and Eric Rotenberg, “Slipstream Processors: Improving both Performance and Fault Tolerance”, ASPLOS 2000
  – Khaled Z. Ibrahim, Gregory T. Byrd, and Eric Rotenberg, “Slipstream Execution Mode for CMP-Based Multiprocessors”, HPCA 2003