CS717
Hardware Fault Tolerance Through Simultaneous Multithreading (part 3)
Jonathan Winter
Jan 01, 2016
SMT + Fault Tolerance Papers
• Eric Rotenberg, "AR-SMT - A Microarchitectural Approach to Fault Tolerance in Microprocessors", Symposium on Fault-Tolerant Computing, 1999.
• Steven K. Reinhardt and Shubhendu S. Mukherjee, "Transient Fault Detection via Simultaneous Multithreading", ISCA 2000.
• Shubhendu S. Mukherjee, Michael Kontz and Steven K. Reinhardt, "Detailed Design and Evaluation of Redundant Multithreading Alternatives", ISCA 2002.
Outline
1. Background
• SMT
• Hardware fault tolerance
2. AR-SMT
• Basic mechanisms
• Implementation issues
• Simulation and Results
3. Transient Fault Detection via SMT
• Sphere of replication
• Basic mechanisms
• Comparison to AR-SMT
• Simulation and Results
4. Redundant Multithreading Alternatives
• Realistic processor implementation
• CRT
• Simulation and Results
5. Fault Recovery
6. Future Lectures?
Sphere of Replication
• Size of sphere of replication
– Two alternatives: with and without the register file
– Instruction and data caches kept outside
Redundant Multithreading Alternatives
• Discusses real-world fault-tolerant processors
• Evaluates SRT on a more realistic and detailed processor than the previous paper
• Proposes Chip-level Redundant Threading (CRT)
• Detailed simulation results with a new metric
– Relative SMT-Efficiency
Real World SMT, CMP, and FT
• Simulated processor based on Compaq Alpha Araña (a.k.a. 21464 or EV8)
• IBM Power4 and HP Mako are 2-way CMPs
• Compaq Himalaya uses multi-chip lockstepping
• IBM S/390 G5 uses on-chip lockstepping
Detailed Processor Description
• 8-way SMT with 4 hardware contexts
• IBOX fetches chunks of 8 instructions and forwards them to the PBOX
• Complex branch prediction mechanism
– Line predictor
– Branch predictor, jump target predictor, and return address stack
Detailed Processor Description (part 2)
• PBOX performs initial processing
– Register renaming and partial decoding
– Maintains tables for recovery from mispredictions
• QBOX issues instructions out of order to the EBOX, FBOX, or MBOX
– Retires instructions and commits architectural state in program order
– Consists of instruction queue, in-flight table, and completion unit
• MBOX conducts loads and stores
– Load and store queues divided between threads
– Available queue space is very small per thread
SRT on Detailed Processor
• Input replication uses LVQ variant that allows out-of-order load issue from trailing thread
• Output comparison is the same as in the original SRT design
– A suggested improvement is a per-thread store queue (SQ)
• PBOX storage structures made per-thread to avoid deadlock situations
• Branch outcome queue converted to line prediction queue
• Preferential space redundancy (PSR) implemented to better cover permanent faults
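As a rough illustration of the input replication and output comparison listed above, here is a minimal software sketch. It is not the hardware design from the paper: the class and method names (RedundantPair, leading_load, and so on) are hypothetical, memory is modeled as a plain dictionary, and the load value queue (LVQ) is a strict FIFO, whereas the variant described on the slide allows out-of-order load issue from the trailing thread.

```python
from collections import deque


class FaultDetected(Exception):
    """Raised when the leading and trailing threads disagree."""


class RedundantPair:
    """Toy model of one leading/trailing thread pair (hypothetical names)."""

    def __init__(self, memory):
        self.memory = memory            # dict; stands in for memory outside the sphere
        self.lvq = deque()              # load value queue: (address, value) pairs
        self.pending_stores = deque()   # leading-thread stores awaiting comparison

    def leading_load(self, addr):
        # Only the leading thread reads memory; the value is saved for the trailer.
        value = self.memory.get(addr, 0)
        self.lvq.append((addr, value))
        return value

    def trailing_load(self, addr):
        # The trailing thread takes its value from the LVQ, never from memory,
        # so both threads see the same input even if memory changed in between.
        # (Real hardware would stall on an empty queue; this sketch would raise.)
        lead_addr, value = self.lvq.popleft()
        if lead_addr != addr:
            raise FaultDetected(f"load address mismatch: {lead_addr:#x} vs {addr:#x}")
        return value

    def leading_store(self, addr, value):
        # The store is held back until the trailing thread produces its copy.
        self.pending_stores.append((addr, value))

    def trailing_store(self, addr, value):
        # Output comparison: the store leaves the sphere only if both copies match.
        lead = self.pending_stores.popleft()
        if lead != (addr, value):
            raise FaultDetected(f"store mismatch: {lead} vs {(addr, value)}")
        self.memory[addr] = value
```

In this sketch a mismatch on either a load address or a store raises FaultDetected before anything is written to memory, which mirrors the idea that only checked values leave the sphere of replication.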
Chip-level Redundant Threading
• Each core executes the leading thread of one program and the trailing thread of a different program
• LVQ and line prediction queue must forward data to other processor’s trailing thread
• Store buffer must receive retired stores from other processor for comparison
CRT Advantages
• CRT checks much less information to detect faults than lockstepped processors
• Lockstep fault detection circuitry is on critical path for cache misses
• CRT executes threads more efficiently because of SMT dynamic scheduling on each processor
Simulation Environment
• Asim performance model framework used
– Simulates a processor like the Alpha 21464
• All 18 SPEC CPU95 benchmarks used
• Combinations of SPEC benchmarks used for multi-program simulations
• Lockstepped processor simulated with zero fault detection delay (Lock0) and with 8-cycle delay (Lock8)
• SRT architecture simulated with delays for forwarding line predictions and load values
– Extra delays for CRT architecture
SMT-Efficiency
• SMT-Efficiency (SMT-E) used instead of IPC
– SMT-E of an individual thread is the IPC of the thread in SMT mode divided by its IPC in single-thread mode on the same SMT processor
– Overall SMT-E is the arithmetic mean of the individual SMT-Efficiencies
– A. Snavely and D. M. Tullsen, “Symbiotic Job Scheduling for a Simultaneous Multithreading Processor”, ASPLOS 2000
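The SMT-Efficiency definition above can be written out as a short calculation. This is only a sketch of the arithmetic: the function name smt_efficiency and the IPC values in the usage lines are made up for illustration and are not results from the paper.

```python
def smt_efficiency(ipc_smt, ipc_single):
    """Return (overall SMT-E, per-thread SMT-E list).

    ipc_smt[i]    : IPC of thread i when running together with the other threads
    ipc_single[i] : IPC of thread i when running alone on the same SMT processor
    """
    if len(ipc_smt) != len(ipc_single) or not ipc_smt:
        raise ValueError("need one SMT-mode and one single-thread IPC per thread")
    per_thread = [smt / single for smt, single in zip(ipc_smt, ipc_single)]
    overall = sum(per_thread) / len(per_thread)   # arithmetic mean
    return overall, per_thread


# Hypothetical IPC values, chosen only to make the arithmetic easy to follow.
overall, per_thread = smt_efficiency(ipc_smt=[1.5, 0.5], ipc_single=[2.0, 2.0])
print(per_thread)  # [0.75, 0.25]
print(overall)     # 0.5
```

Because each thread is normalized by its own single-thread IPC before averaging, a high-IPC benchmark does not dominate the overall figure the way it would in a raw IPC sum.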
SMT-Speedup (= SMT-Efficiency)
– Y. Sazeides and T. Juan, “How to Compare the Performance of Two SMT Microarchitectures”, ISPASS 2001
Preferential Space Redundancy
• Without PSR, 65% of instructions execute on same functional unit
• With PSR, only 0.06% of instructions run on the same unit
• No performance degradation is experienced
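The "preferential" part of PSR can be illustrated with a small selection rule: the trailing copy of an instruction prefers any free functional unit other than the one its leading copy used, and falls back to the same unit only when nothing else is free. This is a simplified sketch under that assumption; the function name pick_unit and the representation of units as integers are hypothetical and do not describe the actual issue logic of the simulated processor.

```python
def pick_unit(free_units, leading_unit):
    """Choose a functional unit for the trailing copy of an instruction.

    free_units   : list of unit ids that can accept an instruction this cycle
    leading_unit : unit id that executed the leading copy
    Returns a unit id, or None if no unit is free.
    """
    # Prefer spatial diversity: any free unit other than the leading copy's unit.
    for unit in free_units:
        if unit != leading_unit:
            return unit
    # Fall back to the same unit rather than stalling (a preference, not a guarantee).
    if leading_unit in free_units:
        return leading_unit
    return None
```

For example, pick_unit([0, 1], leading_unit=0) returns 1, while pick_unit([0], leading_unit=0) returns 0. The fallback case is consistent with a small fraction of instructions still running on the same unit even with PSR enabled.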
SRT – One Logical Thread
• SRT is 32% slower than a single thread on SMT
• SRT is 11% faster than running two redundant copies
• Degradation is 30% with a per-thread store queue
• Best-case degradation is 26% with an oracle store queue
SRT – Two Logical Threads
• Degradation of SRT is 40%
• Per-thread store queue gives 32% degradation
• Store lifetime drops from 44 cycles to vs. 39 for single thread
• Oracle store queue gives 5% better efficiency
Chip-level Redundant Threading
• With one logical thread, CRT performs similarly to lockstepping
• With two logical threads, CRT beats Lock0 and Lock8 by 10% and 2% respectively
• Adding the per-thread store queue causes CRT to beat Lock8 by 13% on average (22% maximum)
• Using an oracle store queue improves performance by 6% more
CS717
CRT with Four Logical Threads
• Initial CRT configuration is no better than Lock8
• Adding per-thread store queue gives CRT 13% better performance than Lock8
• Using an oracle store queue improves performance by only another 2%
Conclusions
• The benefits of SRT are not as great as in the original paper when using a detailed model
– 30% and 32% degradation seen on single-thread and multithreaded workloads
• SRT methods can be used to detect permanent faults
• Chip-level redundant threading gives improved performance over lockstepped processors
– Overall, CRT provided a 13% improvement
Transient Fault Recovery
• AR-SMT suggests that the R-stream could be used as a checkpoint for recovery
• SRT suggests checkpoint/restart or failover
– Argues that since faults are infrequent, they will have a minor impact on performance
Future Lectures?
• Hardware Transient Fault Recovery
– T.N. Vijaykumar, Irith Pomeranz, and Karl Cheng, “Transient-Fault Recovery Using Simultaneous Multithreading”, ISCA 2002
– Mohamed Gomaa, Chad Scarbrough, T.N. Vijaykumar, and Irith Pomeranz, “Transient-Fault Recovery for Chip Multiprocessors”, ISCA 2003
• Slipstream Processors (an AR-SMT extension)
– Karthik Sundaramoorthy, Zach Purser, and Eric Rotenberg, “Slipstream Processors: Improving both Performance and Fault Tolerance”, ASPLOS 2000
– Khaled Z. Ibrahim, Gregory T. Byrd, and Eric Rotenberg, “Slipstream Execution Mode for CMP-Based Multiprocessors”, HPCA 2003