StageWeb: Interweaving Pipeline Stages into a Wearout and Variation Tolerant CMP Fabric
Shantanu Gupta, Amin Ansari, Shuguang Feng, Scott Mahlke
University of Michigan - Ann Arbor
Advanced Computer Architecture Laboratory
June 29, 2010
2
Reliability Threats
• Transient Faults due to Cosmic Rays & Alpha Particles (increase exponentially with number of devices on chip)
• Silicon Defects (manufacturing defects and device wear-out)
  ► Negative Bias Temperature Instability
  ► Oxide Breakdown
  ► Electromigration
• Process Variation (random and systematic variations)
  ► Intra-die ILD thickness
  ► Speed binning on a die
[Figure: transistor cross-section (source, gate, drain, oxide); grid of cores (C) binned by frequency]
3
Fault Tolerance Aspects
• Detect and Diagnose
  ► Has anything gone wrong?
  ► Figure out the cause
• Reconfigure
  ► Isolate the broken components
• Recover
  ► Resume execution from a safe point
4
Reconfiguring a Multi-core
• At the coarsest level, cores can be disabled.
• Rumors that industry already uses this…
  ► IBM Cell w/ 7 SPEs, AMD Tri-Core
• Can’t scale to higher failure rates!
[Figure: grids of cores (C) shown at Year 1, Year 3, Year 5, and Year 7]
20
StageWeb Benefits
1. Scalability
  ► Scaling StageNet (SN) to benefit 100+ core systems
2. Interconnection Reliability
  ► Handling faults in crossbars and links
3. Process Variation
  ► Slower components can be isolated in a multi-core chip
21
Mitigating Process Variation
• Severe process variation and lifetime wearout can result in a disparity of health for various resources
• StageNet can effectively isolate strong/weak resources
[Figure: four pipelines (Fetch, Decode, Issue, Ex/Mem) with stages regrouped by frequency into Fast, Medium, and Slow pipelines]
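The regrouping idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes a pipeline runs at the frequency of its slowest stage, that any stage of one type can be connected to any stage of the next type, and that the interconnect adds no delay. The names `pipeline_frequency` and `regroup_by_speed` and the frequency values are hypothetical.

```python
STAGES = ("fetch", "decode", "issue", "exmem")

def pipeline_frequency(stage_freqs):
    """A pipeline is assumed to be limited by its slowest stage."""
    return min(stage_freqs.values())

def regroup_by_speed(cores):
    """Rank each stage type by frequency and pair equally ranked stages."""
    ranked = {s: sorted((c[s] for c in cores), reverse=True) for s in STAGES}
    return [{s: ranked[s][i] for s in STAGES} for i in range(len(cores))]

# Hypothetical normalized stage frequencies for four cores.
cores = [
    {"fetch": 1.00, "decode": 0.80, "issue": 1.00, "exmem": 0.95},
    {"fetch": 0.85, "decode": 1.00, "issue": 0.90, "exmem": 1.00},
    {"fetch": 0.95, "decode": 0.90, "issue": 0.80, "exmem": 0.85},
    {"fetch": 0.80, "decode": 0.95, "issue": 0.95, "exmem": 0.80},
]

# Fixed cores: every core is held back by its own slowest stage.
traditional = sorted((pipeline_frequency(c) for c in cores), reverse=True)
# Regrouped pipelines: fast stages end up together, slow stages together.
stageweb = [pipeline_frequency(p) for p in regroup_by_speed(cores)]

print(traditional)  # [0.85, 0.8, 0.8, 0.8]
print(stageweb)     # [1.0, 0.95, 0.85, 0.8]
```

With these made-up numbers, the slow stages are concentrated in one pipeline instead of each slowing down a different core.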
22
Evaluation
• OpenRISC 1200 cores (4-stage in-order)
• 12 configurations compared, 64 cores each
  ► Interconnections: Single, Single + Front/Back, Overlapping, Overlapping + Front/Back
  ► Crossbar types: W/O spares, W/ spares, Fault-tolerant
• Experiments
  ► Lifetime evaluations - throughput and total work
  ► Process variation - speed binning on a die
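One plausible reading of the labels on this slide is that the 12 configurations are the cross product of four interconnection schemes and three crossbar types. The sketch below enumerates that product. The 4 x 3 split is an assumption inferred from the slide labels, not something the slide states outright, and the constant names are hypothetical.

```python
from itertools import product

# Assumed grouping of the slide labels (not confirmed by the slide itself).
INTERCONNECTIONS = (
    "Single",
    "Single + Front/Back",
    "Overlapping",
    "Overlapping + Front/Back",
)
CROSSBAR_TYPES = ("W/O spares", "W/ spares", "Fault-tolerant")

configurations = list(product(INTERCONNECTIONS, CROSSBAR_TYPES))

for interconnection, crossbar in configurations:
    print(f"{interconnection:26s} | {crossbar}")
print(len(configurations), "configurations")
```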
23
Lifetime Reliability Evaluations
• Monte Carlo simulation with 300+ lifetime experiments
• Each lifetime experiment involves:
  ► Assigning a time-to-failure to all stages
  ► Killing components at their failure times
  ► Reconfiguring the system to isolate broken components
  ► Repeating this until no logical pipeline can be formed
• Cumulative work and throughput are recorded
  ► Number of cores: 64
  ► Technology node: 90 nm
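The experiment loop described above can be sketched as follows. This is a simplified model under stated assumptions, not the evaluation setup behind the slides: time-to-failure is drawn from an exponential distribution with a made-up mean, interconnect faults are ignored, stage pooling is idealized (any surviving stage can join any pipeline), and each working pipeline contributes one unit of throughput. The names `lifetime_experiment` and `monte_carlo` are hypothetical.

```python
import random

STAGES = ("fetch", "decode", "issue", "exmem")

def lifetime_experiment(num_cores, mean_ttf, rng):
    """Run one lifetime; return (work with core disabling, work with stage pooling)."""
    # Assign a time-to-failure to every stage (the distribution is an assumption).
    ttf = [{s: rng.expovariate(1.0 / mean_ttf) for s in STAGES}
           for _ in range(num_cores)]
    failure_times = sorted(t for core in ttf for t in core.values())

    work_core = work_pool = 0.0
    now = 0.0
    for t in failure_times:
        # Core disabling: a core works only while all of its own stages are alive.
        cores_alive = sum(all(v > now for v in core.values()) for core in ttf)
        # Idealized pooling: pipelines are limited by the scarcest stage type.
        pipelines = min(sum(core[s] > now for core in ttf) for s in STAGES)
        if pipelines == 0:
            break  # no logical pipeline can be formed any more
        work_core += cores_alive * (t - now)
        work_pool += pipelines * (t - now)
        now = t  # the component failing at time t is dead from here on
    return work_core, work_pool

def monte_carlo(num_experiments=300, num_cores=64, mean_ttf=10.0, seed=0):
    """Average cumulative work over many lifetime experiments."""
    rng = random.Random(seed)
    runs = [lifetime_experiment(num_cores, mean_ttf, rng)
            for _ in range(num_experiments)]
    avg_core = sum(r[0] for r in runs) / num_experiments
    avg_pool = sum(r[1] for r in runs) / num_experiments
    return avg_core, avg_pool

avg_core, avg_pool = monte_carlo()
print(f"core disabling: {avg_core:.1f}  stage pooling: {avg_pool:.1f}")
```

In this model, stage pooling can never do less cumulative work than core disabling, because every fully working core is also a valid pipeline.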
27
Mitigating Process Variation
[Figure: number of cores (0 to 16) versus normalized frequency (0.73 to 1.0), Traditional CMP vs. StageWeb CMP]
For a given frequency target, StageWeb can operate:
1. More cores, OR
2. Same # of cores at lower voltage
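The "more cores for a given frequency target" claim can be illustrated by counting pipelines that meet a target with and without regrouping. The sketch below is an assumption-laden toy, not the experiment behind the plot: stage frequencies are drawn uniformly from the plotted range, regrouping is unconstrained across the whole chip, and `cores_meeting_target` is a hypothetical helper.

```python
import random

NUM_STAGES = 4  # Fetch, Decode, Issue, Ex/Mem

def cores_meeting_target(stage_freqs, target):
    """Return (traditional, regrouped) counts of pipelines running at >= target."""
    # Traditional CMP: a core qualifies only if all of its own stages qualify.
    traditional = sum(all(f >= target for f in core) for core in stage_freqs)
    # Regrouped: qualifying stages are pooled, so the scarcest stage type limits the count.
    regrouped = min(sum(core[s] >= target for core in stage_freqs)
                    for s in range(NUM_STAGES))
    return traditional, regrouped

rng = random.Random(1)
# Hypothetical 64-core chip; the uniform distribution is an assumption.
chip = [[rng.uniform(0.73, 1.0) for _ in range(NUM_STAGES)] for _ in range(64)]

for target in (0.76, 0.82, 0.88, 0.94):
    trad, regr = cores_meeting_target(chip, target)
    print(f"target {target:.2f}: traditional {trad:2d}, regrouped {regr:2d}")
```

The regrouped count is never lower than the traditional one, since each qualifying core already supplies one qualifying stage of every type.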
28
Conclusions
• Architectural innovations will be crucial in tackling technological uncertainties
• StageWeb is a potential solution
  ► Allows fine-grained isolation of failures
  ► Most reliability gains from grouping 8-10 pipelines
  ► Scalable to 100+ cores
• StageWeb can also mitigate process variation by grouping faster parts together and slower parts together
29
Thank You
http://cccp.eecs.umich.edu
30
Backup Slides
31
Impact of Defects on CMP Yield
32
Overlapping Network
33
Simple + 2nd Level Crossbars
34
Overlapping + 2nd Level Crossbar