A Scalable Approach to Thread-Level Speculation
J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry
Computer Science Department, Carnegie Mellon University
(Appeared in ISCA 2000.)

Multithreaded Machines Are Everywhere
[Figure: processors (P) with caches (C) connected to shared memory, running threads. Examples: Sun MAJC, IBM Power4, Alpha 21464, dual Pentium, SGI Origin.]
How can we use them? Parallelism!

How can we make the compiler's job feasible?
Thread-Level Speculation (TLS)

Example

while (...) {
    x = hash[index1];
    ...
    hash[index2] = y;
    ...
}

Sequential execution on one processor (time runs downward):
    = hash[3]  ... hash[10] = ...
    = hash[19] ... hash[21] = ...
    = hash[33] ... hash[30] = ...
    = hash[10] ... hash[25] = ...

Example of Thread-Level Speculation

Each epoch (loop iteration) runs on its own processor; time runs downward.

    Epoch 1 (Processor 1):  = hash[3]  ... hash[10] = ...  commit?
    Epoch 2 (Processor 2):  = hash[19] ... hash[21] = ...  commit?
    Epoch 3 (Processor 3):  = hash[33] ... hash[30] = ...  commit?
    Epoch 4 (Processor 4):  = hash[10] ... hash[25] = ...  commit?

Violation! Epoch 4 loads hash[10], which the logically-earlier epoch 1 stores to, so epoch 4 cannot commit.

    Epoch 4 (Retry):        = hash[10] ... hash[25] = ...  commit?
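
The violation in this example can be sketched in C. This is an illustrative software model, not the hardware mechanism from the talk: the `Epoch` struct, its time fields, and `find_violation` are hypothetical names, and the timestamps a caller supplies are assumed values that stand in for the relative timing drawn on the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one epoch (loop iteration) from the slide:
 * it loads hash[load_idx] at load_time and stores to hash[store_idx]
 * at store_time. The times are assumed values for illustration. */
typedef struct {
    int load_idx;
    int load_time;
    int store_idx;
    int store_time;
} Epoch;

/* Returns the index of the logically-earliest epoch that suffers a
 * read-after-write violation, or -1 if every epoch can commit.
 * Epoch j is violated when a logically-earlier epoch i stores to the
 * location that j loads, and j's load happened before i's store. */
int find_violation(const Epoch *epochs, size_t n)
{
    assert(epochs != NULL || n == 0);
    for (size_t j = 1; j < n; j++) {
        for (size_t i = 0; i < j; i++) {
            if (epochs[i].store_idx == epochs[j].load_idx &&
                epochs[j].load_time < epochs[i].store_time) {
                return (int)j;
            }
        }
    }
    return -1;
}
```

With the four epochs from the slide, and every load assumed to happen before every store, the function reports the fourth epoch (index 3). If that epoch is retried so that its load follows epoch 1's store, no violation remains.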

Goals of Our Approach
1) Handle arbitrary memory accesses
   – i.e., not just array references
2) Preserve performance of non-speculative workloads
   – keep hardware support minimal and simple
3) Apply to any scale of multithreaded architecture
   – CMPs, SMT processors, more traditional MPs
effective, simple, and scalable TLS

Overview of Our Approach
System requirements:
1) Detect data dependence violations
   • extend invalidation-based cache coherence
2) Buffer speculative modifications
   • use the caches as speculative buffers
coherence already works at a variety of scales
hence our scheme is also scalable

• Illinois at U.C. (I-ACOMA)
our approach seamlessly scales both up and down

Outline
• Details of our Approach
   – life cycle of an epoch
   – speculative coherence
   – what happens at commit time
   – forwarding data between epochs
• Performance
• Conclusions

Life Cycle of an Epoch
[Figure: timeline of an epoch. Events: Spawned, Becomes Speculative, Commit?, Complete/Pass Homefree. Phases: Init, Speculative Work, Wait to be Homefree?. Two variants are shown: Slow Commit and Fast Commit.]

Epoch Numbers
Format: Thread Identifier (TID) | Sequence Number
Represent a partial ordering
– signed-compare sequence numbers if TIDs match
   • allows for wrap-around
– otherwise the epochs are unordered
   • from independent programs
   • from independent chains of speculation within one program
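
The partial ordering of epoch numbers can be sketched in C as follows. The slide only states that an epoch number holds a TID and a sequence number; the 16-bit field widths and the names `EpochNum`, `EpochOrder`, and `epoch_compare` are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: field widths are assumed, not given in the talk. */
typedef struct {
    uint16_t tid;   /* thread identifier */
    uint16_t seq;   /* sequence number, allowed to wrap around */
} EpochNum;

typedef enum {
    EPOCH_EARLIER,
    EPOCH_SAME,
    EPOCH_LATER,
    EPOCH_UNORDERED
} EpochOrder;

/* Orders epoch a relative to epoch b. Epochs with different TIDs are
 * unordered. For matching TIDs, testing the top bit of the wrapped
 * 16-bit difference is equivalent to a signed compare of the sequence
 * numbers, so the result stays correct across wrap-around as long as
 * the two epochs are fewer than 2^15 apart. */
EpochOrder epoch_compare(EpochNum a, EpochNum b)
{
    if (a.tid != b.tid) {
        return EPOCH_UNORDERED;
    }
    uint16_t diff = (uint16_t)(a.seq - b.seq);
    if (diff == 0) {
        return EPOCH_SAME;
    }
    return (diff & 0x8000u) ? EPOCH_EARLIER : EPOCH_LATER;
}
```

For instance, under these assumptions sequence number 65535 compares as earlier than sequence number 2 within the same thread, because the counter has wrapped in between.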

Speculative Thread Model
Round-robin schedule of epochs to processors
– not a requirement of our scheme, just for convenience
Each epoch spawns the next
– through a lightweight fork instruction (10 cycles)
Violations detected through polling
– each epoch runs to completion before detecting failed speculation and restarting
Violation chaining
– if an epoch suffers a violation, we squash all logically-later epochs
many possibilities to be evaluated in future work
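
Round-robin scheduling and violation chaining can be sketched in C. This is a software illustration only; `processor_for_epoch` and `chain_squash` are hypothetical names, and treating the in-flight epochs as an array ordered from logically earliest to latest is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Round-robin schedule: epoch k runs on processor k mod num_procs. */
int processor_for_epoch(int epoch, int num_procs)
{
    assert(epoch >= 0 && num_procs > 0);
    return epoch % num_procs;
}

/* Violation chaining: when the in-flight epoch at position `violated`
 * fails, it restarts and every logically-later in-flight epoch is
 * squashed as well. Position 0 is the logically-earliest epoch.
 * must_rerun[i] is set for each epoch that has to re-execute.
 * Returns how many epochs re-execute. */
size_t chain_squash(bool *must_rerun, size_t in_flight, size_t violated)
{
    assert(must_rerun != NULL && violated < in_flight);
    size_t count = 0;
    for (size_t i = 0; i < in_flight; i++) {
        must_rerun[i] = (i >= violated);
        if (must_rerun[i]) {
            count++;
        }
    }
    return count;
}
```

With four epochs in flight and a violation in the second one, three epochs re-execute while the first is unaffected.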

Preserving Correctness
Speculation must fail whenever speculative state is lost
– e.g., replacement of a speculative line, ORB overflow
Any exceptions are suppressed until epoch is homefree
– e.g., divide by zero, segfault
Polling violation detection must avoid infinite looping
– requires a poll inside each loop
No system calls while speculative (for now)
ensures original sequential semantics are preserved

Life Cycle of an Epoch
[Figure: epoch timeline (Spawned, Becomes Speculative, Commit?, Complete/Pass Homefree) annotated with Speculative Coherence and Mechanisms to Squash or Commit.]

MESI Coherence Example
Two threads, A and B, each run on a processor with its own cache. Each cache line has a Tag, a State, and Data.
1) Initially shared memory holds X=2 and neither cache holds X (both lines Invalid).
2) One thread executes Load X. Its cache misses and sends a Read to shared memory.
3) The Fill returns X=2. The loading cache now holds Tag X, State Excl., Data 2.
4) The other thread executes Store X=3. Its cache sends a Read-Exclusive.
   – read-exclusive invalidates all other copies
5) The Invalidation removes the loading cache's copy of X (State Invalid).
6) The Fill completes the store. The storing cache now holds Tag X, State Dirty, Data 3; shared memory no longer holds the current value of X.
   – the state 'dirty' implies exclusiveness
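
The load and store sequence in this example can be reproduced with a small C model. This is a simplified sketch of textbook invalidation-based MESI for one memory word and two caches, not the simulator used in the talk; `System`, `sys_init`, `cache_load`, and `cache_store` are hypothetical names.

```c
#include <assert.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, DIRTY } LineState;

#define NUM_CACHES 2

/* One memory word X and one cache line per processor. */
typedef struct {
    int memory;
    LineState state[NUM_CACHES];
    int data[NUM_CACHES];
} System;

void sys_init(System *s, int x)
{
    s->memory = x;
    for (int p = 0; p < NUM_CACHES; p++) {
        s->state[p] = INVALID;
        s->data[p] = 0;
    }
}

/* Load X on processor p. A miss sends a Read: a dirty owner writes
 * back, every other holder drops to Shared, and the line is filled. */
int cache_load(System *s, int p)
{
    assert(p >= 0 && p < NUM_CACHES);
    if (s->state[p] == INVALID) {
        int others = 0;
        for (int q = 0; q < NUM_CACHES; q++) {
            if (q == p || s->state[q] == INVALID) {
                continue;
            }
            if (s->state[q] == DIRTY) {
                s->memory = s->data[q];   /* write back */
            }
            s->state[q] = SHARED;
            others = 1;
        }
        s->data[p] = s->memory;           /* Fill */
        s->state[p] = others ? SHARED : EXCLUSIVE;
    }
    return s->data[p];
}

/* Store X=value on processor p. The Read-Exclusive invalidates all
 * other copies; the storing line becomes Dirty. */
void cache_store(System *s, int p, int value)
{
    assert(p >= 0 && p < NUM_CACHES);
    for (int q = 0; q < NUM_CACHES; q++) {
        if (q == p) {
            continue;
        }
        if (s->state[q] == DIRTY) {
            s->memory = s->data[q];       /* write back before invalidating */
        }
        s->state[q] = INVALID;
    }
    s->data[p] = value;
    s->state[p] = DIRTY;
}
```

Starting from X=2, a load on one cache leaves it Exclusive with data 2; a store of 3 on the other cache then leaves the storer Dirty and the loader Invalid, with memory still holding 2.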

Speculative Coherence Example
Highlights of our scheme:
– detection of a data dependence violation
– has it been speculatively modified?
   • buffer speculative modifications
– is it in a speculative shared or exclusive state?
   • important performance optimizations
What if a speculative cache line is replaced?
– speculation fails for that epoch

Implementation of Speculative State
[Figure: a processor's cache with Tag, State, and Data fields per line, extended with two extra bits per line.]
SL: Speculatively Loaded
SM: Speculatively Modified
modest amount of extra space

Life Cycle of an Epoch
[Figure: epoch timeline (Spawned, Becomes Speculative, Commit?, Complete/Pass Homefree) annotated with Speculative Coherence and Mechanisms to Squash or Commit; Squash is highlighted.]

When Speculation Fails

Cache before the squash (Tag and Data hold arbitrary values, shown as * on the slide):

    Line   State    SL   SM
    1      Sp Ex    1    0
    2      Sp Sh    1    0
    3      Sp Ex    0    1
    4      Sp Sh    1    1

If SM is set then invalidate the line; flash reset the SL and SM bits.

Cache after the squash:

    Line   State     SL   SM
    1      Excl      0    0
    2      Shared    0    0
    3      Invalid   0    0
    4      Invalid   0    0

quick bit operation
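
The squash rule (if SM is set then invalidate, then flash reset) can be sketched in C. The names `CacheLine` and `squash` are hypothetical, and the slide's "Sp Ex" and "Sp Sh" states are modelled here as the base states Exclusive and Shared with SL or SM set, which is an assumption of this sketch. Hardware would apply the operation to all lines at once; the loop only models the result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, DIRTY } LineState;

/* "Sp Ex" / "Sp Sh" are modelled as EXCLUSIVE / SHARED with a
 * speculative bit set (assumed encoding). */
typedef struct {
    LineState state;
    bool sl;   /* speculatively loaded */
    bool sm;   /* speculatively modified */
} CacheLine;

/* Squash a failed epoch: discard every speculatively modified line,
 * then clear all SL and SM bits. Returns the number of lines that
 * were invalidated. */
size_t squash(CacheLine *lines, size_t n)
{
    assert(lines != NULL || n == 0);
    size_t invalidated = 0;
    for (size_t i = 0; i < n; i++) {
        if (lines[i].sm) {
            lines[i].state = INVALID;   /* buffered data is wrong */
            invalidated++;
        }
        lines[i].sl = false;            /* flash reset */
        lines[i].sm = false;
    }
    return invalidated;
}
```

Applied to the four lines on the slide, the two lines with SM set become Invalid and the other two revert to plain Exclusive and Shared.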

Life Cycle of an Epoch
[Figure: epoch timeline (Spawned, Becomes Speculative, Commit?, Complete/Pass Homefree) annotated with Speculative Coherence and Mechanisms to Squash or Commit; Commit is highlighted.]

When Speculation Succeeds

Cache before the commit (Tag and Data hold arbitrary values, shown as * on the slide; line 4 has tag X):

    Line   Tag   State    SL   SM
    1      *     Sp Ex    1    0
    2      *     Sp Sh    1    0
    3      *     Sp Ex    0    1
    4      X     Sp Sh    1    1

SM & Exclusive: become Dirty
SM & Shared: need exclusive access
– want to avoid searching entire cache
– ownership required buffer (ORB) records these lines; here the ORB holds X

Commit sequence:
1) For each ORB entry, send Upgrade-Request (X) and wait for Ack (X); the ORB is then empty.
2) If SM, become Dirty; flash reset the SL and SM bits.

Cache after the commit:

    Line   Tag   State     SL   SM
    1      *     Excl      0    0
    2      *     Shared    0    0
    3      *     Dirty     0    0
    4      X     Dirty     0    0

flush the ORB, then quick bit operations
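
The commit sequence (flush the ORB, then quick bit operations) can be sketched in C. All names here (`Orb`, `ORB_CAPACITY`, `spec_store`, `commit`) are hypothetical, the ORB capacity is an arbitrary assumed value, and the upgrade request and its acknowledgement are collapsed into a single state change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, DIRTY } LineState;

typedef struct {
    LineState state;
    bool sl;   /* speculatively loaded */
    bool sm;   /* speculatively modified */
} CacheLine;

#define ORB_CAPACITY 4   /* assumed size, not taken from the talk */

/* Ownership required buffer: indices of lines that were speculatively
 * modified while only Shared, so commit need not search the cache. */
typedef struct {
    size_t entry[ORB_CAPACITY];
    size_t count;
} Orb;

/* Speculative store to a line already present in the cache. Returns
 * false on ORB overflow, in which case speculation must fail. */
bool spec_store(CacheLine *lines, size_t idx, Orb *orb)
{
    assert(lines != NULL && orb != NULL);
    if (!lines[idx].sm && lines[idx].state == SHARED) {
        if (orb->count == ORB_CAPACITY) {
            return false;
        }
        orb->entry[orb->count++] = idx;
    }
    lines[idx].sm = true;
    return true;
}

/* Commit a successful epoch. Returns the number of upgrade requests. */
size_t commit(CacheLine *lines, size_t n, Orb *orb)
{
    assert(lines != NULL && orb != NULL);
    size_t upgrades = orb->count;
    for (size_t i = 0; i < orb->count; i++) {
        lines[orb->entry[i]].state = EXCLUSIVE;   /* Upgrade-Request + Ack */
    }
    orb->count = 0;                               /* ORB flushed */
    for (size_t i = 0; i < n; i++) {
        if (lines[i].sm) {
            lines[i].state = DIRTY;               /* If SM, become Dirty */
        }
        lines[i].sl = false;                      /* flash reset */
        lines[i].sm = false;
    }
    return upgrades;
}
```

In this model only the Shared line that was speculatively modified costs a coherence message at commit time; the Exclusive one becomes Dirty with a local bit operation.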

Forwarding Data Between Epochs
• predictable dependences cause frequent violations

Compiler system and tools based on SUIF
– help analyze dependences, insert synchronization
– produce MIPS binaries containing TLS primitives
Benchmarks (all run to completion)
– buk, compress95, ijpeg, equake
Simulator
– superscalar, similar to MIPS R10K
– models all bandwidth and contention
detailed simulation!
[Figure: processors (P) and caches (C) connected by a Crossbar.]

Pipeline Parameters

    Issue Width            4
    Functional Units       2 Int, 2 FP, 1 Mem, 1 Branch
    Reorder Buffer Size    32
    Integer Multiply       12 cycles
    Integer Divide         76 cycles
    All Other Integer      1 cycle
    FP Divide              15 cycles
    FP Square Root         20 cycles
    All Other FP           2 cycles
    Branch Prediction      GShare (16KB, 8 history bits)

Memory Parameters

    Cache Line Size                            32B
    Instruction Cache                          32KB, 4-way set-assoc
    Data Cache                                 32KB, 2-way set-assoc, 2 banks
    Unified Secondary Cache                    2MB, 4-way set-assoc, 4 banks
    Miss Handlers                              8 for data, 2 for insts
    Crossbar Interconnect                      8B per cycle per bank
    Minimum Miss Latency to Secondary Cache    10 cycles
    Minimum Miss Latency to Local Memory       75 cycles
    Main Memory Bandwidth                      1 access per 20 cycles
    Intra-Chip Communication Latency           10 cycles
    Inter-Chip Communication Latency           200 cycles

Benchmark Details: Regions and Epochs

Tracking Dependences Per Cache Line
Problem:
– analogous to false sharing: false violations
– write-after-write dependences also cause violations
   • but not a true dependence!
Solution:
– track dependences at a word granularity
– have an SM and SL bit per word in each cache line
is per-word state worth the extra overhead?
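
The difference between per-line and per-word tracking can be illustrated in C. This sketch assumes 8 words per line (the 32B line size from the memory parameters with an assumed 4B word) and uses hypothetical names; `later_sl_mask` stands for the per-word SL bits of a logically-later epoch, and `stored_word` for the word an earlier epoch writes in the same line.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WORDS_PER_LINE 8   /* 32B line / 4B word; word size is assumed */

/* Per-line tracking: any speculative load anywhere in the line makes
 * an earlier epoch's store to that line look like a violation. */
bool violation_per_line(uint8_t later_sl_mask, int stored_word)
{
    assert(stored_word >= 0 && stored_word < WORDS_PER_LINE);
    (void)stored_word;   /* the line-level check ignores the word */
    return later_sl_mask != 0;
}

/* Per-word tracking: only a load of the very word that was stored
 * counts, which removes the false violations. */
bool violation_per_word(uint8_t later_sl_mask, int stored_word)
{
    assert(stored_word >= 0 && stored_word < WORDS_PER_LINE);
    return ((later_sl_mask >> stored_word) & 1u) != 0;
}
```

If the later epoch loaded only word 0 and the earlier epoch stores word 5, the per-line check reports a (false) violation while the per-word check does not.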

Tracking Dependences Per Cache Line
Does it do any good?
– not for our 4 benchmarks
– adding this support showed no improvement
Why not?
– buk and equake have random access patterns
– compress95 is heavily synchronized
– ijpeg is unrolled to avoid false sharing
existing techniques for avoiding false sharing can address this problem

Conclusions
The overheads of our scheme are low:
– mechanisms to squash or commit are not a bottleneck
– per-word speculative state is not always necessary
It offers compelling performance improvements:
– program speedups from 8% to 46% on a 4-processor CMP
– program speedups up to 75% on multi-chip architectures
It is scalable:
– coherence provides elegant data dependence tracking