A Scalable Approach to Thread-Level Speculation

J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry

Computer Science Department
Carnegie Mellon University

(Appeared in ISCA 2000.)


Multithreaded Machines Are Everywhere

How can we use them? Parallelism!

[Figure: processors (P) and caches (C) connected to Shared Memory, with hardware Threads. Example machines: SUN MAJC, IBM Power4, ALPHA 21464, Dual Pentium, SGI Origin.]

Automatic Parallelization

Proving independence of threads is hard:
– complex control flow
– complex data structures
– pointers, pointers, pointers
– run-time inputs

How can we make the compiler's job feasible?

Thread-Level Speculation (TLS)


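Example

while (...) {
    x = hash[index1];
    ...
    hash[index2] = y;
    ...
}

[Figure: one processor runs the loop iterations in order over time.]
Iteration 1: = hash[3] … hash[10] = …
Iteration 2: = hash[19] … hash[21] = …
Iteration 3: = hash[33] … hash[30] = …
Iteration 4: = hash[10] … hash[25] = …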


Example of Thread-Level Speculation

[Figure: the four iterations run in parallel over time as epochs, one per processor.]
Epoch 1: = hash[3] … hash[10] = …
Epoch 2: = hash[19] … hash[21] = …
Epoch 3: = hash[33] … hash[30] = …
Epoch 4: = hash[10] … hash[25] = …

Epoch 4 loads hash[10], which epoch 1 stores: Violation!

Epochs that suffer no violation commit; epoch 4 is squashed and retried.
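The violation in this example can be reproduced with a small sketch. It is only an illustration under a stated assumption: all epochs start together and every load runs before any store, so a later epoch that reads a location written by a logically-earlier epoch sees a stale value. The names `EPOCHS` and `find_violations` are hypothetical and not part of the talk.

```python
from typing import List, Tuple

# Each epoch is (index_read, index_written), copied from the slide's example.
EPOCHS: List[Tuple[int, int]] = [(3, 10), (19, 21), (33, 30), (10, 25)]


def find_violations(epochs: List[Tuple[int, int]]) -> List[int]:
    """Return the 1-based epochs that read a location a logically-earlier epoch writes.

    Assumption: loads happen before any store becomes visible, so every such
    read is a read-after-write dependence violation.
    """
    violated = []
    for later, (read_idx, _) in enumerate(epochs):
        for earlier in range(later):
            if epochs[earlier][1] == read_idx:
                violated.append(later + 1)
                break
    return violated
```

With the indices from the slide, only epoch 4 is flagged, because it reads hash[10] while epoch 1 writes hash[10].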


Goals of Our Approach

1) Handle arbitrary memory accesses
– i.e., not just array references

2) Preserve performance of non-speculative workloads
– keep hardware support minimal and simple

3) Apply to any scale of multithreaded architecture
– CMPs, SMT processors, more traditional MPs

effective, simple, and scalable TLS


Overview of Our Approach

System requirements:

1) Detect data dependence violations
• extend invalidation-based cache coherence

2) Buffer speculative modifications
• use the caches as speculative buffers

coherence already works at a variety of scales

hence our scheme is also scalable


Related Schemes

• Wisconsin (Multiscalar, Trace Processor)
• Stanford (Hydra)
• U.P. Catalunya (Speculative Multithreading)
• Intel/U. Portland (Dynamic Multithreading)
• Illinois at U.C. (I-ACOMA)

our approach seamlessly scales both up and down


Outline

• Details of our Approach
– life cycle of an epoch
– speculative coherence
– what happens at commit time
– forwarding data between epochs

• Performance

• Conclusions


Life Cycle of an Epoch

[Figure: timeline of one epoch. Labels: Spawned; Init; Becomes Speculative; Speculative Work; Wait to be Homefree?; Commit?; Fast Commit; Slow Commit; Complete, Pass Homefree.]


Epoch Numbers

Epoch number = Thread Identifier (TID) + Sequence Number

Represent a partial ordering
– signed-compare sequence numbers if TIDs match
  • allows for wrap-around
– otherwise the epochs are unordered
  • from independent programs
  • from independent chains of speculation within one program
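The ordering rule above can be sketched as follows. This is a minimal illustration, not the hardware comparator: the 8-bit sequence width (`SEQ_BITS`) and the names `Epoch` and `compare` are assumptions, since the slides give no field widths.

```python
from typing import NamedTuple, Optional

SEQ_BITS = 8  # hypothetical width; the slides do not specify one


class Epoch(NamedTuple):
    tid: int  # thread identifier
    seq: int  # sequence number in range(2 ** SEQ_BITS)


def compare(a: Epoch, b: Epoch) -> Optional[int]:
    """Return -1 if a is logically earlier, 1 if later, 0 if equal, None if unordered."""
    if a.tid != b.tid:
        return None  # different programs or chains of speculation
    diff = (a.seq - b.seq) % (1 << SEQ_BITS)
    if diff >= 1 << (SEQ_BITS - 1):
        diff -= 1 << SEQ_BITS  # reinterpret the difference as a signed value
    return (diff > 0) - (diff < 0)
```

Because the difference is interpreted as signed, sequence number 254 still compares as earlier than 2 after the counter wraps, as long as the live epochs span less than half the sequence space.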


Speculative Thread Model

Round-robin schedule of epochs to processors
– not a requirement of our scheme, just for convenience

Each epoch spawns the next
– through a lightweight fork instruction (10 cycles)

Violations detected through polling
– each epoch runs to completion before detecting failed speculation and restarting

Violation chaining
– if an epoch suffers a violation, we squash all logically-later epochs

many possibilities to be evaluated in future work


Preserving Correctness

Speculation must fail whenever speculative state is lost
– e.g., replacement of a speculative line, ORB overflow

Any exceptions are suppressed until epoch is homefree
– e.g., divide by zero, segfault

Polling violation detection must avoid infinite looping
– requires a poll inside each loop

No system calls while speculative (for now)

ensures original sequential semantics are preserved



[Figure: epoch life cycle timeline (Spawned, Becomes Speculative, Commit?), highlighting Speculative Coherence and the Mechanisms to Squash or Commit.]


MESI Coherence Example

Setup: Shared Memory holds X=2. Thread A and Thread B each run on a processor with its own cache, and both caches start with the line Invalid (Tag -, State Invalid, Data -).

1) One thread executes Load X: its cache sends a Read, and the Fill installs Tag X, State Excl., Data 2.

2) The other thread executes Store X=3: its cache sends a Read-Exclusive.
– read-exclusive invalidates all other copies

3) The Invalidation reaches the loading thread's cache, and its copy of X becomes Invalid.

4) The Fill reaches the storing thread's cache, which now holds Tag X, State Dirty, Data 3. Shared Memory no longer holds the current value of X.
– the state 'dirty' implies exclusiveness


Speculative Coherence Example

Highlights of our scheme:
– detection of a data dependence violation
– speculatively modified and shared cache lines

Accesses in the example:
Epoch 4: Load X
Epoch 5: Store X=3
Epoch 6: Load X

Setup: Shared Memory holds X=2. The caches of epoch 5 and epoch 6 both start with the line Invalid.

1) Epoch 6 executes Load X: its cache sends a Read, and the Fill installs Tag X, State Excl., Data 2, marked Spec. Loaded.
– track which lines are speculatively loaded

2) Epoch 5 executes Store X=3: its cache sends Sp Read-Ex (epoch5).
– speculative msgs piggyback epoch number

3) Epoch 6's cache receives Sp Inv (epoch5).
– epoch5 < epoch6, and speculatively loaded
– the line is invalidated and speculation fails for epoch 6

4) The Fill reaches epoch 5's cache, which now holds Tag X, State Excl., Data 3, marked Spec. Modified. Shared Memory still holds X=2.
– track which lines are speculatively modified



Speculative Coherence Example (continued)

Accesses in the example: Epoch 4: Load X; Epoch 5: Store X=3; Epoch 6: Load X. The remaining steps show a line that is both speculatively modified and shared.

Setup: Shared Memory holds X=2. Epoch 4's cache has the line Invalid. Epoch 5's cache holds Tag X, State Excl., Data 3, marked Spec. Modified.

1) Epoch 4 executes Load X: its cache sends a Read.

2) Epoch 5's cache is told that the line is now shared (notify shared): its state becomes Shared while it stays Spec. Modified.
– both speculatively modified and shared!

3) The Fill installs Tag X, State Shared, Data 2 in epoch 4's cache, marked Spec. Loaded. Epoch 5's cache still holds Data 3.
– multiple versions of the same cache line
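The two walkthroughs can be condensed into a toy model. This is a sketch under strong simplifying assumptions, not the protocol from the paper: epochs are plain integers compared with `<`, every load fills from memory, and the only message modelled is the speculative invalidation. The names `Line`, `EpochCache`, and `on_spec_invalidate` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Line:
    data: int
    sl: bool = False  # speculatively loaded
    sm: bool = False  # speculatively modified


@dataclass
class EpochCache:
    epoch: int
    lines: Dict[str, Line] = field(default_factory=dict)
    violated: bool = False

    def load(self, addr: str, memory: Dict[str, int]) -> int:
        line = self.lines.get(addr)
        if line is None:
            line = self.lines[addr] = Line(memory[addr])  # fill from memory
        line.sl = True  # remember that this epoch consumed the value
        return line.data

    def store(self, addr: str, value: int, others: List["EpochCache"]) -> None:
        line = self.lines.setdefault(addr, Line(value))
        line.data = value
        line.sm = True  # buffered in the cache; memory is not updated
        for other in others:
            other.on_spec_invalidate(addr, self.epoch)

    def on_spec_invalidate(self, addr: str, sender_epoch: int) -> None:
        line = self.lines.get(addr)
        # Violation: a logically-earlier epoch wrote what we already loaded.
        if line is not None and line.sl and sender_epoch < self.epoch:
            self.violated = True
            del self.lines[addr]
```

Replaying the slides (epoch 6 loads X, epoch 5 stores X=3, epoch 4 loads X) leaves epoch 6 violated, epoch 5 holding the speculative value 3, and epoch 4 holding the older value 2 from memory.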


Summary of New Speculative Line State

New cache line state:
– has it been speculatively loaded?
  • detect dependence violations
– has it been speculatively modified?
  • buffer speculative modifications
– is it in a speculative shared or exclusive state?
  • important performance optimizations

What if a speculative cache line is replaced?
– speculation fails for that epoch


Implementation of Speculative State

[Figure: each cache line (Tag, State, Data) gains two extra bits: SL (Speculatively Loaded) and SM (Speculatively Modified).]

modest amount of extra space



[Figure: epoch life cycle timeline (Spawned, Becomes Speculative, Commit?), highlighting the Squash mechanism.]


When Speculation Fails

Cache lines before the squash (* = any tag or data):

Tag  State    Data  SL  SM
*    Sp Ex    *     1   0
*    Sp Sh    *     1   0
*    Sp Ex    *     0   1
*    Sp Sh    *     1   1

Squash procedure:
– if SM is set, then invalidate the line
– flash reset the SL and SM bits

Cache lines after the squash:

Tag  State    Data  SL  SM
*    Excl     *     0   0
*    Shared   *     0   0
*    Invalid  *     0   0
*    Invalid  *     0   0

quick bit operation
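As a rough software analogue of the squash step, the sketch below applies the two bit operations to a list of lines. It is an illustration only: it keeps the base state ("Excl" or "Shared") separate from the SL and SM bits, whereas the slide draws the combination as Sp Ex or Sp Sh, and the names `Line` and `squash` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    state: str  # "Excl", "Shared", "Dirty", or "Invalid"
    sl: bool    # speculatively loaded
    sm: bool    # speculatively modified


def squash(lines: List[Line]) -> None:
    """Discard speculative modifications, then clear all speculative bits."""
    for line in lines:
        if line.sm:
            line.state = "Invalid"  # the speculative data must not survive
        line.sl = False
        line.sm = False
```

In hardware both steps act on every line at once, which is why the slide calls it a quick bit operation; the loop here only mimics the effect.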



[Figure: epoch life cycle timeline (Spawned, Becomes Speculative, Commit?), highlighting the Commit mechanism.]


When Speculation Succeeds

Cache lines before the commit (* = any tag or data; the last line has tag X):

Tag  State    Data  SL  SM
*    Sp Ex    *     1   0
*    Sp Sh    *     1   0
*    Sp Ex    *     0   1
X    Sp Sh    *     1   1

Commit procedure:
– SM & Exclusive: become Dirty
– SM & Shared: need exclusive access
  • want to avoid searching entire cache
  • the ownership required buffer (ORB) records the tags of these lines (here, X)
– flush the ORB: send Upgrade-Request (X), receive Ack (X)
– if SM, become Dirty; flash reset the SL and SM bits

Cache lines after the commit (the ORB is empty):

Tag  State    Data  SL  SM
*    Excl     *     0   0
*    Shared   *     0   0
*    Dirty    *     0   0
X    Dirty    *     0   0

flush the ORB, then quick bit operations
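The commit sequence can be sketched in the same style. This is a hedged illustration rather than the hardware design: the `upgrade` callback stands in for the Upgrade-Request and Ack exchange, the tags other than X are invented placeholders, and the names `Line` and `commit` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Line:
    state: str        # "Excl", "Shared", "Dirty", or "Invalid"
    sl: bool = False  # speculatively loaded
    sm: bool = False  # speculatively modified


def commit(lines: Dict[str, Line], orb: List[str],
           upgrade: Callable[[str], None]) -> None:
    """Flush the ORB, then make speculative modifications visible."""
    # The ORB lists the lines that are SM but only Shared, so the cache
    # does not have to be searched for them at commit time.
    while orb:
        tag = orb.pop()
        upgrade(tag)               # stand-in for Upgrade-Request / Ack
        lines[tag].state = "Excl"
    # Quick bit operations over every line.
    for line in lines.values():
        if line.sm:
            line.state = "Dirty"
        line.sl = False
        line.sm = False
```

Recording the tag in the ORB at store time is what keeps commit cheap: only the listed lines need coherence traffic, and everything else is a bit operation.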


Forwarding Data Between Epochs

• predictable dependences cause frequent violations
• compiler inserts wait-signal synchronization

[Figure: two epochs over time. Without forwarding, the later epoch's Load X can run before the earlier epoch's Store X. With forwarding, the earlier epoch executes Store X and then Signal, and the later epoch executes Wait and then Load X.]

synchronize to avoid violations
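The wait-signal pattern can be imitated in software with two threads. This is only an analogy under stated assumptions: `threading.Event` stands in for the TLS wait and signal primitives, a dictionary stands in for memory, and the function name `run_with_forwarding` is hypothetical.

```python
import threading


def run_with_forwarding() -> int:
    """Run an earlier and a later epoch; return the value the later one loads."""
    shared = {"X": 0}
    forwarded = threading.Event()
    seen = []

    def earlier_epoch() -> None:
        shared["X"] = 3   # Store X
        forwarded.set()   # Signal

    def later_epoch() -> None:
        forwarded.wait()          # Wait
        seen.append(shared["X"])  # Load X

    later = threading.Thread(target=later_epoch)
    earlier = threading.Thread(target=earlier_epoch)
    later.start()  # started first on purpose: it must still wait
    earlier.start()
    later.join()
    earlier.join()
    return seen[0]
```

Even though the later epoch starts first, it always observes the stored value, which is the point of synchronizing a predictable dependence instead of speculating on it.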


Speculation in a Shared Cache

Why?

1) Shared-cache multithreaded architectures
• e.g., simultaneous multithreading

2) Context switch to another chain of speculation

3) Start new epoch while current epoch waits to commit

How?

replicate the speculative context


Support for Speculation in a Shared Cache

[Figure: the cache keeps a single Tag, State, and Data array, and replicates the SL bits, SM bits, and ORB for each speculative context.]



Outline

• Details of our Approach

• Performance
– simulation infrastructure
– single-chip multiprocessor performance
– scaling beyond chip boundaries

• Conclusions


Simulation Infrastructure

Compiler system and tools based on SUIF
– help analyze dependences, insert synchronization
– produce MIPS binaries containing TLS primitives

Benchmarks (all run to completion)
– buk, compress95, ijpeg, equake

Simulator
– superscalar, similar to MIPS R10K
– models all bandwidth and contention

detailed simulation!

[Figure: simulated chip with processors (P) and caches (C) connected by a Crossbar.]


Pipeline Parameters

Issue Width            4
Functional Units       2 Int, 2 FP, 1 Mem, 1 Branch
Reorder Buffer Size    32
Integer Multiply       12 cycles
Integer Divide         76 cycles
All Other Integer      1 cycle
FP Divide              15 cycles
FP Square Root         20 cycles
All Other FP           2 cycles
Branch Prediction      GShare (16KB, 8 history bits)


Memory Parameters

Cache Line Size                            32B
Instruction Cache                          32KB, 4-way set-assoc
Data Cache                                 32KB, 2-way set-assoc, 2 banks
Unified Secondary Cache                    2MB, 4-way set-assoc, 4 banks
Miss Handlers                              8 for data, 2 for insts
Crossbar Interconnect                      8B per cycle per bank
Minimum Miss Latency to Secondary Cache    10 cycles
Minimum Miss Latency to Local Memory       75 cycles
Main Memory Bandwidth                      1 access per 20 cycles
Intra-Chip Communication Latency           10 cycles
Inter-Chip Communication Latency           200 cycles


Benchmark Details: Regions and Epochs

Each row is one speculative region.

Application   Unrolling Factor   Avg. Insts. per Epoch   Parallel Coverage
buk           8                  81.0                    22.8%
buk           8                  135.0                   33.8%
compress95    1                  196.7                   24.6%
compress95    1                  240.4                   22.7%
ijpeg         32                 1467.9                  8.2%
ijpeg         1                  80.8                    2.2%
ijpeg         1                  84.0                    5.0%
ijpeg         1                  100.3                   6.7%
equake        1                  2925.5                  39.3%


Performance on a 4-Processor CMP

Application   Overall Region Speedup   Parallel Coverage   Program Speedup
buk           2.26                     56.6%               1.46
compress95    1.27                     47.3%               1.12
equake        1.77                     39.3%               1.21
ijpeg         1.94                     22.1%               1.08

program speedups are limited by coverage
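One way to see why coverage limits program speedup is an Amdahl's-law estimate. This is a back-of-the-envelope sketch, not the method used to produce the reported numbers: it assumes the uncovered part of the program runs at its original speed and treats the overall region speedup as uniform. The function name `program_speedup` is hypothetical.

```python
def program_speedup(coverage: float, region_speedup: float) -> float:
    """Amdahl's-law estimate: only the covered fraction runs faster."""
    return 1.0 / ((1.0 - coverage) + coverage / region_speedup)


# (parallel coverage, overall region speedup) from the 4-processor CMP table.
MEASURED = {
    "buk": (0.566, 2.26),
    "compress95": (0.473, 1.27),
    "equake": (0.393, 1.77),
    "ijpeg": (0.221, 1.94),
}

ESTIMATES = {name: program_speedup(c, s) for name, (c, s) in MEASURED.items()}
```

For buk and equake the estimate lands within about 0.01 of the reported program speedups (1.46 and 1.21). The estimate is an approximation and need not reproduce every reported value exactly, for instance when a benchmark has several regions with different speedups.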



[Figure: bar chart of Speedup for buk, compress95, equake, and ijpeg. Region: 2.26, 1.27, 1.77, 1.94. Program: 1.46, 1.12, 1.21, 1.08. Parallel Coverage: 56.6%, 47.3%, 39.3%, 22.1%.]


Varying the Number of Processors

[Figure: Normalized Region Execution Time for each benchmark as the number of processors varies.]

buk and equake are memory-bound

compress95 and ijpeg are computation-intensive

buk and equake scale well

passing the homefree token is not a bottleneck


Performance of the ORB (on a 4-CMP)

Application   Average Flush Latency (cycles)   ORB Size, Average (entries)   ORB Size, Maximum (entries)
buk           13.95                            2.38                          9
compress95    0.04                             0.01                          8
equake        0.13                             0.04                          12
ijpeg         1.06                             0.17                          5

a small ORB is sufficient


Tracking Dependences Per Cache Line

Problem:
– analogous to false sharing: false violations
– write-after-write dependences also cause violations
  • but not a true dependence!

Solution:
– track dependences at a word granularity
– have an SM and SL bit per word in each cache line

is per-word state worth the extra overhead?
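The difference between the two granularities can be shown with a small check. It is a sketch under assumptions: 8 words per line follows from the 32B line size in the simulation parameters combined with an assumed 4-byte word, and the names `line_of` and `violates` are hypothetical.

```python
from typing import Set

WORDS_PER_LINE = 8  # assumption: 32B cache line, 4B words


def line_of(word_addr: int) -> int:
    """Cache line index that contains a given word address."""
    return word_addr // WORDS_PER_LINE


def violates(loaded_words: Set[int], stored_word: int, per_word: bool) -> bool:
    """Does a logically-earlier store hit something a later epoch speculatively loaded?"""
    if per_word:
        return stored_word in loaded_words
    return line_of(stored_word) in {line_of(w) for w in loaded_words}
```

A store to word 1 when only word 0 was loaded is flagged with per-line tracking but not with per-word tracking, which is the false violation described above.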


Tracking Dependences Per Cache Line

Does it do any good?
– not for our 4 benchmarks
– adding this support showed no improvement

Why not?
– buk and equake have random access patterns
– compress95 is heavily synchronized
– ijpeg is unrolled to avoid false sharing

existing techniques for avoiding false sharing can address this problem


Scaling Beyond Chip Boundaries

[Figure: two Nodes connected through Shared Memory; each Node contains processors (P) and caches (C) joined by a Crossbar. Communication between nodes takes 200 Cycles.]

simulate architectures with 1, 2 and 4 nodes


Scaling Beyond Chip Boundaries

[Figure: Normalized Region Execution Time for each benchmark on multi-chip systems.]

multi-chip systems benefit from TLS

our scheme scales well


Conclusions

The overheads of our scheme are low:
– mechanisms to squash or commit are not a bottleneck
– per-word speculative state is not always necessary

It offers compelling performance improvements:
– program speedups from 8% to 46% on a 4-processor CMP
– program speedups up to 75% on multi-chip architectures

It is scalable:
– coherence provides elegant data dependence tracking

seamless TLS on a wide range of architectures
