Eliminating Squashes Through Learning Cross-Thread Violations in Speculative Parallelization for Multiprocessors
Marcelo Cintra and Josep Torrellas
University of Edinburgh, http://www.dcs.ed.ac.uk/home/mc
University of Illinois at Urbana-Champaign, http://iacoma.cs.uiuc.edu
Intl. Symp. on High Performance Computer Architecture - February 2002
Speculative Parallelization
Assume no dependences and execute threads in parallel
Track data accesses at run-time
Detect violations
Squash offending threads and restart them

for (i=0; i<100; i++) {
  … = A[L[i]] + …
  A[K[i]] = …
}

Iteration J:    … = A[4] + …    A[5] = ...
Iteration J+1:  … = A[2] + …    A[2] = ...
Iteration J+2:  … = A[5] + …    A[6] = ...
RAW: A[5] is written by iteration J and read by iteration J+2
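The cross-iteration RAW dependence in this example can be found mechanically from the index arrays. The following is a minimal software sketch, not the hardware scheme from the talk: the function name `find_raw_violations` is hypothetical, and iteration J is numbered 0 purely for illustration.

```python
def find_raw_violations(reads, writes):
    """Return (producer, consumer, address) triples for cross-iteration RAW dependences.

    reads[i] and writes[i] are the array indices that iteration i reads and
    writes (one read followed by one write per iteration, as in the loop).
    A RAW dependence exists when an earlier iteration writes an address that
    a later iteration reads; under speculation it becomes a violation if the
    later iteration's read executes before the earlier iteration's write.
    """
    violations = []
    last_writer = {}  # address -> most recent earlier iteration that wrote it
    for i in range(len(reads)):
        addr = reads[i]
        if addr in last_writer:
            violations.append((last_writer[addr], i, addr))
        last_writer[writes[i]] = i
    return violations


# Index values from the slide, with iteration J taken as 0 (an assumption).
L = [4, 2, 5]  # J reads A[4], J+1 reads A[2], J+2 reads A[5]
K = [5, 2, 6]  # J writes A[5], J+1 writes A[2], J+2 writes A[6]
print(find_raw_violations(L, K))  # [(0, 2, 5)]
```

Iteration J+1 reads and then writes A[2] within the same iteration, so it does not appear in the result; only the J to J+2 dependence on A[5] crosses threads.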
Squashing in Speculative Parallelization
Speculative Parallelization: Threads may get squashed
Dependence violations are statically unpredictable
False sharing may cause further squashes
Squashing
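[Figure: execution timeline for threads i (producer), i+j (consumer), i+j+1, and i+j+2, showing the producer's write (Wr) and the consumer's read (Rd). Legend: useful work, possibly correct work, wasted correct work, squash overhead. Axis: time.]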
Squashing is very costly
Contribution: Eliminate Squashes
Framework of hardware mechanisms to eliminate squashes
Based on learning and prediction of violations
Improvement: average speedup of 4.3 over plain squash-and-retry for 16 processors
Outline
Motivation and Background
Overview of Framework
Mechanisms
Implementation
Evaluation
Related Work
Conclusions
Types of Dependence Violations

Type                            | Avoiding Squashes                     | Mechanism
No dependence                   | Disambiguate line addresses           | Plain Speculation
False                           | Disambiguate word addresses           | Delay&Disambiguate
Same-word, predictable value    | Compare values produced and consumed  | ValuePredict
Same-word, unpredictable value  | Stall thread, release when safe       | Stall&Release, Stall&Wait
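The four types can be told apart by comparing progressively finer information: line address, then word address, then value. The sketch below illustrates that ordering only; the function `classify`, the byte-address representation, and the 64-byte line and 4-byte word sizes are assumptions made for the example, not parameters given in the talk.

```python
LINE_BYTES = 64  # assumed memory line size
WORD_BYTES = 4   # assumed word size


def classify(write_addr, written_value, read_addr, value_used):
    """Classify a producer write against a consumer read (byte addresses)."""
    if write_addr // LINE_BYTES != read_addr // LINE_BYTES:
        return "No dependence"  # different lines
    if write_addr // WORD_BYTES != read_addr // WORD_BYTES:
        return "False"  # same line, different words (false sharing)
    if written_value == value_used:
        return "Same-word, predictable value"  # consumer used the right value
    return "Same-word, unpredictable value"


print(classify(0, 1, 64, 1))  # No dependence
print(classify(0, 1, 4, 1))   # False
print(classify(8, 7, 8, 7))   # Same-word, predictable value
print(classify(8, 7, 8, 3))   # Same-word, unpredictable value
```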
Learning and Predicting Violations
Monitor violations at directory
Remember data lines causing violations
Count violations and choose mechanism
[Figure: state diagram over the mechanisms Plain Speculation, Delay&Disambiguate, ValuePredict, Stall&Release, and Stall&Wait, with transitions labeled "potential violation", "violation and squash", "violations and squashes", and "age".]
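One way to picture "count violations and choose mechanism" is a small per-line table that escalates on each violation and ages quiet entries out. This is a loose software sketch under assumptions of my own: the class `ViolationPredictor`, the one-step-per-violation escalation, and the `max_age` reset policy are hypothetical and are not the transition rules of the actual framework.

```python
MECHANISMS = ["Plain Speculation", "Delay&Disambiguate", "ValuePredict",
              "Stall&Release", "Stall&Wait"]


class ViolationPredictor:
    """Toy per-line predictor kept at the directory (hypothetical policy)."""

    def __init__(self, max_age=3):
        self.level = {}  # line address -> index into MECHANISMS
        self.idle = {}   # line address -> ticks since the last violation
        self.max_age = max_age

    def mechanism(self, line):
        # Lines never seen causing a violation use plain speculation.
        return MECHANISMS[self.level.get(line, 0)]

    def record_violation(self, line):
        # Assumed policy: move one step toward the most conservative mechanism.
        self.level[line] = min(self.level.get(line, 0) + 1, len(MECHANISMS) - 1)
        self.idle[line] = 0

    def tick(self):
        # Aging: forget lines that stayed quiet for max_age ticks.
        for line in list(self.level):
            self.idle[line] += 1
            if self.idle[line] >= self.max_age:
                del self.level[line]
                del self.idle[line]


p = ViolationPredictor(max_age=2)
p.record_violation(0x40)
print(p.mechanism(0x40))  # Delay&Disambiguate
```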
Outline
Motivation and Background
Overview of Framework
Mechanisms
Implementation
Evaluation
Related Work
Conclusions
Delay&Disambiguate
Assume potential violation is false
Let speculative read proceed
Remember unresolved potential violation
Perform a late or delayed per-word disambiguation when consumer thread becomes non-speculative
– No per-word access information at directory
– No word addresses on memory operations
Squash if same-word violation
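The steps above can be sketched in software as a pending list that is filled at line granularity and resolved at word granularity later. The class `DelayDisambiguate`, its method names, and the use of Python sets to stand in for per-word access bits are all assumptions made for illustration.

```python
class DelayDisambiguate:
    """Toy model: line-level detection now, word-level check at commit time."""

    def __init__(self):
        # Unresolved potential violations: (line, producer, consumer).
        self.pending = []

    def on_potential_violation(self, line, producer, consumer):
        # Assume the violation is false: remember it and let the read proceed.
        self.pending.append((line, producer, consumer))

    def on_non_speculative(self, consumer, words_written, words_read):
        """Late per-word check; return True if the consumer must be squashed.

        words_written and words_read map (thread, line) to the set of word
        offsets touched, standing in for per-word access bits in the caches.
        """
        squash = False
        remaining = []
        for line, producer, cons in self.pending:
            if cons != consumer:
                remaining.append((line, producer, cons))
                continue
            written = words_written.get((producer, line), set())
            read = words_read.get((consumer, line), set())
            if written & read:  # same word written and read: true violation
                squash = True
        self.pending = remaining
        return squash


d = DelayDisambiguate()
d.on_potential_violation(0x100, producer=1, consumer=3)
# Producer wrote word 0, consumer read word 2: false sharing, no squash.
print(d.on_non_speculative(3, {(1, 0x100): {0}}, {(3, 0x100): {2}}))  # False
```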
Delay&Disambiguate (Successful)
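[Figure: execution timeline for threads i (producer), i+j (consumer), i+j+1, and i+j+2, showing the producer's write (Wr), the consumer's read (Rd), and the delayed disambiguation overhead. Legend: useful work. Axis: time.]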
Delay&Disambiguate (Unsuccessful)
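[Figure: execution timeline for threads i (producer), i+j (consumer), i+j+1, and i+j+2, showing the producer's write (Wr), the consumer's read (Rd), and the delayed disambiguation overhead. Legend: useful work, possibly correct work, wasted correct work, squash overhead. Axis: time.]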
ValuePredict
Predict value based on past observed values
– Assume value is the same as last value written (last-value prediction, value reuse, or silent store)
– More complex predictors are possible
Provide predicted value to consumer thread
Remember predicted value
Compare predicted value against correct value when consumer thread becomes non-speculative
Squash if mispredicted value
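A last-value predictor of the kind described here fits in a few lines. This is a minimal sketch under assumptions: the class `LastValuePredictor` and its method names are hypothetical, and plain dictionaries stand in for whatever hardware storage holds the observed and predicted values.

```python
class LastValuePredictor:
    """Toy last-value predictor with verification at commit time."""

    def __init__(self):
        self.last_written = {}  # address -> last value observed being written
        self.handed_out = {}    # (thread, address) -> predicted value supplied

    def observe_write(self, addr, value):
        self.last_written[addr] = value

    def predict(self, thread, addr):
        # Supply the last written value to the consumer and remember it.
        value = self.last_written.get(addr)
        self.handed_out[(thread, addr)] = value
        return value

    def must_squash(self, thread, addr, correct_value):
        # Called when the consumer thread becomes non-speculative.
        predicted = self.handed_out.pop((thread, addr))
        return predicted != correct_value


p = LastValuePredictor()
p.observe_write(0x20, 7)
print(p.predict(3, 0x20))         # 7
print(p.must_squash(3, 0x20, 7))  # False: the producer rewrote the same value
```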
Stall&Release
Stall consumer thread when attempting to read data
Release consumer thread when a predecessor commits modified copy of the data to memory
– Provided no intervening thread has a modified version
Squash released thread if a later violation is detected
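The release condition can be written as a single predicate over the thread order. In this sketch threads are identified by integers in sequential order (smaller means earlier), and `can_release` and its parameters are hypothetical names; how the real hardware tracks which threads hold modified versions is not modeled.

```python
def can_release(consumer, committing_thread, modifiers):
    """Decide whether a stalled consumer may resume after a commit.

    modifiers is the set of threads that hold a modified version of the data
    (including the committing thread, whose copy has just reached memory).
    """
    if committing_thread >= consumer or committing_thread not in modifiers:
        return False  # not a predecessor, or it did not modify the data
    # An intervening thread lies strictly between the committer and the consumer.
    return not any(committing_thread < t < consumer for t in modifiers)


print(can_release(consumer=5, committing_thread=2, modifiers={2}))     # True
print(can_release(consumer=5, committing_thread=2, modifiers={2, 3}))  # False
```

A released thread is still speculative, which is why the slide keeps the squash as a fallback if a later violation is detected.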
Stall&Release (Successful)
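[Figure: execution timeline for threads i (producer), i+j (consumer), i+j+1, and i+j+2, showing the producer's write (Wr), the consumer's read (Rd), and the stall overhead. Legend: useful work. Axis: time.]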
Stall&Wait
Stall consumer thread when attempting to read data
Release consumer thread only when it becomes non-speculative
Outline
Motivation and Background
Overview of Framework
Mechanisms
Implementation
Evaluation
Related Work
Conclusions
Baseline Architecture
Conventional support for speculative parallelization:
Speculation module with per-line access bits
Per-word access bits in caches
[Figure: node with processor + caches, memory, and directory controller, connected to the network.]
Speculation Module
One entry per memory line speculatively touched
Load and Store bits per line per thread
Mapping of threads to processors and ordering of threads
Entry fields: Line Tag | Valid Bit | Load Bits | Store Bits
Global Memory Disambiguation Table (GMDT) (Cintra, Martínez, and Torrellas, ISCA’00)
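The entry fields listed above map naturally onto a small record with one bit per thread. The sketch below is loosely modeled on those fields and is not the GMDT design itself: the names `SpecEntry` and `SpeculationModule`, the bitmask encoding, and the line-granularity check in `store` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SpecEntry:
    line_tag: int
    valid: bool = True
    load_bits: int = 0   # bit t set: thread t speculatively loaded the line
    store_bits: int = 0  # bit t set: thread t speculatively stored to the line


class SpeculationModule:
    """Toy table with one entry per memory line speculatively touched."""

    def __init__(self, thread_order):
        # Ordering of threads: position in sequential order, earliest first.
        self.order = {t: pos for pos, t in enumerate(thread_order)}
        self.entries = {}

    def _entry(self, tag):
        return self.entries.setdefault(tag, SpecEntry(tag))

    def load(self, thread, tag):
        self._entry(tag).load_bits |= 1 << thread

    def store(self, thread, tag):
        """Record a store; return successor threads that already loaded the line."""
        entry = self._entry(tag)
        entry.store_bits |= 1 << thread
        return [t for t in self.order
                if self.order[t] > self.order[thread]
                and (entry.load_bits >> t) & 1]


m = SpeculationModule(thread_order=[0, 1, 2, 3])
m.load(2, 0x10)
print(m.store(1, 0x10))  # [2]: thread 2 read the line before thread 1 wrote it
```

Because the bits are kept per line, a hit here is only a potential violation; telling false sharing from a same-word dependence needs the per-word access bits held in the caches.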