Thread-Level Speculation: Towards Ubiquitous Parallelism
Greg Steffan
School of Computer Science
Carnegie Mellon University
Moore’s Law: the Original Version
[Figure: log of transistors on a chip vs. time]
exponentially increasing resources
Moore’s Law: the Popular Interpretation
Lower development cost:
– stamp out processor cores
Lower power:
– turn off idle processors
Tolerate defects:
– disable any faulty processor
many advantages
[Figure: Chip Multiprocessor (CMP) with processors (P) and caches (C)]
Multithreading in Every Scale of Machine
– Supercomputers
– Desktops
– Chip Multiprocessor (CMP): IBM Power4, SUN MAJC, Sibyte SB-1250
– Simultaneous-Multithreading: ALPHA 21464, Intel Xeon
multithreading on a chip!
Improving Performance with a Chip Multiprocessor
Multiprogramming Workload:
[Figure: applications running on the processors (P) and caches (C) of a chip multiprocessor; axis: Execution Time]
improves throughput
Improving Performance with a Chip Multiprocessor
Single Application:
[Figure: a single application on a chip multiprocessor (P = processor, C = cache); axis: Exec. Time]
need parallel threads to reduce execution time
How Do We Parallelize Everything?
1) Programmers write parallel code from now on
– time-consuming and frustrating
– very hard to get right
– not a broad solution
2) System parallelizes automatically
– no burden on the programmer
– parallelize any application
automatic parallelization is preferred
Current Technique: Prove Independence
Independent:
for (i = 0; i < N; i++) A[i] = 0;
A[0] = 0; A[1] = 0; A[2] = 0
Dependent:
for (i = 1; i < N; i++) A[i] = A[i-1];
A[1] = A[0]; A[2] = A[1]; A[3] = A[2]
need to fully understand data access pattern
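One way to see the difference between the two loops is to run their iterations in a different order, which is roughly what parallel execution is free to do. The sketch below is illustrative only: the helper names and the array size `N` are assumptions, not part of the original slides.

```c
#include <string.h>

#define N 8  /* illustrative array size (assumption) */

/* Independent loop: iteration i touches only A[i]. */
void zero_forward(int *A) { for (int i = 0; i < N; i++) A[i] = 0; }
void zero_reverse(int *A) { for (int i = N - 1; i >= 0; i--) A[i] = 0; }

/* Dependent loop: iteration i reads the value written by iteration i-1. */
void copy_forward(int *A) { for (int i = 1; i < N; i++) A[i] = A[i - 1]; }
void copy_reverse(int *A) { for (int i = N - 1; i >= 1; i--) A[i] = A[i - 1]; }

/* Returns 1 if the forward and reversed loops produce the same array
 * from the same input (1, 2, ..., N), and 0 otherwise. */
int order_insensitive(void (*fwd)(int *), void (*rev)(int *))
{
    int a[N], b[N];
    for (int i = 0; i < N; i++) a[i] = b[i] = i + 1;
    fwd(a);
    rev(b);
    return memcmp(a, b, sizeof a) == 0;
}
```

Here `order_insensitive(zero_forward, zero_reverse)` returns 1, while `order_insensitive(copy_forward, copy_reverse)` returns 0. Passing this check for one input and one reordering does not prove independence; a compiler has to establish it for every possible execution, which is why it needs to fully understand the access pattern.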
Ubiquitous Parallelization: How Close Are We?
Compiler can parallelize portions of numeric programs
– scientific, floating-point, array-based codes
– usually written in Fortran
What about everything else?
– general-purpose, integer codes
– written in C, C++, Java, etc.
– little (if any) success so far
parallelize by proving independence
proving independence is infeasible
The Main Culprit: Indirection
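A minimal sketch of how indirection can hide dependences, assuming a hypothetical loop that updates an array through an index array `idx`. Whether two iterations touch the same element depends on the run-time contents of `idx`, so independence cannot be proven from the source alone. Both function names are illustrative and do not come from the slides.

```c
#include <stddef.h>

/* Hypothetical loop with indirection: the element each iteration
 * touches is known only once idx[] has been filled in at run time. */
void scatter_increment(int *A, const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        A[idx[i]] += 1;
}

/* Returns 1 if any two iterations of the loop above would touch the
 * same element (a cross-iteration dependence), 0 otherwise. It needs
 * the actual idx[] values, which a compiler does not have. */
int has_cross_iteration_dependence(const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (idx[i] == idx[j])
                return 1;
    return 0;
}
```

With `idx = {0, 1, 2, 3}` the iterations are independent; with `idx = {0, 2, 2, 3}` two of them update the same element. The same source code therefore yields either case depending on its input.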
We need the next big performance win
– instruction-level parallelism will run out of gas
Multithreading will soon be everywhere
– we need automatically-parallelized programs
The scope of current techniques is extremely limited
– proving independence is infeasible
A solution: Thread-Level Speculation (TLS)
Thread-Level Speculation: the Basic Idea
[Figure: Exec. Time under TLS: a speculative load "… = *q" runs in parallel with a store "*p = …"; a violation triggers Recover, after which "… = *q" re-executes]
exploit available thread-level parallelism
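The control flow in the figure can be emulated sequentially as a rough sketch. The code below assumes just two threads, one store and one load, and detects the conflict by comparing pointers in software; the struct and function names are hypothetical, and real TLS would perform the check in hardware.

```c
/* Outcome of running the two threads under emulated speculation. */
struct tls_result {
    int value;     /* value finally produced by the later thread */
    int violated;  /* 1 if the speculative load had to be redone */
};

/* The earlier thread stores through p; the later thread loads
 * through q. The later thread's load is emulated as running first. */
struct tls_result run_two_threads(int *p, int *q, int new_value)
{
    struct tls_result r = {0, 0};

    int speculative = *q;  /* later thread: speculative load, too early */
    *p = new_value;        /* earlier thread: the store it raced with */

    if (p == q) {          /* same location: read-after-write violated */
        r.violated = 1;
        speculative = *q;  /* recover: re-execute the load */
    }
    r.value = speculative;
    return r;
}
```

When `p` and `q` point to different locations the speculative load stands and the two threads overlap for free. When they alias, the load is repeated and the result matches sequential execution.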
Outline
The Software/Hardware Sweet Spot
• Compiler:
  – break programs into speculative threads
    • why: compiler has a global view of control flow
• Hardware:
  – track data dependences
    • why: software comparison of all addresses infeasible
  – recover from failed speculation
    • why: software buffering of all writes infeasible
important: minimize additional hardware
Outline
• The Software/Hardware Sweet Spot
• Improving Value Communication
• Conclusions
Goals
1) Handle arbitrary memory accesses
– i.e. not just array references
2) Preserve single-thread performance
– keep hardware support minimal and simple
3) Apply to any scale of multithreaded architecture
– within a chip and beyond
effective, simple, scalable
Requirements
1) Recover from failed speculation
• buffer speculative writes from memory
2) Track data dependences
• detect data dependence violations
each has several implementation options
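The first requirement can be sketched in software as a small write buffer that keeps speculative stores away from memory until the outcome of speculation is known. This is only an illustration under assumed names (`write_buffer`, `spec_store`, `spec_load`, `spec_commit`, `spec_squash`) and an assumed fixed capacity; the slides go on to argue for doing this in hardware.

```c
#include <stddef.h>

#define BUF_CAP 16  /* illustrative capacity (assumption) */

struct spec_write { int *addr; int value; };

struct write_buffer {
    struct spec_write entries[BUF_CAP];
    size_t count;
};

/* Record a speculative store without touching memory.
 * Returns 0 on success, -1 if the buffer is full. */
int spec_store(struct write_buffer *wb, int *addr, int value)
{
    if (wb->count == BUF_CAP) return -1;
    wb->entries[wb->count].addr = addr;
    wb->entries[wb->count].value = value;
    wb->count++;
    return 0;
}

/* Load that sees this thread's own buffered stores first (newest wins). */
int spec_load(const struct write_buffer *wb, const int *addr)
{
    for (size_t i = wb->count; i > 0; i--)
        if (wb->entries[i - 1].addr == addr)
            return wb->entries[i - 1].value;
    return *addr;
}

/* Speculation succeeded: apply the buffered stores to memory in order. */
void spec_commit(struct write_buffer *wb)
{
    for (size_t i = 0; i < wb->count; i++)
        *wb->entries[i].addr = wb->entries[i].value;
    wb->count = 0;
}

/* Speculation failed: discard the buffered stores; memory is untouched. */
void spec_squash(struct write_buffer *wb) { wb->count = 0; }
```

A store followed by `spec_squash` leaves memory exactly as it was, which is what makes recovery possible. A full buffer (return value -1) is the point at which a real system would have to stall or give up speculating.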
Recover From Failed Speculation: Option 1
Augment the store buffer:
+ common device in superscalar processors
Recover From Failed Speculation: Option 2
Add a new dedicated buffer
+ can design an efficient speculation mechanism
– want to avoid large speculation-specific structures
Recover From Failed Speculation: Option 3
Augment the cache
+ very common structure
+ relatively large
just maintain single-thread performance
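As a rough illustration of the cache-based option, each cache line could carry one extra bit marking it as speculatively written. The layout below is a hypothetical sketch: the field `spec_modified`, the tiny direct-indexed cache, and the function names are all assumptions, not the design from the slides.

```c
#include <stddef.h>

#define LINES 4  /* illustrative cache size (assumption) */

struct cache_line {
    int valid;
    int spec_modified;  /* hypothetical per-line bit: written speculatively */
    int data;
};

struct cache { struct cache_line line[LINES]; };

/* Speculative store: update the line and mark it speculative. */
void spec_write_line(struct cache *c, size_t idx, int value)
{
    c->line[idx].valid = 1;
    c->line[idx].spec_modified = 1;
    c->line[idx].data = value;
}

/* Failed speculation: drop every speculatively written line. */
void cache_squash(struct cache *c)
{
    for (size_t i = 0; i < LINES; i++)
        if (c->line[i].spec_modified) {
            c->line[i].valid = 0;
            c->line[i].spec_modified = 0;
        }
}

/* Successful speculation: the lines become ordinary cached data. */
void cache_commit(struct cache *c)
{
    for (size_t i = 0; i < LINES; i++)
        c->line[i].spec_modified = 0;
}
```

This sketch simplifies one important case: if a line already holds non-speculative modified data, that data would need to be written back before a speculative store overwrites it, or a squash would lose it.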
Tracking Data Dependences: Option 1
Add a dedicated “3rd-party” entity
– want to avoid large speculation-specific structures
– does not scale
[Figure: two processors (P) with caches (C) and a separate Dependence Tracker that observes Load X and Store X; violation detected at the tracker]
Tracking Data Dependences: Option 2
Detection at the producer
• producer informed of all addresses consumed
– awkward: producer must notify consumer of any violation
[Figure: Producer (Store X) and Consumer (Load X), each a processor (P) with a cache (C); the load address is sent to the producer, where the violation is detected]
Tracking Data Dependences: Option 3
Detection at the consumer
• consumers informed of all addresses produced
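Consumer-side detection can be sketched as follows: the consumer remembers which addresses it has loaded speculatively and checks each address reported by a producer against that set. The structure and function names are hypothetical, the fixed-size table is an assumption, and hardware would track this state per cache line instead of in a list.

```c
#include <stddef.h>

#define MAX_LOADS 32  /* illustrative capacity (assumption) */

struct consumer {
    const void *loaded[MAX_LOADS];  /* addresses read speculatively */
    size_t n_loaded;
    int violated;
};

/* Consumer side: remember each speculatively loaded address.
 * Returns 0 on success, -1 if the table is full. */
int note_spec_load(struct consumer *c, const void *addr)
{
    if (c->n_loaded == MAX_LOADS) return -1;
    c->loaded[c->n_loaded++] = addr;
    return 0;
}

/* Called when a logically earlier thread (the producer) stores to addr.
 * If the consumer already read that address, it read a stale value. */
void on_producer_store(struct consumer *c, const void *addr)
{
    for (size_t i = 0; i < c->n_loaded; i++)
        if (c->loaded[i] == addr)
            c->violated = 1;
}
```

Because the check runs where the stale value was read, the consumer learns of the violation directly and no separate message from the producer back to the consumer is needed.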
Simulator
– superscalar, a modernized MIPS R10K
– models all bandwidth and contention
detailed simulation!
[Figure: processors (P) and caches (C) connected by a crossbar]
Will it Work at All of These Scales?
– Supercomputers
– Desktops
– Chip Multiprocessor (CMP)
– Simultaneous-Multithreading
yes: coherence scales up and down
Performance on Multi-Chip Systems
our scheme is scalable
Performance on General-Purpose Applications
significant performance improvements
Outline
The Software/Hardware Sweet Spot