CDE: A Compiler-driven, Dependence-centric, Eager-executing architecture for the billion transistor era
Carmelo Acosta
Sriram Vajapeyam
Alex Ramirez
Mateo Valero
UPC-Barcelona
Motivation
Entering the billion transistor era
• How to use the available HW to increase performance
  • Maintain cost and complexity under control
  • Obtain a true general-purpose architecture
    • Do not limit high performance to a single application class
• Clustered architectures seem the way to go
  • Avoid excessive dependence on the compiler
  • Avoid impossible communication delays
  • Avoid complex interconnection networks
• Hierarchical program partitioning
  • Both in the compiler and the hardware
Outline
• Motivation
• The CDE architecture
  • Hierarchical program partitioning
    • Epochs
    • Selective eager execution
    • Dependence clusters
  • Hierarchical architecture
    • Epoch Processing Core (EPC)
    • Processing Elements (PE)
  • Program execution
• Related work
• Summary and conclusions
The CDE architecture
The way CDE obtains performance
• Rely on the compiler for code partitioning
  • Hierarchical program view
  • Matching hierarchical hardware
• Use both run-time and compile-time speculation to keep the transistors occupied
How to achieve it
• The Dependence Cluster (DC) is the basic execution unit
  • Larger than one instruction
    • Larger virtual instruction window
  • Reduces communication
  • Amortizes speculation costs
    • Commit, squash, and redo an entire DC
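A minimal sketch of what "commit, squash, and redo an entire DC" could look like in hardware bookkeeping terms. The struct and function names here are hypothetical (the slides do not define an interface); the point is that recovery state changes are paid once per cluster, not once per instruction.

```c
#define MAX_DC_INSNS 8

enum dc_state { DC_IDLE, DC_EXECUTING, DC_DONE, DC_SQUASHED };

/* A Dependence Cluster as the basic unit of speculation. */
struct dep_cluster {
    int insns[MAX_DC_INSNS];  /* opaque instruction ids */
    int n_insns;
    enum dc_state state;
};

/* Squash the entire DC with one state change: no per-instruction
 * walk is needed to recover from a misspeculation. */
void dc_squash(struct dep_cluster *dc) { dc->state = DC_SQUASHED; }

/* Redo: re-dispatch the same cluster; its instructions are unchanged. */
void dc_redo(struct dep_cluster *dc) { dc->state = DC_EXECUTING; }

/* Commit the whole cluster at once, amortizing speculation cost. */
void dc_commit(struct dep_cluster *dc) { dc->state = DC_DONE; }
```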
Hierarchical program partitioning
Horizontal control epochs

    /* check the environment list */
    for (fp = xlenv; fp; fp = cdr(fp))
        for (ep = car(fp); ep; ep = cdr(ep))
            if (sym == car(car(ep)))
                return (cdr(car(ep)));

    /* return the global value */
    return (getvalue(sym));
}

a) Source code
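To make the snippet above self-contained, here is a compilable reconstruction with hypothetical cons-cell helpers and a stubbed `getvalue` (none of these definitions come from the slides). Each outer-loop iteration visits one environment frame and is a natural place to cut a horizontal control epoch.

```c
#include <stddef.h>

/* Tiny cons cells standing in for the Lisp implementation's. */
struct cell { struct cell *car_, *cdr_; };
static struct cell *car(struct cell *c) { return c ? c->car_ : NULL; }
static struct cell *cdr(struct cell *c) { return c ? c->cdr_ : NULL; }

/* Stub for the global-value lookup; hypothetical here. */
static struct cell *getvalue(struct cell *sym) { (void)sym; return NULL; }

/* The slide's snippet: search each environment frame (outer loop)
 * for a (sym . value) binding. One frame visit = one epoch body. */
struct cell *env_lookup(struct cell *xlenv, struct cell *sym) {
    struct cell *fp, *ep;
    /* check the environment list */
    for (fp = xlenv; fp; fp = cdr(fp))
        for (ep = car(fp); ep; ep = cdr(ep))
            if (sym == car(car(ep)))
                return cdr(car(ep));
    /* return the global value */
    return getvalue(sym);
}
```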
Eager execution
Traditional trace scheduling
• Bet on one direction
• Optimize the frequent case
• Generate fix-up code for the infrequent case
Eager execution
• Remove the branch
• Optimize each separate case
• Squash the incorrect trace
[Figure: a hard-to-predict branch compiled two ways — trace scheduling yields an optimized trace plus fix-up code; eager execution removes the branch and executes both paths]
Alex Ramirez
Watch out for code growth. Use the ability of CDE code to do code distribution and code placement.
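A toy illustration of the eager-execution idea in plain C (this is not CDE code, just an analogy): both arms of a hard-to-predict branch are computed as independent dependence chains, and a final select keeps one result while the other chain is effectively squashed.

```c
/* Eager execution of abs(x): compute both arms, then select.
 * The two arms are independent chains that could run on separate
 * PEs; the select plays the role of squashing the wrong trace. */
int eager_abs(int x) {
    int taken_path    = -x;   /* arm for the x < 0 case  */
    int fallthru_path =  x;   /* arm for the x >= 0 case */
    int cond = (x < 0);
    /* "squash" the wrong arm with a select instead of a branch */
    return cond ? taken_path : fallthru_path;
}
```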
Dependence clusters
• Essentially a set of dependent instructions
• May have dependencies with other DCs in the same epoch

Epoch Processing Core (EPC)
• Fetches and processes epochs one at a time
• Speculatively branches to the next epoch
  • Epoch-level sequencing
  • Epoch-level speculation
• Renames live-ins and live-outs of each epoch
  • Out-of-order epoch execution
• Dispatches the DCs to the PE grid
  • Coupled with the required data about the epoch
  • Renaming of live-ins and live-outs
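A hypothetical sketch of epoch-level renaming in the EPC (the slides name the mechanism but not an interface): only an epoch's live-ins and live-outs are renamed, so two epoch instances that write the same architectural register get distinct physical tags, breaking false dependences and allowing out-of-order epoch execution.

```c
#define NUM_ARCH_REGS 32

struct epc_renamer {
    int map[NUM_ARCH_REGS];  /* architectural reg -> latest physical tag */
    int next_tag;            /* next free physical tag */
};

/* Look up the tag a live-in should read (the latest producer's tag). */
int rename_live_in(const struct epc_renamer *r, int arch_reg) {
    return r->map[arch_reg];
}

/* Allocate a fresh tag for a live-out and publish it to later epochs. */
int rename_live_out(struct epc_renamer *r, int arch_reg) {
    int tag = r->next_tag++;
    r->map[arch_reg] = tag;
    return tag;
}
```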
Processing Elements (PE)
• MIPS-2000 like
  • In-order
  • Single-issue
  • Short pipeline
• Local register file
  • Intra-DC dependencies
• Communications manager
  • Inter-DC dependencies
[Figure: five-stage PE pipeline (F, D, E, M, W) with its local register file and communications manager]
Program execution (Cycle 0)
The EPC fetches, processes, renames and starts the epoch's execution.
[Figure: the epoch's dependence-cluster graph — DCs #0 through #10, covering instructions [0]–[22]]
Program execution (Cycle 1)
Initial EPC-PEs communication delay.
[Figure: the EPC dispatching DCs to the PE grid (PEs 1–6)]
Program execution (Cycle 2)
DCs #0, #7 and #8 start execution on their respective PEs.
[Figure: instructions 0, 18 and 19 in the IF stage of their PEs]
Program execution (Cycle 3)
Each PE continues its execution as statically scheduled by the compiler.
[Figure: instructions 0, 18 and 19 advance to the ID stage]
Program execution (Cycle 4)
DCs #1 and #2 start execution on their respective PEs.
[Figure: instructions 0, 18 and 19 reach EX; instructions 1, 2 and 16 enter IF]
Program execution (Cycle 5)
DC#0 (0-M) generates reg. t0, bypassed to the next instruction (1-EX) and sent to DCs #1 and #2.
[Figure: instructions 0, 18 and 19 reach the M stage; instructions 1, 2 and 16 in ID]
Program execution (Cycle 6)
DCs #1' and #2' (next instance) start execution. Reg. t0 arrives at DCs #1 and #2.
[Figure: instructions 0, 18 and 19 complete (W stage); instructions 1, 2 and 16 in EX; instructions 3 and 17 in IF; the next instances of DCs #1 and #2 are dispatched]
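The cycle-by-cycle walkthrough above can be summarized with a toy timing model. The latencies are assumptions taken from the slides (one pipeline stage per cycle, a one-cycle inter-PE network), and the function names are invented for illustration.

```c
/* Pipeline stages of the MIPS-like PE, one cycle apart. */
enum { IF = 0, ID = 1, EX = 2, M = 3, W = 4 };

/* Cycle at which an instruction fetched at fetch_cycle reaches stage. */
int stage_cycle(int fetch_cycle, int stage) {
    return fetch_cycle + stage;
}

/* Cycle at which a value produced at produce_cycle reaches another PE,
 * given the inter-PE network delay. */
int arrival_cycle(int produce_cycle, int network_delay) {
    return produce_cycle + network_delay;
}
```

With DC#0's first instruction fetched at cycle 2, it produces t0 in its M stage at cycle 5, and a one-cycle network delivers t0 to DCs #1 and #2 at cycle 6, matching the slides.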
Related Work
RAW
• Not hierarchical HW
• Exploits basic-block parallelism
GPA
• Grid of ALUs
• High instruction-fetch requirements
• Exploits hyperblock parallelism
Multiscalar
• Horizontal but not vertical code partitioning
• Superscalar branch treatment
ILDP
• Hardware-only approach
• Dynamic steering of dependent instructions to PEs
• Depends on an accumulator-based ISA
Trace Processors
• Hardware-only approach
• Dynamic paths are captured in traces
Carmelo Alexis Acosta Ojeda
What I try to express here is that DCs are intended to be all the possible threads from a single dynamic path where no parallelism can be exploited. In Trace Processors, each PE must exploit the parallelism of each trace, as opposed to CDE.
Carmelo Alexis Acosta Ojeda
But we have to keep in mind that RAW employs a level of execution hierarchy: each tile executes a single stream, made by the compiler.
Implementation considerations
Low-complexity architecture based on regularity
• Epoch Processing Core
• Grid of PEs
• Communication network
High performance due to far-fetched speculation
• Large virtual instruction window
Strong dependence on the compiler
• Code partitioning, DC communication
• Epochs limit the scope of optimizations
Alex Ramirez
Should we start with hyperblocks? We can select for eager execution the very same branches that the compiler chose for predication.
Solving multiple problems at once
CDE can also behave in a polymorphic way
• Exploiting ILP
  • Far-fetched speculation through epoch speculation
• Exploiting DLP
  • No need to re-dispatch a DC to the PEs
  • Simply re-start the DC with new data
Alex Ramirez
OF COURSE, this is only assuming that we can do dynamic DC-to-PE assignment.
Summary and conclusions
Hierarchical partitioning
• Epoch speculation keeps the transistors occupied
• Eager execution works around difficult branches
• DC helps keep complexity at bay
  • Amortizes the cost of speculation (squash, commit)
Scalable performance with more PEs
• Increasing wire delays may limit scalability
• Rely on the compiler to minimize communication
Design in its initial stages
• Lots of unanswered questions
  • Especially regarding the memory hierarchy
• Feedback is welcome!
Alex Ramirez
Also: memory ordering and memory disambiguation; memory aliasing and out-of-order memory access.
Alex Ramirez
And rely on the PE allocator to minimize communication too. We may use compiler HINTS, but we cannot afford static assignment (or else we need to RENAME the static assignment... which also seems a good idea: do not assign individual DCs, but a whole epoch at a time. That is: dynamic epoch mapping, but static DC mapping within an epoch?)