1
Understanding the Energy-Delay Tradeoff of ILP-based Compilation Techniques on a VLIW Architecture
G. Pokam, F. Bodin
CPC 2004, Chiemsee, Germany, July 7-9
2
Motivation
Source of complexity on high-performance VLIW processors:
• hardware duplication
• many FUs of different types (ALUs, LSUs, FPUs, BR, etc.)
• need for a large register file
Power growth factor: Power ~ IPC (driven by the compiler) and architecture complexity
3
Motivation
Assume a fixed architecture complexity; does compiling for higher ILP result in dissipating less power?
Which issues (architecture, software, etc.) affect power when compiling for ILP?
Try to figure out what happens analytically!
4
Agenda
• Motivation
• Used metrics
• Energy model
• Tradeoff analysis
• Hyperblock example
• Experiments
• Conclusions
5
Metric
Performance to energy ratio (PTE) [Gonzales, R. et al.]
$$PTE = \frac{\text{performance}}{\text{energy}} = \frac{1}{Energy_{BB} \times Delay_{BB}} = \frac{IPC}{N \times E_{BB}}$$
• $N$: nb. of oper. per basic block
• $IPC$: average nb. of oper. per bundle
• $E_{BB}$: energy per basic block
higher is better
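A minimal sketch of the metric, assuming the reading PTE = 1 / (E_BB × Delay_BB) with Delay_BB = N / IPC, i.e. one cycle per bundle. The function name and the numbers are hypothetical and only illustrate that, at equal energy, doubling IPC doubles PTE.

```python
def pte(n_ops: float, ipc: float, energy_bb: float) -> float:
    """Performance-to-energy ratio of one basic block: IPC / (N * E_BB)."""
    if n_ops <= 0 or ipc <= 0 or energy_bb <= 0:
        raise ValueError("n_ops, ipc and energy_bb must be positive")
    delay_bb = n_ops / ipc  # bundles needed to issue the block (assumed 1 cycle each)
    return 1.0 / (energy_bb * delay_bb)


# Hypothetical basic block: 12 operations, 40 energy units.
baseline = pte(n_ops=12, ipc=1.5, energy_bb=40.0)
denser = pte(n_ops=12, ipc=3.0, energy_bb=40.0)
print(baseline, denser)
```

In practice a denser schedule rarely leaves the energy unchanged, which is why the analysis tracks energy and IPC together.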
7
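Energy Model
The execution of a bundle $w_n$ dissipates an energy $EPB_{w_n}$:
$$EPB_{w_n} = E_c + IPC_{w_n} \times E_{op} + m \times p \times E_s + l \times q \times E_{miss}$$
• $E_c$: energy base cost
• $IPC_{w_n} \times E_{op}$: energy due to execution of the bundle
• $m \times p \times E_s$: energy due to D-cache misses
• $l \times q \times E_{miss}$: energy due to I-cache misses
Consider loop-intensive kernels …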
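The per-bundle energy can be sketched as a small function, assuming the four-term reading EPB = E_c + IPC × E_op + m × p × E_s + l × q × E_miss. The parameter values are hypothetical, and m, p, l, q are treated here simply as the coefficients of the D-cache and I-cache terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnergyParams:
    e_c: float     # base cost per bundle
    e_op: float    # energy per operation
    e_s: float     # energy weight of the D-cache miss term
    e_miss: float  # energy weight of the I-cache miss term


def energy_per_bundle(ipc: float, params: EnergyParams,
                      m: float = 0.0, p: float = 0.0,
                      l: float = 0.0, q: float = 0.0) -> float:
    """EPB = E_c + IPC*E_op + m*p*E_s + l*q*E_miss (assumed form)."""
    return (params.e_c
            + ipc * params.e_op
            + m * p * params.e_s
            + l * q * params.e_miss)


# Hypothetical energy units, not measured Lx values.
params = EnergyParams(e_c=10.0, e_op=3.0, e_s=5.0, e_miss=8.0)
no_miss = energy_per_bundle(2.0, params)
with_dmiss = energy_per_bundle(2.0, params, m=2.0, p=0.1)
print(no_miss, with_dmiss)
```

Leaving l and q at zero mimics a loop-intensive kernel in which the I-cache term is assumed not to contribute.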
9
Analysis
Use $PTE$ as a lever for power exploration
Assume R is a CFG region to be transformed into an ILP region H
A sufficient condition for this is given by
$$PTE_H \geq PTE_R$$
10
Analysis
Idea:
• keep track of IPC values that improve energy efficiency
• solve the PTE inequality at $IPC$:
$$IPC_H \geq ILPtransform(IPC_R)$$
• $IPC_H$: avg. #oper. in transformed region
• $IPC_R$: avg. #oper. in the CFG region R
11
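Analysis
$$ILPtransform(IPC_R) = \frac{A \cdot IPC_R}{B + C \cdot IPC_R}$$
where
$$A = f_H N_H n_H E_c + N_H s_H E_s$$
$$B = \sum^{m} \left( f_R N_R n_R E_c + N_R s_R E_s \right)$$
$$C = \left( \sum^{m} f_R N_R n_R - f_H N_H n_H \right) E_{op}$$
• f: exec. freq.
• N: # of oper.
• n: # of bundles
• s: # of stalls due to dmiss
• m: # of BB in region
C is a measure of extra work!
Shape of the ILPtransform function depends on the sign of C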
12
$IPC_H$ vs. $IPC_R$
C < 0:
• exponential shape means high extra work!
• dependence height mismatch
• resource contention
• e.g. Hyperblock: compensation code
C = 0:
• linear shape
• negligible extra work
C > 0:
• optimal scenario
• logarithmic shape
• e.g. Hyperblock: instruction merging
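The three shapes can be checked numerically. This is a sketch under the assumption that the break-even curve has the form A × IPC_R / (B + C × IPC_R); A = B = 1 and the values of C are hypothetical. A positive second difference indicates the convex ("exponential") shape, zero the linear one, and a negative value the concave ("logarithmic") one.

```python
import math


def ilp_transform(ipc_r: float, a: float, b: float, c: float) -> float:
    """Minimum IPC_H needed to break even, assuming A*IPC_R / (B + C*IPC_R)."""
    denom = b + c * ipc_r
    if denom <= 0:
        # In this model the extra work can no longer be compensated by any IPC.
        return math.inf
    return a * ipc_r / denom


def curvature(c: float, a: float = 1.0, b: float = 1.0) -> float:
    """Second difference over IPC_R = 1, 2, 3: >0 convex, 0 linear, <0 concave."""
    y1, y2, y3 = (ilp_transform(x, a, b, c) for x in (1.0, 2.0, 3.0))
    return y3 - 2.0 * y2 + y1


for c in (-0.2, 0.0, 0.2):  # hypothetical values of C
    curve = [round(ilp_transform(x, 1.0, 1.0, c), 2) for x in (1.0, 2.0, 3.0, 4.0)]
    print(f"C={c:+.1f}", curve)
```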
14
Hyperblock framework
• predication model via the select instruction
  slct dest = cond, src1, src2
• only hammock regions are considered
• single entry – single exit hyperblock
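A minimal sketch of how a hammock (if-then-else) collapses into straight-line code with a select. It assumes that `slct dest = cond, src1, src2` writes src1 when cond is true and src2 otherwise; the helper names are hypothetical, and the example models only the semantics, not the Lx encoding.

```python
def slct(cond: bool, src1: int, src2: int) -> int:
    """Model of the select: returns src1 if cond holds, else src2 (assumed operand order)."""
    return src1 if cond else src2


def abs_diff_branching(a: int, b: int) -> int:
    """Original hammock: two basic blocks guarded by a branch."""
    if a > b:
        r = a - b
    else:
        r = b - a
    return r


def abs_diff_if_converted(a: int, b: int) -> int:
    """If-converted form: both sides execute, the select keeps one result."""
    cond = a > b
    t1 = a - b  # former 'then' block, now executed unconditionally
    t2 = b - a  # former 'else' block, now executed unconditionally
    return slct(cond, t1, t2)


print(abs_diff_branching(7, 3), abs_diff_if_converted(7, 3))
```

Both sides of the hammock are now always executed, which is the kind of extra work that the energy analysis has to weigh against the IPC gain.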
15
Transformation heuristic
1. build the loop tree
2. traverse the loop tree from innermost to outermost loop
3. evaluate profit for each candidate loop region
4. propagate profit to CFG after transformation
$$profit = \frac{PTE_{transformed} - PTE_{original}}{PTE_{original}}$$
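The traversal and the profit test can be sketched as follows, assuming profit is the relative PTE gain (PTE_transformed − PTE_original) / PTE_original and that a region is transformed when its profit is positive. The `LoopNode` type, the threshold, and the PTE values are hypothetical; step 4 (propagating profit back to the CFG) is not modeled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LoopNode:
    name: str
    pte_original: float
    pte_transformed: float
    children: List["LoopNode"] = field(default_factory=list)


def profit(pte_original: float, pte_transformed: float) -> float:
    """Relative PTE gain of the transformed region over the original one."""
    return (pte_transformed - pte_original) / pte_original


def select_regions(root: LoopNode) -> List[str]:
    """Post-order walk: inner loops are evaluated before the loops enclosing them."""
    chosen: List[str] = []

    def visit(node: LoopNode) -> None:
        for child in node.children:
            visit(child)
        if profit(node.pte_original, node.pte_transformed) > 0:  # assumed threshold
            chosen.append(node.name)

    visit(root)
    return chosen


tree = LoopNode("outer", 1.0, 0.9, [
    LoopNode("inner1", 1.0, 1.2),
    LoopNode("inner2", 1.0, 1.0),
])
print(select_regions(tree))
```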
17
Platform
• Lx platform from STMicroelectronics
  • 4-issue VLIW machine
  • 64 GPRs, 8 CBRs
  • 4 ALUs, 1 LD/ST, 2 MULs, 1 BU
• Instruction-based energy model from STMicroelectronics
• Lx compiler
  • prefetch disabled
  • only scalar optimizations (-O2)
18
Methodology
Post-pass optimization
Tool flow: source → Lx Compiler → .s file → SALTO → .s file
phase 1: Instrumentation
• BB frequency
• Dmiss per BB
phase 2:
• Hyperblock formation
• Hyperblock optimization
  • instr. promotion
  • instr. merging
  • instr. renaming
Versions compared:
• original CFG
• selective hyperblock
• all hyperblock
19
Results
• negligible IPC improvement
• relatively larger increase of operation count and static schedule length
21
Conclusions
• Analytical scheme to understand the impact of ILP compilation on energy
• Heuristic shows 17% energy-delay improvement on a restricted hyperblock scheme
• Programs suffer from limited ILP, which quickly turns into wasted energy
• Need to go beyond compiler-centric approaches in order to overcome ILP limitations
What is missing:
• impact of post-optimization passes has not been determined
• only a restricted hyperblock scheme has been evaluated
22
Thanks!