1
Conservation Cores: Reducing the Energy of Mature Computations
Ganesh Venkatesh, Jack Sampson, Nathan Goulding, Saturnino Garcia, Vladyslav Bryksin, Jose Lugo-Martinez,
Steven Swanson, Michael Bedford Taylor
Department of Computer Science and Engineering,
University of California, San Diego
2
The Utilization Wall

Quantity              Classical scaling    Leakage-limited scaling
Device count          S^2                  S^2
Device frequency      S                    S
Device power (cap)    1/S                  1/S
Device power (Vdd)    1/S^2                ~1
Utilization           1                    1/S^2

Scaling theory
– Transistor and power budgets no longer balanced
– Exponentially increasing problem!
Experimental results
– Replicated small datapath
– More ‘Dark Silicon’ than active
Observations in the wild
– Flat frequency curve
– “Turbo Mode”
– Increasing cache/processor ratio
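The scaling rows on this slide can be multiplied out to see where the utilization figures come from. The sketch below is only an illustration of that arithmetic: the function names are hypothetical, and the per-generation factor S = 1.4 is an assumption, not a number from the slides.

```python
def relative_chip_power(s: float, leakage_limited: bool) -> float:
    """Power of a fully active chip at scaling factor s, relative to s = 1."""
    device_count = s ** 2      # S^2 more devices in the same area
    frequency = s              # each device switches S times faster
    cap_term = 1.0 / s         # per-device power factor from capacitance
    # Vdd factor: 1/S^2 under classical scaling, roughly 1 once leakage
    # prevents the supply voltage from scaling further.
    vdd_term = 1.0 if leakage_limited else 1.0 / s ** 2
    return device_count * frequency * cap_term * vdd_term


def utilization(s: float, leakage_limited: bool) -> float:
    """Fraction of the chip that can switch under a fixed power budget."""
    return min(1.0, 1.0 / relative_chip_power(s, leakage_limited))


# Assumed S = 1.4 per process generation (so S^2 is close to 2).
for generation, node in enumerate(["90nm", "65nm", "45nm", "32nm"]):
    s = 1.4 ** generation
    print(node, round(utilization(s, False), 3), round(utilization(s, True), 3))
```

Under classical scaling the four factors cancel, so the whole chip can stay active. With Vdd held roughly constant, full-chip power grows as S^2, so the usable fraction falls as 1/S^2, roughly halving each generation under the assumed S.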
3
The Utilization Wall
[Figure: Expected utilization for fixed area and power budget. Y-axis: utilization, 0.0 to 1.0. X-axis: 90nm, 65nm, 45nm, 32nm. Each step between successive nodes is annotated 2x.]
4
The Utilization Wall
[Figure: Utilization @ 300 mm^2 & 80 W. Y-axis: 0.00 to 0.20. Bars: 90nm TSMC 17.6%, 45nm TSMC 6.5%, 32nm ITRS 3.3%. Steps annotated 3x and 2x.]
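The 3x and 2x annotations can be checked against the three percentages shown. This is a small sketch, assuming the values pair with the nodes in decreasing order (17.6% at 90nm TSMC, 6.5% at 45nm TSMC, 3.3% at 32nm ITRS); the helper name `drops` is hypothetical.

```python
# (node, utilization) pairs as read from the chart, under the pairing assumed above.
measured = [("90nm TSMC", 0.176), ("45nm TSMC", 0.065), ("32nm ITRS", 0.033)]


def drops(points):
    """Return (older, newer, ratio) for each pair of successive nodes."""
    return [(a[0], b[0], a[1] / b[1]) for a, b in zip(points, points[1:])]


for older, newer, ratio in drops(measured):
    print(f"{older} -> {newer}: {ratio:.1f}x lower utilization")
```

The ratios come out at about 2.7 and 2.0, which the chart rounds to 3x and 2x.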
5
The Utilization Wall
Synopsys CAD flow
– Design Compiler
– IC Compiler
– TSMC 45nm
Simulation
– Validated cycle-accurate C-Core modules
– Post-route netlist simulation
Power measurement
– VCS + PrimeTime
[Figure: toolchain flow diagram. Boxes: Source, Hotspot analyzer, Hot code, Cold code, Rewriter, gcc, C-Core specification generator, Verilog generator, Synopsys flow, Simulation, Power measurement.]
23
Our cadre of C-Cores
We built 23 C-Cores for assorted versions of 5 applications
– Both patchable and non-patchable versions of each
– Varied in size from 0.015 to 0.326 mm^2
– Frequencies from 0.9 to 1.9 GHz
24
C-Core hot-code energy efficiency
[Figure: bar chart. Y-axis: per-function efficiency (work/J), 0 to 16. X-axis: djpeg A, djpeg B, mcf A, mcf B, vpr A, vpr B, cjpeg A, cjpeg B, bzip2 A-F, Avg. Series: Software, C-Core, C-Core (code changed).]
Up to 16x as efficient as a general-purpose in-order core, 9.5x on average
25
System energy efficiency
C-Cores very efficient for targeted hot code
Amdahl’s Law limits total system efficiency
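The Amdahl's Law limit can be made concrete with a short calculation. The sketch below is an illustration rather than the authors' model: `system_energy_gain` and the hot-code fractions are hypothetical, and only the 9.5x average hot-code gain comes from the slides.

```python
def system_energy_gain(hot_fraction: float, ccore_gain: float) -> float:
    """Whole-application energy-efficiency gain, in the style of Amdahl's Law.

    hot_fraction: share of baseline energy spent in code moved onto C-Cores.
    ccore_gain:   efficiency multiple the C-Cores achieve on that code.
    """
    if not 0.0 <= hot_fraction <= 1.0:
        raise ValueError("hot_fraction must be between 0 and 1")
    if ccore_gain <= 0.0:
        raise ValueError("ccore_gain must be positive")
    # Cold code keeps its original energy; hot code shrinks by ccore_gain.
    remaining_energy = (1.0 - hot_fraction) + hot_fraction / ccore_gain
    return 1.0 / remaining_energy


# Hypothetical coverage levels, with the 9.5x average hot-code gain.
for hot_fraction in (0.5, 0.8, 0.95):
    print(hot_fraction, round(system_energy_gain(hot_fraction, 9.5), 2))
```

Even with an unbounded hot-code gain, the system gain cannot exceed 1 / (1 - hot_fraction), which is why hot-code coverage matters as much as C-Core efficiency.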
26
C-Core system efficiency with current toolchain
[Figure: bar chart. Y-axis: normalized application EDP, 0 to 1. X-axis: djpeg A, djpeg B, mcf A, mcf B, vpr A, vpr B, cjpeg A, cjpeg B, bzip2 A-F, Avg. Series: Software, Patchable, +coverage, +lowleak.]
Base
– Avg 33% EDP improvement
27
Tuning system efficiency
Improving our toolchain’s coverage of hot code regions
– Good news: Small numbers of static instructions account for most of execution
System rebalancing for cold-code execution
– Improve performance/leakage trade-offs for host core
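The coverage observation suggests a simple selection rule: rank code regions by how much execution they account for and take the top few. The sketch below assumes a per-region dynamic instruction count is available. `pick_hot_regions`, the region names, and the counts are all hypothetical and are not taken from the authors' hotspot analyzer.

```python
def pick_hot_regions(profile: dict[str, int], target_coverage: float) -> tuple[list[str], float]:
    """Greedily pick the hottest regions until the coverage target is reached.

    profile maps a region name to its dynamic instruction count.
    Returns the chosen region names and the coverage they achieve.
    """
    total = sum(profile.values())
    if total == 0:
        return [], 0.0
    chosen: list[str] = []
    covered = 0
    # Visit regions from hottest to coldest.
    for name, count in sorted(profile.items(), key=lambda kv: kv[1], reverse=True):
        if covered / total >= target_coverage:
            break
        chosen.append(name)
        covered += count
    return chosen, covered / total


# Made-up profile: two regions account for 90% of execution.
example_profile = {"decode_mcu": 700, "idct": 200, "parse_header": 60, "error_exit": 40}
print(pick_hot_regions(example_profile, 0.9))
```

In this made-up profile, two of the four regions reach the 90% target, which is the kind of skew the slide describes.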
28
C-Core system efficiency with toolchain improvements
[Figure: bar chart. Y-axis: normalized application EDP, 0 to 1. X-axis: djpeg A, djpeg B, mcf A, mcf B, vpr A, vpr B, cjpeg A, cjpeg B, bzip2 A-F, Avg. Series: Software, Patchable, +coverage, +lowleak.]
With improved coverage
– Avg 53% EDP improvement
With coverage + low leakage system components
– Avg 61% EDP savings
– Avg 14% increased execution time
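Since EDP is the product of energy and delay, the two averages quoted for the coverage + low-leakage configuration imply an approximate energy figure. This is a back-of-the-envelope sketch: it assumes the averaged ratios can be combined directly, which is only an approximation, and `implied_energy_ratio` is a hypothetical helper.

```python
def implied_energy_ratio(edp_ratio: float, delay_ratio: float) -> float:
    """Energy relative to baseline, from EDP = energy * delay."""
    return edp_ratio / delay_ratio


edp_ratio = 1.0 - 0.61     # 61% EDP savings
delay_ratio = 1.0 + 0.14   # 14% increased execution time
energy_ratio = implied_energy_ratio(edp_ratio, delay_ratio)
print(f"implied energy: {energy_ratio:.2f} of baseline")
```

Under that assumption, energy falls to roughly a third of the software baseline, so the EDP savings come from energy reduction despite the longer execution time.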
29
Conclusions
The Utilization Wall will change how we build hardware
– Hardware specialization increasingly promising
Conservation Cores are a promising way to attack the Utilization Wall
– Automatically generated patchable hardware
– For hot code regions: 3.4x to 16x energy efficiency
– With tuning: 61% application EDP savings across system
– 45nm tiled C-Core prototype under development @ UCSD
Patchability allows C-Cores to last for ten years
– Lasts the expected lifetime of a typical chip