Computational Sprinting
June 2012
Source: acg.cis.upenn.edu/sprinting/computational_sprinting_june_2012.pdf
Arun Raghavan*, Yixin Luo+, Anuj Chandawalla+, Marios Papaefthymiou+, Kevin P. Pipe+#,
Thomas F. Wenisch+, Milo M. K. Martin*
University of Pennsylvania, Computer and Information Science*
University of Michigan, Electrical Eng. and Computer Science+
University of Michigan, Mechanical Engineering#
This work is licensed under the Creative Commons Attribution-Share Alike 3.0 United States License
• You are free:
  • to Share: to copy, distribute, display, and perform the work
  • to Remix: to make derivative works
• Under the following conditions:
  • Attribution. You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).
  • Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license.
• For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to:
  http://creativecommons.org/licenses/by-sa/3.0/us/
• Any of the above conditions can be waived if you get permission from the copyright holder.
• Apart from the remix rights granted under this license, nothing in this license impairs or restricts the author's moral rights.
Overview of My (Other) Research
• Multicore memory systems
  • Adaptive cache coherence protocols
  • Memory consistency: specification & implementation
  • “Why On-chip Cache Coherence is Here to Stay”, Communications of the ACM, July 2012
• Transactional memory
  • Semantics (what does “atomic” really mean?)
  • Extending transaction sizes & handling overflow
  • Conflict avoiding hardware via repair (true & false sharing)
• Hardware support for security
  • Goal: C/C++ as safe and secure as Java
  • Hardware/compiler co-design to provide memory safety
Talk Overview: Computational Sprinting
• Computational Sprinting
  • Unsustainable power for short, intense bursts of compute
• Feasibility study [HPCA’12]
  • Explored thermal, electrical, and architectural feasibility
  • Simulation results:
    • Significant responsiveness improvements in short bursts
    • With same dynamic energy consumption
• Preliminary results with sprinting on prototype-proxy
  • Characterize real energy/performance behavior
  • Sprinting can improve energy efficiency due to race to halt
Computational Sprinting and Dark Silicon
• A Problem: “Dark Silicon” a.k.a. “The Utilization Wall”
  • Increasing power density; can’t use all transistors all the time
  • Cooling constraints limit mobile systems
• One approach: Use few transistors for long durations
  • Specialized functional units [Accelerators, GreenDroid]
  • Targeted towards sustained compute, e.g. media playback
• Our approach: Use many transistors for short durations
  • Computational Sprinting by activating many “dark cores”
  • Unsustainable power for short, intense bursts of compute
  • Responsiveness for bursty/interactive applications
• Our goal: responsiveness of 16W chip in 1W platform
Is this feasible?
Sprinting Challenges and Opportunities
• Thermal challenges
  • How to extend sprint duration and intensity? Latent heat from phase change material close to the die
• Electrical challenges
  • How to supply peak currents? Ultracapacitor/battery hybrid
  • How to ensure power stability? Ramped activation (~100μs)
• Architectural challenges
  • How to control sprints? Thermal resource management
  • How do applications benefit from sprinting? 10.2x responsiveness for vision workloads via a 16-core sprint within 1W TDP
Outline
• Motivation: “Dark Silicon” and interactive apps
• Computational Sprinting
• Feasibility Study
• Performance Evaluation
  • Simulation results
  • Characterization of a real system
• Conclusion
Power Density Trends for Sustained Compute
[Figure: power vs. time and temperature vs. time; the thermal limit Tmax is marked, with a "> 10x" annotation]
How to meet thermal limit despite power density increase?
Option 1: Enhance Cooling?
Mobile devices limited to passive cooling
[Figure: temperature vs. time, with thermal limit Tmax]
Option 2: Decrease Chip Area?
Reduces cost, but sacrifices benefits from Moore’s law
Option 3: Decrease Active Fraction?
How do we extract application performance from this “dark silicon”?
Accelerator Cores?
• Heterogeneous cores [Conservation Cores ASPLOS’10, GreenDroid IEEE Comm., QsCores MICRO’11]
• Activate different parts of chip based on application
• Mobile chips already employ accelerators
[Figure: die photos of NVIDIA Tegra 2 (49 mm²) and Apple A5 (122 mm²)]
Design for Responsiveness
• Observation: today, design for sustained performance
• But, consider emerging interactive mobile apps… [Clemons+ DAC’11, Hartl+ ECV’11, Girod+ IEEE Signal Processing’11]
  • Intense compute bursts in response to user input, then idle
  • Humans demand sub-second response times [Doherty+ IBM TR ‘82, Yan+ DAC’05, Shye+ MICRO’09, Blake+ ISCA’10]
Peak performance during bursts limits what applications can do
Computational Sprinting: Designing for Responsiveness
Parallel Computational Sprinting
[Figure: power and temperature over time, with thermal limit Tmax; annotation: effect of thermal capacitance]
• State of the art: Turbo Boost 2.0 exceeds sustainable power with DVFS (~25% for 25s)
• Our goal: 10x, ~1s
Extending Sprint Intensity & Duration: Role of Thermal Capacitance
• Current systems designed for thermal conductivity
  • Limited capacitance close to die
• To explicitly design for sprinting, add thermal capacitance near die
  • Exploit latent heat from phase change material (PCM)
[Figure: die alone vs. die with PCM]
Augmented Sprinting with PCM
[Figure: power and temperature over time, marking Tmax, the PCM melting point, re-solidification, and the sustainable (single core) power level]
Outline
• Motivation
• Computational Sprinting
• Feasibility Study
  • Thermal
  • Electrical
  • Hardware/software
• Performance Evaluation
• Conclusion
Thermal Challenges
• Goal: 1s of 16x sprinting (16 1W cores)
• How much thermal capacitance does sprinting need?
  • 16W for 1s = 16J of heat, for PCM with latent heat 100J/g
  • 150mg, which is 2mm thick on 64mm² die
• Study based on thermal model of mobile phone
[Figure: temperature (°C) vs. time (s), two panels: 1s of sprint time; ~22s cool down time]
Heat flux and transient similar to desktop chips
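The sizing arithmetic on this slide can be sketched in a few lines. This is a back-of-the-envelope model only: it assumes all sprint heat is absorbed as latent heat and that nothing is dissipated during the sprint. The function name is illustrative, not from the talk. With the slide's inputs it gives about 160mg, the same order as the roughly 150mg quoted above.

```python
def pcm_mass_grams(sprint_power_w, sprint_seconds, latent_heat_j_per_g):
    """Mass of phase change material needed to absorb one sprint's heat.

    Assumes every joule of sprint energy goes into melting the PCM
    (no heat leaves through the package during the sprint).
    """
    heat_joules = sprint_power_w * sprint_seconds
    return heat_joules / latent_heat_j_per_g


# Inputs taken from the slide: 16 W for 1 s, latent heat 100 J/g.
mass_g = pcm_mass_grams(sprint_power_w=16.0, sprint_seconds=1.0,
                        latent_heat_j_per_g=100.0)
print(f"PCM mass: {mass_g * 1000:.0f} mg")
```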
Electrical Challenge #1: Peak Current Demands
• 16x sprinting exceeds limits of today’s phone batteries
  • Requires 16x peak current over baseline
• In contrast, ultracapacitors have high peak currents
  • Promising for sprinting (25F, 6.5g, 182J)
  • But today, lower energy density than batteries
• Leverage recent research on battery-ultracap hybrids [Mirhoseini+ ‘11, Palma+ ‘03, Pedram+ ’10]
  • Active research at all levels (batteries, ultracaps, hybrids)
• Cost of extra power/ground pins
[Figure: battery, voltage regulator, and ultracap supplying the sprinting chip]
[Figure: supply voltage vs. time for instantaneous activation]
Core activation latency << sprint duration
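As a rough check on the ultracapacitor figures quoted above (25F, 182J), the standard capacitor energy relation E = ½CV² gives the implied charge voltage, and dividing the stored energy by one sprint's energy gives an upper bound on sprints per charge. This is a sketch under idealized assumptions: no regulator losses, and the capacitor is treated as fully drainable, which a real design would not allow. The function names are illustrative.

```python
import math


def capacitor_voltage(capacitance_f, energy_j):
    """Voltage at which a capacitor stores the given energy (E = 1/2 * C * V^2)."""
    return math.sqrt(2.0 * energy_j / capacitance_f)


def max_sprints_per_charge(stored_energy_j, sprint_power_w, sprint_seconds):
    """Upper bound on whole sprints one charge could supply (ideal, lossless)."""
    return int(stored_energy_j // (sprint_power_w * sprint_seconds))


volts = capacitor_voltage(capacitance_f=25.0, energy_j=182.0)
sprints = max_sprints_per_charge(stored_energy_j=182.0,
                                 sprint_power_w=16.0, sprint_seconds=1.0)
print(f"implied voltage: {volts:.2f} V, ideal sprints per charge: {sprints}")
```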
Electrical Challenge #2: On-chip Voltage Stability
• 16x current spike can cause supply voltage instability
  • Potential timing errors and state loss
• Study of core activation induced instability
  • SPICE model of board, package, chip
  • Abrupt activation violates voltage stability; gradual activation is ok
[Figure: supply voltage vs. time]
Hardware/Software Challenges
• Activation
  • Sprinting activated when parallel work available
• Deactivation
  • Hardware detects impending overheating
• Monitor thermal budget
  • Energy from activity count + thermal model of system
• If nearing the thermal budget, ask runtime to migrate
• If software is unable to respond
  • Drastically cut frequency to sustainable
  • Throttle by factor of “number of cores”
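The control policy above might look roughly like the following sketch. It is not the authors' implementation: the class name, the 80% warning threshold, and the energy-per-event constant are hypothetical, and the thermal model is reduced to a plain energy counter with no cooling term.

```python
from dataclasses import dataclass


@dataclass
class SprintGovernor:
    budget_j: float               # thermal budget for one sprint (assumed value)
    warn_fraction: float = 0.8    # hypothetical threshold for "nearing" the budget
    used_j: float = 0.0           # energy charged against the budget so far

    def step(self, activity_count, joules_per_event, software_migrated):
        """Account for one interval of activity and return the next action."""
        self.used_j += activity_count * joules_per_event
        if self.used_j < self.warn_fraction * self.budget_j:
            return "sprint"              # plenty of budget left
        if software_migrated:
            return "sustained"           # runtime moved the threads to one core
        if self.used_j < self.budget_j:
            return "request_migration"   # ask the runtime to migrate
        return "throttle"                # software did not respond in time


def throttled_frequency(freq_hz, n_cores):
    """Hardware fallback: divide frequency by the number of active cores."""
    return freq_hz / n_cores


gov = SprintGovernor(budget_j=16.0)
actions = [gov.step(1000, 0.01, False),
           gov.step(300, 0.01, False),
           gov.step(400, 0.01, False)]
print(actions, throttled_frequency(1e9, 16))
```

The throttle path mirrors the last bullet: a 16-core sprint at 1GHz would fall back to 62.5MHz per core so that total power returns to roughly the single-core sustainable level.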
Performance Evaluation
Methodology
• In-order x86 many-core simulator
  • 16 cores
  • 32K, 8-way L1, 4MB 16-way shared LLC, directory cache coherence, 60ns memory latency, dual-channel 4GB/s memory interface
• Energy estimates from McPAT (1GHz, 1W, LOP)
  • Used to drive thermal model
• Workloads:
  • Vision kernels [SD-VBS], feature extraction app. [MEVBench]
Sprinting Responsiveness
[Figure: normalized speedup (0.5 to 16) vs. image size (megapixels), from less compute to more compute, for 1-core, Par-Sprint-150mg, and Par-Sprint-1.5mg; annotations: too little work; lack of thermal capacity]
Responsiveness Evaluation
Average 10.2x improvement in responsiveness
[Figure: normalized speedup for feature, disparity, sobel, texture, segment, and kmeans]
Parallelism & Energy
• No dynamic energy penalty when speedup linear
• Overall, 12% average dynamic energy increase
[Figure: normalized energy and normalized speedup for feature, disparity, sobel, texture, segment, and kmeans]
Moving Beyond Simulation: Can we make a real system sprint?
Characterize real system energy/performance
How might sprinting behave on a real system?
Multiple Cores: Energy and Performance
[Figure: normalized energy, normalized speedup, and power for feature, disparity, and segment on 1, 2, and 4 cores; annotations: 4x, 2x]
Power increases with core count, but < 2x for 4 cores. Why? Background power.
Energy consumption improves due to early completion: race to halt.
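The race-to-halt argument can be illustrated with a toy power model: a fixed background power plus a per-core increment. The wattages and the 3x speedup below are made-up illustrative numbers, not measurements from the talk; they are chosen only so that 4-core power stays under 2x, as the slide reports.

```python
def power_ratio(n_cores, background_w, per_core_w):
    """Total power with n_cores active, relative to one active core."""
    return (background_w + n_cores * per_core_w) / (background_w + per_core_w)


def energy_ratio(n_cores, background_w, per_core_w, speedup):
    """Energy for a fixed task relative to one core: power ratio / speedup."""
    return power_ratio(n_cores, background_w, per_core_w) / speedup


# Hypothetical platform: 8 W background, 2 W per active core, 3x speedup on 4 cores.
p = power_ratio(4, background_w=8.0, per_core_w=2.0)
e = energy_ratio(4, background_w=8.0, per_core_w=2.0, speedup=3.0)
print(f"power: {p:.2f}x, energy: {e:.2f}x")
```

Because background power is paid for the whole run, finishing sooner saves more than the extra cores cost whenever the speedup exceeds the power ratio.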
DVFS: Energy and Performance
[Figure: normalized speedup, power, and normalized energy vs. frequency (1.6 to 3.2 GHz) for feature, disparity, and segment; annotations: 2x, 3x]
Frequency   Voltage
1.6 GHz     0.95V
3.2 GHz     1.25V
Power increases by ~3x for 2x speedup
Min frequency is not always min energy
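A hedged sketch of why minimum frequency need not be minimum energy, using the textbook relation that dynamic power scales with f·V² and the two voltage/frequency points from the slide. The background-power fraction is an illustrative parameter, not a measured value, and run time is assumed to scale as 1/f (compute-bound work).

```python
def dynamic_power_ratio(f_lo_ghz, v_lo, f_hi_ghz, v_hi):
    """Dynamic power at the high operating point relative to the low one (P ~ f * V^2)."""
    return (f_hi_ghz * v_hi ** 2) / (f_lo_ghz * v_lo ** 2)


def energy_ratio(f_lo_ghz, v_lo, f_hi_ghz, v_hi, background_over_dynamic_lo):
    """Energy at the high point relative to the low point for the same work.

    background_over_dynamic_lo is the constant background power expressed as a
    multiple of the dynamic power at the low operating point (an assumption).
    """
    dyn_hi = dynamic_power_ratio(f_lo_ghz, v_lo, f_hi_ghz, v_hi)
    bg = background_over_dynamic_lo
    total_power_ratio = (bg + dyn_hi) / (bg + 1.0)
    time_ratio = f_lo_ghz / f_hi_ghz  # compute-bound: time scales as 1/f
    return total_power_ratio * time_ratio


print(dynamic_power_ratio(1.6, 0.95, 3.2, 1.25))   # dynamic power only
print(energy_ratio(1.6, 0.95, 3.2, 1.25, 0.0))     # no background power
print(energy_ratio(1.6, 0.95, 3.2, 1.25, 3.0))     # large background power
```

With no background power the faster point always costs more energy; once background power dominates, running faster and halting sooner wins.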
Energy/Performance/Power Tradeoffs
[Figure: normalized speedup vs. normalized energy, and normalized speedup vs. power, with points for 1 core and 4 cores at 1.6 GHz and 3.2 GHz; labels: max speedup, min energy, min power]
What if this is the maximum sustainable power?
Sprinting enables higher responsiveness and greater energy efficiency
Emulating Sprinting with Limited Energy Budget
• Emulate effect of being thermally constrained
  • Cooling is not currently physically constrained
  • Estimate thermal capacity for sprinting based on energy
• Sprint operation:
  • Execute with all cores at maximum frequency
  • Monitor energy consumption via MSR
• Terminate sprinting when energy capacity is exceeded
  • Migrate threads to single core
  • Shut down additional cores
  • Lower frequency to minimum
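The emulation procedure can be approximated with a simple analytical model: sprint until the energy budget is spent, then finish the remaining work at the baseline rate. This is a sketch, not the measurement harness used in the talk. It assumes all sprint energy counts against the budget and that the sprint rate is constant; the parameter values in the usage lines are hypothetical.

```python
def emulated_speedup(baseline_seconds, capacity_j, sprint_power_w, sprint_speedup):
    """Overall speedup of a task when sprinting is limited by an energy budget.

    baseline_seconds: run time on one core at minimum frequency
    capacity_j:       energy budget standing in for thermal capacity
    sprint_power_w:   power drawn while sprinting (all cores, max frequency)
    sprint_speedup:   progress rate while sprinting, relative to baseline
    """
    max_sprint_seconds = capacity_j / sprint_power_w
    # Sprint only as long as there is work left to do.
    sprint_seconds = min(max_sprint_seconds, baseline_seconds / sprint_speedup)
    remaining_seconds = baseline_seconds - sprint_seconds * sprint_speedup
    return baseline_seconds / (sprint_seconds + remaining_seconds)


# Hypothetical task: 10 s at baseline, 20 W sprint power, 5x sprint rate.
for capacity in (0.0, 20.0, 100.0):
    print(capacity, round(emulated_speedup(10.0, capacity, 20.0, 5.0), 2))
```

In this model the speedup depends on how much of the computation fits inside the sprint, which is the sensitivity the thermal capacity sweep explores.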
Effect of Thermal Capacity
[Figure: normalized speedup and normalized energy vs. thermal capacity (J), 0 to 100 J]
• Speedup sensitive to amount of computation within sprint
• Longer sprints enable greater energy saving
• Energy overhead from extremely short sprints
Work in progress: constrain cooling system
Caveat: mobile system background power characteristics likely differ
Conclusions
• Computational Sprinting
  • Targets responsiveness by far exceeding sustainable operation
  • Exploit phase change material as thermal buffer
• Explored feasibility of sprinting
  • Promising avenues for managing electrical, thermal and architectural barriers to sprinting
• Order-of-magnitude improvements in responsiveness
  • Within the constraints of a 1W device
• Opportunity to rethink the stack around responsiveness