UTILIZING DARK SILICON TO SAVE ENERGY WITH COMPUTATIONAL SPRINTING

Arun Raghavan, Laurel Emurian
University of Pennsylvania

Lei Shao, Marios Papaefthymiou, Kevin P. Pipe, Thomas F. Wenisch
University of Michigan

Milo M. K. Martin
University of Pennsylvania

Published by the IEEE Computer Society, 0272-1732/13/$31.00 © 2013 IEEE

COMPUTATIONAL SPRINTING ACTIVATES DARK SILICON TO IMPROVE RESPONSIVENESS BY BRIEFLY BUT INTENSELY EXCEEDING A SYSTEM’S SUSTAINABLE POWER LIMIT. SPRINTING CAN SAVE ENERGY AND IMPROVE RESPONSIVENESS BY ENABLING EXECUTION IN CHIP CONFIGURATIONS THAT, ALTHOUGH THERMALLY UNSUSTAINABLE, IMPROVE ENERGY EFFICIENCY. THIS ENERGY SAVINGS CAN IMPROVE THROUGHPUT EVEN FOR LONG-RUNNING COMPUTATIONS. REPEATEDLY ALTERNATING BETWEEN SPRINT AND IDLE MODES WHILE MAINTAINING SUSTAINABLE AVERAGE POWER CAN OUTPERFORM STEADY-STATE COMPUTATION AT THE PLATFORM’S THERMAL LIMIT.

Researchers predict increasingly underutilized chip area (dark silicon) with continued CMOS scaling for all designs,[1,2] and the impact of dark silicon in mobile devices is likely to be particularly acute. Because of limited heat-venting capability, researchers project that only 10 percent of transistors on a mobile chip can remain active on a sustained basis.[3] To extract value from dark silicon in such thermally constrained settings, our earlier work proposed computational sprinting, an approach to improve responsiveness for interactive applications by briefly exceeding sustainable thermal limits through activating otherwise idle cores and increasing frequency[4] (see the "Computational-Sprinting Overview" sidebar).

Here, we further explore counterintuitive findings regarding the energy implications of sprinting. Our original simulation study naively concluded that sprinting by activating reserve cores would be energy-neutral at best, because it assumed that chip power was solely due to active cores. In fact, real chips incur significant background power overheads (above the idle power) to activate even a single core because of shared "uncore" components, such as caches and interconnects. Sprinting activates dark silicon cores to use these resources more efficiently, and it also lets them idle sooner by completing computation faster. Previous literature has noted this race-to-idle effect.[5-8] However, under thermal constraints, "sprinting to idle" reveals new ways to use dark silicon to conserve energy. By leveraging sprinting to perform a staccato sprint-and-rest execution, wherein the system alternates between
sprinting and idling at a duty cycle that maintains a thermally sustainable average power, sprinting can use dark silicon to actually improve throughput over conventional steady-state execution at a thermally sustainable pace.

This article develops a simple analytical model to describe the conditions on speedup and platform power under which sprint-to-idle and sprint-and-rest can improve both performance and energy efficiency. We present results from a hardware testbed to empirically validate the model and show that both sprint-to-idle and sprint-and-rest do indeed provide faster and more energy-efficient modes of operation than simple slow-and-steady sustainable execution.
When does sprinting save energy?

The opportunity to save energy with sprinting arises because of the power required to keep a chip’s shared (that is, uncore) components active to support the operation of even a single core. Despite being classified as overhead, this background power is fundamental to acceptable performance. For instance, last-level caches and interconnects reduce miss rate and penalty by staging and moving data to the core. In the system we use for evaluations, this uncore power is twice the power of a single core at minimum operating frequency (Table 1). System designs embrace this overhead and seek to amortize it by scaling up computation resources (for example, adding more cores) to compute in a more energy-efficient way per operation; a similar argument was made for the cost of parallel-computing systems by Wood and Hill.[9] Because of this background power, speeding up computation can save energy by racing-to-idle and reducing the time for which the background components remain active.[5-8]

Although it is desirable to operate in these energy-efficient modes, the thermal constraints that give rise to dark silicon can also preclude such sustained operation. By intermittently activating dark silicon, computational sprinting can reduce the energy per operation via sprint-to-idle. To explain these previously seen advantages,[10] we present an analytical model using the parameters in Table 2. The model separates core (active), uncore (background), and idle power based on empirical data from a real system. The model doesn’t explicitly include workload-dependent variation in power consumption, because our results show low variation across the workloads used in the evaluation. For a given frequency, the system we evaluate closely fits a model with a fixed background power and active power growing linearly in direct proportion to the number of active cores (Table 1). Both background power (P_uncore(f)) and active core power (P_core(N, f)) vary with frequency. The model compares the energy of sprinting relative to a sustainable baseline execution at minimum power with one core at f_min (that is, at power P_core(1, f_min)) while obtaining a speedup of S(N, f).
Using this model, we analyze the energy impact of sprinting, considering three questions:

• When does sprinting improve energy efficiency per operation?
• When does sprinting result in a net energy savings when also considering the implications of nonnegligible idle power?
• When is repeated sprint-and-rest more energy-efficient than steady computation at sustainable power in thermally constrained environments?
Sprinting to reduce energy per operation

The total energy a system consumes while computing is as follows:

$$\text{Energy during computation} = (\text{core power} + \text{background power}) \times \text{computation time}$$
Table 1. Testbed power profile.

| Frequency f | Cores N | Total power P_total(N, f) | Normalized power P_total(N, f)/P_total(1, f_min) | Peak speedup S(N, f) | Mode | P_core(1, f) | P_uncore(f) |
|---|---|---|---|---|---|---|---|
| 1.6 GHz | 1 | ~10 W | 1× | 1× | Sustainable | 3.3 W | 6.6 W |
| 1.6 GHz | 2 | ~13 W | 1.3× | 2× | - | 3.3 W | 6.6 W |
| 1.6 GHz | 4 | ~20 W | 2× | 4× | Parallel sprint | 3.3 W | 6.6 W |
| 3.2 GHz | 1 | ~20 W | 2× | 2× | - | 10 W | 10 W |
| 3.2 GHz | 2 | ~30 W | 3× | 4× | - | 10 W | 10 W |
| 3.2 GHz | 4 | ~50 W | 5× | 8× | Parallel + DVFS sprint | 10 W | 10 W |

DVFS: dynamic voltage and frequency scaling.
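To compare energy relative to the baseline execution with a single core operating at frequency f_min, we can express the core power and computation time in terms of their baseline counterparts:

$$
\begin{aligned}
E_{\text{compute}}(N, f) &= \bigl(P_{\text{core}}(N, f) + P_{\text{uncore}}(f)\bigr)\, t_{\text{compute}}(N, f) \\
&= \bigl(N \times P_{\text{core}}(1, f) + P_{\text{uncore}}(f)\bigr) \times \frac{t_{\text{compute}}(1, f_{\min})}{S(N, f)}
\end{aligned}
$$

By setting N to 1 and f to f_min:

$$E_{\text{compute}}(1, f_{\min}) = \bigl(P_{\text{core}}(1, f_{\min}) + P_{\text{uncore}}(f_{\min})\bigr) \times t_{\text{compute}}(1, f_{\min})$$

Thus, the relative energy is

$$
\text{Relative energy} = \frac{E_{\text{compute}}(N, f)}{E_{\text{compute}}(1, f_{\min})}
= \frac{N \times P_{\text{core}}(1, f) + P_{\text{uncore}}(f)}{S(N, f)\,\bigl(P_{\text{core}}(1, f_{\min}) + P_{\text{uncore}}(f_{\min})\bigr)} \tag{1}
$$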
For more energy-efficient computation (relative energy < 1), the higher-power sprint modes must deliver a minimum speedup:

$$S(N, f) > \frac{N \times P_{\text{core}}(1, f) + P_{\text{uncore}}(f)}{P_{\text{core}}(1, f_{\min}) + P_{\text{uncore}}(f_{\min})}$$
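The threshold above can be evaluated for each mode in Table 1. The following is a minimal sketch, assuming the rounded per-core and uncore powers listed in the table; the dictionary and function names are illustrative and not part of the article.

```python
# Approximate powers from Table 1, in watts (the article reports rounded values).
P_CORE = {1.6: 3.3, 3.2: 10.0}    # P_core(1, f), keyed by frequency in GHz
P_UNCORE = {1.6: 6.6, 3.2: 10.0}  # P_uncore(f), keyed by frequency in GHz
F_MIN = 1.6                       # minimum operating frequency in GHz


def min_speedup(n_cores, freq):
    """Speedup a sprint mode must exceed to use less active energy than the
    one-core, f_min baseline (the inequality that follows Equation 1)."""
    sprint_power = n_cores * P_CORE[freq] + P_UNCORE[freq]
    baseline_power = P_CORE[F_MIN] + P_UNCORE[F_MIN]
    return sprint_power / baseline_power


for freq in (1.6, 3.2):
    for n in (1, 2, 4):
        print(f"{n} core(s) at {freq} GHz: speedup must exceed {min_speedup(n, freq):.2f}x")
```

With these inputs the four-core, 1.6-GHz mode needs a speedup above 2×, and the four-core, 3.2-GHz mode needs roughly 5×, both below the peak speedups of 4× and 8× that Table 1 lists for those modes.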
For the particular case of sprinting by activating additional cores without frequency scaling (f = f_min), expressing the minimum required speedup in terms of the ratio of background power to core power leads to the following inferences:

$$
S(N, f_{\min}) > \frac{N + \dfrac{P_{\text{uncore}}(f_{\min})}{P_{\text{core}}(1, f_{\min})}}{1 + \dfrac{P_{\text{uncore}}(f_{\min})}{P_{\text{core}}(1, f_{\min})}} \tag{2}
$$
With no background power, we would need ideal, linear speedup (S(N, f_min) = N) for any additional cores to even be energy-neutral with single-core operation. However, nonzero background power reduces the minimum speedup required: the same power ratio is added to both terms of Equation 2, so the denominator grows proportionally much more than the numerator. Therefore, with higher background power, even sublinear speedup can be more energy-efficient than operating in the lowest-power mode. For example, in our evaluation system, in which P_uncore(f_min) is 6.6 W and P_core(1, f_min) is 3.3 W, the ratio is 2 and Equation 2 gives (4 + 2)/(1 + 2) = 2, so a speedup exceeding 2× with four cores is sufficient for saving energy.

Table 2. Parameters used in energy analysis.

| Parameter | Derivation | Meaning |
|---|---|---|
| N | Input | Number of cores |
| f | Input | Operating frequency |
| f_min | Input | Minimum operating frequency |
| t_compute(N, f) | Input | Computation time with N cores at frequency f |
| P_idle | Input | Idle power |
| P_uncore(f) | Input | Background power at frequency f |
| P_core(1, f) | Input | Core power with 1 core at frequency f |
| P_core(N, f) | N × P_core(1, f) | Core power with N cores at frequency f |
| P_total(N, f) | P_core(N, f) + P_uncore(f) | Sprint (total) power with N cores at frequency f |
| P_sustainable | Assumed as P_total(1, f_min) | Maximum thermally sustainable power |
| S(N, f) | t_compute(1, f_min) / t_compute(N, f) | Speedup at N cores and frequency f relative to baseline at 1 core and f_min |
| t_idle(N, f) | t_compute(N, f) × S(N, f) − t_compute(N, f) | Idle time after computing with N cores at frequency f |
| E_compute(N, f) | P_total(N, f) × t_compute(N, f) | Energy required for active computation with N cores at frequency f |
| E_idle(N, f) | P_idle × t_idle(N, f) | Energy spent idling after computing with N cores at frequency f |
| E_total(N, f) | E_compute(N, f) + E_idle(N, f) | Total energy across time required for computation |
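To see how the background-to-core power ratio moves the break-even point in Equation 2, the bound can be tabulated for a few ratios. This is an illustrative sketch; the function name and the sampled ratios are assumptions, and only the ratio of 2 corresponds to the evaluation system.

```python
def min_speedup_parallel(n_cores, uncore_to_core_ratio):
    """Equation 2: minimum speedup for core-only sprinting at f_min,
    given the ratio P_uncore(f_min) / P_core(1, f_min)."""
    return (n_cores + uncore_to_core_ratio) / (1 + uncore_to_core_ratio)


# A ratio of 0 models a chip with no background power; 2.0 matches the
# 6.6 W / 3.3 W figures quoted for the evaluation system.
for ratio in (0.0, 0.5, 1.0, 2.0, 4.0):
    bound = min_speedup_parallel(4, ratio)
    print(f"P_uncore/P_core = {ratio}: four cores need a speedup above {bound:.2f}x")
```

The required speedup falls from 4× with no background power to 2× at a ratio of 2, which is the case worked through in the text.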
Implications of idle power

The above analysis considered the energy spent only while the system was active. However, after completing a task sooner by sprinting, the system returns to its idle state, which typically incurs nonzero idle power. A more conservative model should consider this idle energy and compare total system energy over the same time period as the slower baseline execution.
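The minimum speedup required for core-only sprinting to be energy-efficient is therefore:

$$
S(N, f_{\min}) > \frac{N + \dfrac{P_{\text{uncore}}(f_{\min}) - P_{\text{idle}}}{P_{\text{core}}(1, f_{\min})}}{1 + \dfrac{P_{\text{uncore}}(f_{\min}) - P_{\text{idle}}}{P_{\text{core}}(1, f_{\min})}} \tag{4}
$$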
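One way to arrive at this bound from the definitions in Table 2 is sketched below. It assumes that the sprint and the baseline are compared over the same interval t_compute(1, f_min), that the system idles at P_idle for the remainder of that interval, and that P_total(1, f_min) exceeds P_idle; it is a reconstruction under those assumptions rather than the article's own derivation.

```latex
\begin{align*}
E_{\text{total}}(N, f_{\min})
  &= P_{\text{total}}(N, f_{\min})\,\frac{t_{\text{compute}}(1, f_{\min})}{S(N, f_{\min})}
   + P_{\text{idle}}\left(1 - \frac{1}{S(N, f_{\min})}\right) t_{\text{compute}}(1, f_{\min}) \\
E_{\text{total}}(1, f_{\min})
  &= P_{\text{total}}(1, f_{\min})\, t_{\text{compute}}(1, f_{\min}) \\[4pt]
E_{\text{total}}(N, f_{\min}) < E_{\text{total}}(1, f_{\min})
  &\iff \frac{P_{\text{total}}(N, f_{\min}) - P_{\text{idle}}}{S(N, f_{\min})}
        < P_{\text{total}}(1, f_{\min}) - P_{\text{idle}} \\
  &\iff S(N, f_{\min}) >
        \frac{N \times P_{\text{core}}(1, f_{\min}) + P_{\text{uncore}}(f_{\min}) - P_{\text{idle}}}
             {P_{\text{core}}(1, f_{\min}) + P_{\text{uncore}}(f_{\min}) - P_{\text{idle}}}
\end{align*}
```

Dividing the numerator and denominator of the last line by P_core(1, f_min) gives the form shown in Equation 4.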
Equation 4 is similar to the previous requirement on speedup (Equation 2), except that the background power is now offset by P_idle; if the idle power is zero, the two equations are identical. We typically expect idle power to be lower than background power in most reasonably engineered systems. In a sprint-enabled system, when sufficient speedup is obtained, it can be possible to use dark silicon to sprint-to-idle to save energy. The opportunity for saving energy grows with the difference between background and idle power. For example, in our evaluation system (where P_idle is 5 W, P_uncore(f_min) is 6.6 W, and P_core(1, f_min) is 3.3 W, as stated previously), a speedup exceeding 3× with four cores is sufficient for saving energy.
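The 3× figure can be reproduced directly from Equation 4. The sketch below uses the power values quoted in the text; the function and parameter names are illustrative assumptions.

```python
def min_speedup_with_idle(n_cores, p_core, p_uncore, p_idle):
    """Equation 4: minimum speedup for core-only sprinting at f_min once
    idle energy after the sprint is charged to the sprinting run."""
    offset_ratio = (p_uncore - p_idle) / p_core
    return (n_cores + offset_ratio) / (1 + offset_ratio)


# Evaluation-system values quoted in the text (watts).
with_idle = min_speedup_with_idle(4, p_core=3.3, p_uncore=6.6, p_idle=5.0)
no_idle = min_speedup_with_idle(4, p_core=3.3, p_uncore=6.6, p_idle=0.0)

print(f"Idle power 5 W: speedup must exceed {with_idle:.2f}x")
print(f"Idle power 0 W: speedup must exceed {no_idle:.2f}x (same as Equation 2)")
```

Setting `p_idle` to zero recovers the 2× bound from Equation 2, which shows how much of the sprint-to-idle opportunity the 5 W idle power consumes on this system.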
Sprint-and-rest

Long-running computations are conventionally executed at a steady, sustainable operating mode that consumes less power than the rate at which the system can dissipate heat (allowing the chip to operate indefinitely). However, in a sprint-enabled system, we can also consider an operating regime that alternates between sprint and rest periods. Provided that the sprint periods are short enough to remain within temperature bounds, and that the rest periods are long enough to dissipate the accumulated heat, such a sprint-and-rest operation mode is also sustainable indefinitely.

More directly, sprint-and-rest operation is sustainable as long as the average (but not necessarily instantaneous) power dissipation over a sprint-and-rest cycle is at or below the platform’s sustainable power dissipation. If the system’s thermal power limit is P_sustainable when operating with 1 core at f_min, then any increase of cores or frequency is therefore unsustainable and must only be engaged for a fraction of time, r_sprint(N, f), after which the system must idle:
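$$P_{\text{sprint-and-rest}} \le P_{\text{sustainable}}$$

$$P_{\text{total}}(N, f)\, r_{\text{sprint}}(N, f) + P_{\text{idle}}\,\bigl(1 - r_{\text{sprint}}(N, f)\bigr) \le P_{\text{sustainable}}$$

Therefore,

$$r_{\text{sprint}}(N, f) \le \frac{P_{\text{sustainable}} - P_{\text{idle}}}{P_{\text{total}}(N, f) - P_{\text{idle}}} \tag{5}$$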
Because active computation occurs only in the sprint phase, the effective speedup of sprint-and-rest operation over steady baseline operation at P_sustainable is given by Equation 6.
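Equation 5 can be evaluated for the two sprint modes in Table 1. The sketch below uses the table's approximate powers; the break-even calculation additionally assumes that effective speedup is the sprint speedup multiplied by the sprint fraction, which is one reading of the sentence above and not a formula reproduced from the article.

```python
P_SUSTAINABLE = 10.0  # W, one core at 1.6 GHz (Table 1, approximate)
P_IDLE = 5.0          # W, approximate idle power of the testbed


def max_sprint_fraction(p_total, p_sustainable=P_SUSTAINABLE, p_idle=P_IDLE):
    """Equation 5: upper bound on the fraction of time spent sprinting."""
    return (p_sustainable - p_idle) / (p_total - p_idle)


def break_even_speedup(p_total):
    # ASSUMPTION: useful work is done only while sprinting, so the effective
    # speedup is S * r_sprint and break-even requires S > 1 / r_sprint.
    return 1.0 / max_sprint_fraction(p_total)


for name, p_total in (("Parallel", 20.0), ("Parallel+DVFS", 50.0)):
    r = max_sprint_fraction(p_total)
    s = break_even_speedup(p_total)
    print(f"{name}: r_sprint <= {r:.3f}, break-even speedup about {s:.1f}x")
```

Under that assumption, the 20 W mode breaks even at roughly 3×, while the 50 W mode would need roughly 9×, which is above the 8× peak speedup listed for it in Table 1.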
To sprint, a chip must offer an operating point where its peak power greatly exceeds the sustainable power dissipation of its cooling system. Existing mobile chips have been designed with peak power envelopes easily dissipated via passive cooling, and thus are inadequate for our study. Instead, we study a sprinting testbed system as a proxy for the thermal characteristics of a future sprint-enabled device. This system uses an Intel Core i7 2600 quad-core Sandy Bridge chip.[10] The chip can operate with one to four cores over a frequency range from 1.6 GHz to 3.2 GHz. Table 1 shows this chip’s power and peak performance for the relevant subset of these modes. The power is measured using energy counters that reflect package-level energy consumption.[11] Table 1 shows that, for each frequency, the total power is well approximated as a fixed background power and a per-core power multiplied by the number of active cores.

We reduce the chip’s heat-venting capacity by removing its heat sink and tuning its fan to dissipate 10 W, so that the temperature settles at the maximum recommended operating temperature of 75°C when running with a single core at 1.6 GHz; all other modes are not thermally sustainable and hence are sprint modes. The chip idles at ~5 W, causing its initial temperature to settle at 50°C. The chip’s internal heat spreader (~20 g of copper) can store up to 188 J of heat for a 25°C temperature increase, allowing several seconds of sprinting with the four-core, 1.6-GHz (Parallel) and four-core, 3.2-GHz (Parallel+DVFS) modes.

We evaluate the performance and energy impact of sprinting on our test platform using a suite of vision kernels. The kernels’ inputs are sized such that each completes within a single sprint without exhausting thermal capacitance; other work investigates cases when thermal capacitance is exhausted midsprint.[10]
Performance and energy impact of sprinting

We first measure the speedup provided by sprinting. We then predict available energy savings from sprinting-to-idle according to the model and compare those predictions to empirical energy measurements.

Speedup

When sprinting is employed with four cores at 1.6 GHz, the maximum potential speedup over the single-core baseline is 4×; with Parallel+DVFS sprinting, maximum speedup is 8× (4× cores, 2× frequency). Figure 1a shows the achieved speedups for our vision kernels: Parallel+DVFS enables 6.3× speedup on average, whereas Parallel sprinting achieves 3.5× speedup on average. These speedups imply that sprinting allows this system to complete in just a few seconds what would have taken more than 15 seconds if constrained to operate only in sustainable (nonsprinting) mode.
Energy efficiency of sprinting

We can predict the energy impact of sprinting from the measured speedups and the models developed earlier. Relative to the sustainable baseline (P_core(1, f_min)), Equation 1 with S(N, f) = 6.3× predicts the active energy to be 0.79× the baseline for Parallel+DVFS sprinting, and 0.57× for Parallel sprinting, with S(N, f) = 3.5× (substantial energy savings). The lower component (darker) of Figure 1b shows the experimentally observed energy for the duration of the sprint. The measured energy (0.77× and 0.60×) for both sprinting modes closely matches the model prediction and confirms that sprinting enables lower energy per operation.
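The predictions from Equation 1 can be recomputed from Table 1 and the average speedups. This is a rough sketch that relies on the table's rounded power values, so the results land near, but not exactly on, the quoted 0.79× and 0.57×; the function and constant names are illustrative.

```python
# Baseline (one core at 1.6 GHz) powers from Table 1, in watts (approximate).
P_CORE_MIN = 3.3
P_UNCORE_MIN = 6.6


def relative_active_energy(n_cores, p_core_f, p_uncore_f, speedup):
    """Equation 1: active energy of a sprint relative to the one-core,
    f_min baseline, ignoring idle energy after the sprint."""
    sprint_power = n_cores * p_core_f + p_uncore_f
    baseline_power = P_CORE_MIN + P_UNCORE_MIN
    return sprint_power / (speedup * baseline_power)


# Average measured speedups reported in the text: 6.3x and 3.5x.
dvfs = relative_active_energy(4, p_core_f=10.0, p_uncore_f=10.0, speedup=6.3)
parallel = relative_active_energy(4, p_core_f=3.3, p_uncore_f=6.6, speedup=3.5)

print(f"Parallel+DVFS: {dvfs:.2f}x of baseline active energy")
print(f"Parallel:      {parallel:.2f}x of baseline active energy")
```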
Energy efficiency of sprint-to-idle

After a computation completes during a sprint, however, the system continues to consume some energy while idle. The model accounts for this idle energy in Equation 3. Relative to sustained operation, the predicted total energy is 1.20× the baseline for Parallel+DVFS sprinting (a net energy loss), whereas Parallel sprinting consumes 0.93× the baseline energy (continuing to provide a net energy savings). We account for the additional idle energy in the upper, lighter component of each bar in Figure 1b. Again, the measured energy confirms the model (21 percent energy overhead with Parallel+DVFS sprinting, and 6 percent energy savings with Parallel sprinting).
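A total-energy estimate that includes the idle period can be sketched from the Table 2 definitions. The formula below is an assumption built from those definitions (sprint for 1/S of the baseline time, idle for the rest) rather than a reproduction of Equation 3, and it uses the approximate powers from Table 1, so small deviations from the quoted 1.20× and 0.93× are expected.

```python
def relative_total_energy(p_total, speedup, p_idle=5.0, p_sustainable=10.0):
    """Total energy (sprint plus idle) over the baseline's run time,
    normalized to the energy of the sustainable baseline.

    ASSUMPTION: the system sprints for 1/speedup of the baseline time and
    idles at p_idle for the remainder, per the definitions in Table 2.
    """
    sprint_share = 1.0 / speedup
    average_power = p_total * sprint_share + p_idle * (1.0 - sprint_share)
    return average_power / p_sustainable


dvfs_total = relative_total_energy(p_total=50.0, speedup=6.3)
parallel_total = relative_total_energy(p_total=20.0, speedup=3.5)

print(f"Parallel+DVFS: {dvfs_total:.2f}x of baseline total energy")
print(f"Parallel:      {parallel_total:.2f}x of baseline total energy")
```

The estimate preserves the qualitative result: the Parallel+DVFS sprint costs more total energy than the baseline, while the Parallel sprint still saves energy.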
Sprint-and-rest for long-running computations

We experiment with sprint-and-rest in our system with both the Parallel and Parallel+DVFS modes of sprinting. From Equation 5, for the Parallel sprint drawing 20 W, the fraction of time spent in sprint mode cannot exceed 1/3.1 (r_sprint(N, f) = 1 : 3.1) to provide a sustainable average power. To avoid overheating during an individual sprint, sprint duration for Parallel sprinting cannot exceed 20 seconds. Thus, we selected a sprint duration of 5 seconds and a rest duration of 10.5 seconds (r_sprint(N, f) = 1 : 3.1). Similarly, for Parallel+DVFS sprinting at 50 W, we selected the sprint duration as 1.5 seconds (less than the 3 seconds maximum sprint duration), and a rest duration of 12.3 seconds (r_sprint(N, f) = 1 : 9.1). Substituting these values in Equation 6, we would expect Parallel sprinting to be 23 percent more energy-efficient, and Parallel+DVFS sprinting to be 22 percent less energy-efficient, compared to sustained execution at a constant 10 W of power.
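As a quick check that the chosen schedules respect the average-power constraint, the cycle-average power can be computed from the sprint and rest durations. The sketch assumes the approximate 20 W, 50 W, and 5 W figures from the text; the function name is illustrative.

```python
def average_power(p_sprint, t_sprint, p_idle, t_rest):
    """Time-weighted average power over one sprint-and-rest cycle."""
    return (p_sprint * t_sprint + p_idle * t_rest) / (t_sprint + t_rest)


P_IDLE = 5.0          # W, approximate idle power
P_SUSTAINABLE = 10.0  # W, sustainable power limit of the testbed

# Schedules quoted in the text: (sprint power, sprint seconds, rest seconds).
schedules = {
    "Parallel": (20.0, 5.0, 10.5),
    "Parallel+DVFS": (50.0, 1.5, 12.3),
}

for name, (p_sprint, t_sprint, t_rest) in schedules.items():
    avg = average_power(p_sprint, t_sprint, P_IDLE, t_rest)
    print(f"{name}: average {avg:.2f} W (limit {P_SUSTAINABLE:.0f} W)")
```

Both schedules average slightly under 10 W, so each stays within the sustainable limit with a small margin.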
Figure 2a shows the power traces for the Sobel workload executed on the testbed for more than 8 minutes with sustained and sprint-and-rest modes (for both Parallel and Parallel+DVFS sprinting) under the previously discussed duty cycles. Figure 2b compares the resulting cumulative work done when operating in these modes. The Parallel+DVFS sprint-and-rest mode underperforms sustained execution at the sustainable thermal limit by 21 percent. However, the Parallel mode of sprint-and-rest performs 20 percent more work over sustained operation on average.
Figure 1. Speedup (S(N, f)) (a) and energy breakdown (E_compute(N, f); E_idle(N, f)) normalized to the one-core 1.6-GHz sustainable baseline (E_compute(1, f_min)) (b) for four cores at 3.2 GHz and 1.6 GHz. Panel (a) plots normalized speedup and panel (b) plots normalized energy, split into sprint and idle components; both panels show parallel+DVFS and parallel-only bars for the sobel, disparity, segment, kmeans, feature, and texture kernels. Sprinting with parallelism and DVFS results in an average speedup of 6.3× (a) using 23 percent less energy to perform the computation (dark component in part b). However, idle energy after sprinting (light component in part b) causes a total energy loss of 20 percent. Sprinting with parallelism alone results in lower average speedup (3×), but saves energy (6 percent), even considering idle energy.

The above results provide ample motivation for chip designers to further optimize idle power; although the chip used for the evaluation already achieves 10-to-1 ratios between peak and idle power, the analytical as well as empirical results indicate that energy efficiency gains of sprinting would increase if idle power is further reduced. Thus, by keeping dark silicon as dark as possible when idle, and operating dark silicon beyond sustainable power when active, computational sprinting has the potential to continue harnessing Moore’s law to deliver intense performance when necessary, and even save energy in the process.
His research interests include parallel-computer architectures and programming models. Raghavan has a BEng in electronics and communications engineering from R.V. College of Engineering, India.

Laurel Emurian is a PhD candidate in the Department of Computer and Information Science at the University of Pennsylvania. Her research interests include energy-efficient systems and mobile architecture. Emurian has an MSE in computer and information science from the University of Pennsylvania.

Lei Shao is a PhD candidate in the Department of Mechanical Engineering at the University of Michigan. His research interests include microscale heat transfer and energy conversion. Shao has an MS in mechanical engineering from the University of Michigan.

Marios Papaefthymiou is a professor in the Department of Electrical Engineering and Computer Science and Chair of Computer Science and Engineering at the University of Michigan. He is also a cofounder and chief scientist of Cyclos Semiconductor, a start-up company specializing in energy-efficient chips for power-critical applications. Papaefthymiou has a PhD in electrical engineering and computer science from the Massachusetts Institute of Technology.

Kevin P. Pipe is an associate professor in the Department of Mechanical Engineering and holds joint appointments with the Applied Physics Program and Electrical Engineering and Computer Science Department at the University of Michigan. His research interests include microscale heat transfer, optoelectronic devices, thermoelectric energy conversion, scanning-probe techniques, photovoltaic energy conversion, and organic and hybrid organic and inorganic devices. Pipe has a PhD in electrical engineering from the Massachusetts Institute of Technology.

Thomas F. Wenisch is the Morris Wellman Faculty Development Assistant Professor of Electrical Engineering and Computer Science at the University of Michigan and a member of the Advanced Computer Architecture Lab. His research focuses on computer architecture with emphasis on multiprocessor and multicore systems, multicore programmability, smartphone architecture, datacenter architecture, and performance evaluation methodology. Wenisch has a PhD in electrical and computer engineering from Carnegie Mellon University.

Milo M. K. Martin is an associate professor in the Department of Computer and Information Science at the University of Pennsylvania. He coleads Penn’s Computer Architecture and Compilers Group. His research interests include multiprocessor and multicore computer architecture, compiler and hardware support for security, and programming models for next-generation architectures. Martin has a PhD in computer science from the University of Wisconsin-Madison.

Direct questions and comments about this article to Arun Raghavan, LVN 302, 3330 Walnut St., Philadelphia, PA 19104; [email protected].