Temperature-Aware Idle Time Distribution for Leakage Energy Optimization

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 7, JULY 2012 1187


Min Bao, Alexandru Andrei, Petru Eles, Member, IEEE, and Zebo Peng, Senior Member, IEEE

Abstract: Large-scale integration with deep sub-micron technologies has led to high power densities and high chip working temperatures. At the same time, leakage energy has become the dominant energy consumption source of circuits due to reduced threshold voltages. Given the close interdependence between temperature and leakage current, temperature has become a major issue to be considered for power-aware system level design techniques. In this paper, we address the issue of leakage energy optimization through temperature aware idle time distribution (ITD). We first propose an offline ITD technique to optimize leakage energy consumption, where only static idle time is distributed. To account for the dynamic slack, we then propose an online ITD technique where both static and dynamic idle time are considered. To improve the efficiency of our ITD techniques, we also propose an analytical temperature analysis approach which is accurate and, yet, sufficiently fast to be used inside the energy optimization loop.

Index Terms: Idle time distribution (ITD), leakage energy optimization, system level design, temperature aware design.

I. INTRODUCTION

A. Background

TECHNOLOGY scaling and ever increasing demand for performance have resulted in high power densities in current circuits, which have also led to increased chip temperature. Due to the strong dependence of leakage current on temperature, growing temperature leads to an increase in leakage current and, consequently, energy, which, again, produces higher temperature. Thus, temperature is an important parameter to be taken into consideration for energy optimization.

Energy optimization for embedded systems has been extensively researched. At system level, dynamic voltage selection (DVS) is one of the preferred approaches for reducing the overall energy consumption [1], [2]. This technique exploits the available slack time to achieve energy efficiency by reducing the supply voltage and frequency such that the execution of tasks is stretched within their deadline.

There are two types of slacks: 1) static slack, which is due to the fact that, when executing at the highest (nominal) voltage

Manuscript received November 12, 2010; revised March 28, 2011; accepted May 10, 2011. Date of publication July 12, 2011; date of current version June 01, 2012.

M. Bao, P. Eles, and Z. Peng are with the Department of Computer and Information Science, Linköping University, Linköping 58183, Sweden (e-mail: [email protected]).

A. Andrei is with the Department of Computer and Information Science, Linköping University, Linköping 58183, Sweden, and also with Ericsson AB, Linköping 58183, Sweden.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TVLSI.2011.2157542

level, tasks finish before their deadlines even when executing their worst numbers of cycles (WNC) and 2) dynamic slack, due to the fact that most of the time tasks execute less than their WNC. Offline DVS techniques [3] can only exploit static slack, while online approaches [7], [8] are able to further reduce energy consumption by exploiting the dynamic slack.

However, very often, not all available slack should or can be exploited and certain slack may still exist after DVS. An obvious situation is when the lowest supply voltage is such that, even if selected, a certain slack interval is left. Another reason is the existence of the critical voltage [6]. To achieve the optimal energy efficiency, DVS would not execute a task at a voltage lower than the critical one, since, otherwise, the additional static energy consumed due to the longer execution time is larger than the energy saving due to the lowered voltage. During the available slack interval, the processor remains idle and can be switched to a low power state.

Due to the strong inter-dependence between leakage power and temperature, different distributions of idle time will lead to different temperature distributions and, consequently, energy consumption. In this paper, we address the issue of optimizing leakage energy consumption through distribution of both static and dynamic slack time.

B. System Level Temperature Modeling

Temperature aware system level design methods rely on the availability of temperature modeling and analysis tools. System level temperature modeling approaches are mostly based on the duality between heat transfer and electrical phenomena. Hotspot [7] is both an architectural level and system level temperature simulator. The basic idea of Hotspot is to build an equivalent circuit of thermal resistances and capacitances capturing both the architecture blocks and the elements of the thermal package. In [8], a similar temperature modeling approach was proposed which speeds up the thermal analysis through dynamic adaptation of the resolution.

However, temperature analysis time with approaches like the two mentioned above is too long to be affordable inside a temperature aware system level optimization loop. There has been some work on establishing fast system level temperature analysis techniques. They also build on the duality between heat transfer and electrical phenomena and are based on very restrictive assumptions in order to simplify the model. In [9], the authors have assumed that: 1) no cooling layer is present; 2) there is no interdependency between leakage current and temperature; and 3) the whole application executes at a constant voltage. The models in [10] and [11] consider variable voltage levels but maintain the first two limitations above. The most general


analytical model is proposed in [12] which considers cooling layers as well as the dependency between leakage and temperature. However, this approach is limited to the case of a unique voltage level throughout the application. In order to support our ITD technique proposed in this paper, we introduce a fast and accurate temperature analysis technique, which eliminates all three limitations mentioned above and can be used inside a temperature aware system level optimization loop.

C. Related Work

Several approaches to system level temperature aware design have been discussed in literature. Temperature management is utilized to control the temperature of processors for improving system reliability [13]. In [14], the authors proposed a technique for temperature management by scaling the processor speed. In [15], the authors addressed the issue of scheduling and mapping of a set of tasks with real-time constraints on multi-processors for peak temperature minimization. Techniques for task sequencing combined with DVS to reduce the peak temperature of a processor were proposed in [10]. Several approaches aiming at reducing temperature variations or temperature gradients across the chip, e.g., [16], were proposed.

A considerable amount of work has been published on performance optimization under thermal and real-time constraints. Zhang et al. [17] proposed voltage assignment techniques to optimize the performance of a set of periodic tasks working under thermal constraints. In [18], the authors proposed approaches to optimize throughput by task sequencing under thermal constraints. An online speed adaptation technique for homogeneous multi-processors with the target of maximizing total throughput was proposed by Rao et al. in [19]. Temperature aware DVS techniques considering the leakage/temperature dependency were proposed in [2] and [20].

In this paper we address the issue of optimizing leakage energy consumption through distribution of idle time. The only work, to our best knowledge, previously addressing this issue is [21] and [22]. In [21], the authors proposed an approach to distribute idle time for applications consisting of one single task executing at a constant given supply voltage. Thus, their approach cannot optimize the distribution of idle time among multiple tasks which also execute at different voltages. The same limitation also holds for [22], where a pattern-based ITD for leakage energy optimization considering one single task was proposed. The pattern-based approach generates uniform idle time distribution over the whole application and, thus, is not appropriate for ITD among multi-task applications where tasks have different amounts of energy consumption and execute at different voltage levels.

D. Main Contributions

In this paper, we make the following main contributions.¹

1) We propose an offline ITD approach to optimize leakage energy consumption for a set of periodic tasks. Static slack is distributed globally among tasks which are executed at different discrete voltage levels.

2) We propose, based on the offline ITD approach, an online ITD technique where both static and dynamic slack are distributed.

3) We propose a fast and accurate analytical temperature model which eliminates all the three limitations mentioned in Section I-B, by considering the following aspects: a) the interdependence between leakage power and temperature; b) multiple cooling layers of the chip; c) non-smooth power consumption generated due to multiple discrete supply voltage levels of the processor.

¹Preliminary results regarding the offline ITD approach have been published in [23].

E. Paper Organization

In Section II we introduce the power and application models. In Section III we give a motivational example. We formulate the problem in Section IV. In Section V we introduce our analytical thermal model. We then propose the offline ITD approach, which distributes only static slack, in Section VI. Based on the offline ITD approach, we present our online ITD technique in Section VII. Finally, experimental results and conclusions are presented in Sections VIII and IX.

II. PRELIMINARIES

A. Power Model

For dynamic power we use the following equation [24]:

$$P_{dyn} = C_{eff} \cdot f \cdot V_{dd}^{2}$$

where $C_{eff}$, $V_{dd}$, and $f$ denote the effective switched capacitance, supply voltage, and frequency, respectively. The leakage power is expressed as follows [25]:

$$P_{leak} = I_{sr} \cdot T^{2} \cdot e^{\frac{\beta \cdot V_{dd} + \gamma}{T}} \cdot V_{dd} \qquad (1)$$

where $I_{sr}$ is the leakage current at a reference temperature, $T$ is the current temperature, and $\beta$ and $\gamma$ are technology dependent coefficients. In Section V-B we will use a piecewise linear approximation of this model, as proposed, for example, in [26]. According to it, the working temperature range $[T_a, T_{max}]$, where $T_a$ and $T_{max}$ are the ambient and the maximal working temperature of the chip, is divided into several sub-ranges. The leakage power inside each sub-range $[T_i, T_{i+1}]$ is modeled by a linear function: $P_{leak_i} = K_i \cdot T + B_i$, where $K_i$ and $B_i$ are constants characteristic to each interval.
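The power model above can be sketched in a few lines of Python. This is only an illustration: it assumes the leakage expression has the form $I_{sr} \cdot T^2 \cdot e^{(\beta V_{dd} + \gamma)/T} \cdot V_{dd}$, the function names are hypothetical, and the piecewise linear fit simply interpolates the leakage curve at the end points of equally wide sub-ranges, which is one of several possible ways to obtain the per-range constants.

```python
import math


def dynamic_power(c_eff, v_dd, f):
    """Dynamic power C_eff * f * V_dd^2, in watts."""
    return c_eff * f * v_dd ** 2


def leakage_power(i_sr, beta, gamma, v_dd, temp):
    """Leakage power for the assumed exponential model; temp in kelvin."""
    return i_sr * temp ** 2 * math.exp((beta * v_dd + gamma) / temp) * v_dd


def piecewise_linear_fit(p_leak, t_lo, t_hi, n_ranges):
    """Split [t_lo, t_hi] into n_ranges equal sub-ranges.

    Returns one (T_start, T_end, K, B) tuple per sub-range, where the line
    K * T + B matches p_leak at both end points of that sub-range.
    """
    width = (t_hi - t_lo) / n_ranges
    segments = []
    for j in range(n_ranges):
        a = t_lo + j * width
        b = t_lo + (j + 1) * width
        k = (p_leak(b) - p_leak(a)) / (b - a)
        segments.append((a, b, k, p_leak(a) - k * a))
    return segments
```

With a fixed supply voltage, `piecewise_linear_fit(lambda t: leakage_power(i_sr, beta, gamma, v_dd, t), t_a, t_max, 4)` would yield four `(K, B)` pairs; any coefficient values passed in are placeholders, not technology data from the paper.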

B. Application Model

The application is captured as a task graph $G(\Pi, \Gamma)$. A node $\tau_i \in \Pi$ represents a computational task, while an edge $e \in \Gamma$ indicates the data dependency between two tasks. Each task $\tau_i$ is characterized by the following six-tuple:

$$\tau_i = \langle WNC_i, BNC_i, ENC_i, V_{dd_i}, dl_i, C_{eff_i} \rangle$$

where $WNC_i$, $BNC_i$, and $ENC_i$ are task $\tau_i$'s worst case, best case and expected number of clock cycles to be executed. The expected number of clock cycles $ENC_i$ is the arithmetic mean value of the probability density function of the actual executed cycles, i.e., $ENC_i = \sum_{j=BNC_i}^{WNC_i} j \cdot p_i(j)$, where $p_i(j)$ is the probability that a number $j$ of clock cycles are executed by task $\tau_i$. We assume that the probability density functions of


TABLE I. MOTIVATIONAL EXAMPLE: APPLICATION PARAMETERS

Fig. 1. Motivational example: static idle time distribution. (a) First ITD. (b) Second ITD.

the execution cycles of different tasks are independent. $V_{dd_i}$ represents the supply voltage at which the task $\tau_i$ is executed. The supply voltage can be either constant for all tasks, or it can be calculated by a DVS algorithm, e.g., our temperature aware DVS technique proposed in [20]. Further, $dl_i$ and $C_{eff_i}$ represent the deadline and the effective switched capacitance.

The application is mapped and scheduled on a processor which has two power states: active and idle. In the active state the processor can operate at several discrete supply voltage levels. When the processor does not execute any task, it can be put to the idle state, consuming a very small amount of leakage power $P_{idle}$.
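As a small illustration of the application model, the sketch below stores the six task attributes in a Python dataclass and computes the expected number of cycles as the mean of a discrete cycle distribution. The class and field names are hypothetical, and representing the distribution as a dictionary from cycle count to probability is an assumption made only for this example.

```python
from dataclasses import dataclass


@dataclass
class Task:
    wnc: int         # worst case number of clock cycles
    bnc: int         # best case number of clock cycles
    enc: float       # expected number of clock cycles
    v_dd: float      # supply voltage the task executes at (V)
    deadline: float  # deadline (s)
    c_eff: float     # effective switched capacitance (F)


def expected_cycles(pmf):
    """Mean of a discrete distribution given as {cycles: probability}."""
    if abs(sum(pmf.values()) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(cycles * prob for cycles, prob in pmf.items())


def execution_time(cycles, frequency):
    """Execution time in seconds for a cycle count at a clock frequency in Hz."""
    return cycles / frequency
```

For example, a task that executes 100 or 200 cycles with equal probability has `expected_cycles({100: 0.5, 200: 0.5})` equal to 150 cycles.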

III. MOTIVATIONAL EXAMPLE

A. Static Idle Time Distribution

Let us consider an application consisting of seven tasks which share a global deadline of 96.85 ms. The worst case workload, $WNC$ (in clock cycles), and average switched capacitance, $C_{eff}$, are given in Table I. The tasks run on a processor with a fixed supply voltage and frequency of 0.6 V and 132 MHz, respectively. The corresponding worst case execution times are given in Table I. Based on the performance of this processor, there exists 6 ms static slack, $t_s$, in each execution period of this application. Fig. 1 gives two ways of distributing $t_s$. The first distribution (first ITD), as shown in Fig. 1(a), places the whole $t_s$ after the last task, while the second distribution (second ITD), in Fig. 1(b), divides the static slack into three segments and places the three idle slots after the execution of three different tasks.

For simplicity, in this example, we ignore both energy and time overhead due to switching between the active and idle mode. The two different ITDs will lead to different temperature and leakage power profiles. The average working temperature of each task, as well as the leakage energy consumption, are shown in Table II, together with the total leakage energy consumption of the whole application. Comparing the total leakage energy for the first and second ITD, we can observe that around 10% reduction of leakage energy consumption can be achieved.

The leakage energy reduction is due to the modified working temperature of the chip which has a strong impact on the leakage power. It is also important to mention that the table reflects the steady state (not the startup mode), for which energy minimization is targeted. This means that the starting temperature for the first task is identical to the temperature at the end of the previous period.

TABLE II. STATIC ITD: LEAKAGE ENERGY COMPARISON

TABLE III. MOTIVATIONAL EXAMPLE: AN ACTIVATION SCENARIO

B. Dynamic ITD

The ITD approach outlined in the previous section is an offline static one which assumes that tasks execute their WNC and, thus, it only distributes the static slack. However, in reality, most of the time, there are huge variations in the number of cycles executed by a task, from one activation to the other, which leads to a large amount of dynamic slack. For the task set introduced in the previous section, let us imagine the activation scenario given in Table III, which lists the actual executed workload (in clock cycles) and the corresponding actual execution time of each task. The dynamic slack generated by a task is the difference between its worst case execution time and its actual execution time. For this activation scenario, some of the tasks execute their worst case workload, while the others execute less than their worst case workload and, thus, generate dynamic slack. The total amount of dynamic slack is 18.7 ms.

Fig. 2(a) illustrates the distribution of idle time slots during the above online activation scenario if we use the offline ITD approach which distributes static slack as illustrated in Fig. 1(b). In this case, the dynamic slack is placed where it is generated (the slack generated by a task is placed after that task terminates). Table IV shows the corresponding working temperature and leakage energy consumption of each task as well as the total leakage energy consumption, which is 7.98 J. However, leakage energy can be reduced by distributing the dynamic slack more wisely. For example, at run-time, whenever a task terminates, the idle time slot length following this task is calculated by taking into consideration the current time and the current chip temperature. Fig. 2(b) shows the ITD determined in this way. The corresponding total leakage energy consumed, as given in Table IV, is 7.32 J which means a leakage energy reduction of 8%. This reduction is due to the further lowered working temperature of the energy hungry tasks, which is achieved by ITD considering both static and dynamic slack.

Fig. 2. Motivational example: ITD. (a) First ITD: Execution scenario with only static ITD. (b) Second ITD: Execution scenario with both static and dynamic ITD.

TABLE IV. DYNAMIC ITD: LEAKAGE ENERGY COMPARISON
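The quoted 8% follows directly from the two totals reported for Table IV:

```latex
\frac{7.98\,\mathrm{J} - 7.32\,\mathrm{J}}{7.98\,\mathrm{J}}
  = \frac{0.66}{7.98} \approx 0.083 \approx 8\%
```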

IV. PROBLEM FORMULATION

We consider a set of periodic tasks $\{\tau_1, \tau_2, \ldots, \tau_n\}$ executed in the order $\tau_1, \tau_2, \ldots, \tau_n$. For each task $\tau_i$, the six-tuple $\langle WNC_i, BNC_i, ENC_i, V_{dd_i}, dl_i, C_{eff_i} \rangle$ is given. Corresponding to the supply voltage $V_{dd_i}$ that task $\tau_i$ is executed at, the worst case execution time $WNET_i$, best case execution time $BNET_i$, and expected execution time $ENET_i$ can be directly calculated.

For each iteration of the application, the total static slack $t_s$ is constant and computed by (2)

$$t_s = dl_n - \sum_{i=1}^{n} WNET_i \qquad (2)$$

where $dl_n$ represents the deadline of the last task in the execution order, and $\sum_{i=1}^{n} WNET_i$ is the sum of the worst case execution time of all tasks. The total dynamic slack $t_d$ for each execution iteration is varying due to execution time variation of tasks. For one iteration, $t_d$ is calculated as follows:

$$t_d = \sum_{i=1}^{n} (WNET_i - AET_i)$$

where $AET_i$ represents the actual execution time of task $\tau_i$ in this iteration. $AET_i$ conforms to a distribution with the expected execution time $ENET_i$ as the arithmetic mean value of the probability density function.

The total available slack $t_{idle}$ for one iteration is equal to the sum of the static slack and dynamic slack. During $t_{idle}$ the processor can be switched to idle mode consuming the power $P_{idle}$. The time and energy overhead for switching the processor to and from the idle state are $t_o$ and $E_o$, respectively. Idle slots can be placed after the execution of any task. The length of an idle slot after task $\tau_i$ is denoted as $t_i$, and the sum of all idle slots $\sum_{i=1}^{n} t_i$ should be equal to the total available idle time $t_{idle}$. Note that the time overhead $t_o$ is included in the slot length $t_i$.

We formulate the following two ITD problems.

1) ITD with only static slack: static idle time distribution (SITD).

2) ITD with both static and dynamic slack: static and dynamic idle time distribution (DITD).
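The slack definitions above amount to two short sums, sketched here in Python. The function names are hypothetical. In the usage values used to check the sketch, only the deadline (96.85 ms), the static slack (6 ms) and the total dynamic slack (18.7 ms) come from the motivational example; the per-task execution times are invented placeholders chosen to reproduce those totals.

```python
def static_slack(deadline_last, wnet):
    """Static slack: deadline of the last task minus the sum of all
    worst case execution times."""
    return deadline_last - sum(wnet)


def dynamic_slack(wnet, aet):
    """Dynamic slack of one iteration: sum over all tasks of worst case
    minus actual execution time."""
    return sum(w - a for w, a in zip(wnet, aet))


def total_idle_time(deadline_last, wnet, aet):
    """Total available idle time: static plus dynamic slack."""
    return static_slack(deadline_last, wnet) + dynamic_slack(wnet, aet)
```

With placeholder worst case times summing to 90.85 ms and actual times that are 18.7 ms shorter in total, the three functions return roughly 6 ms, 18.7 ms and 24.7 ms.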

A. ITD With Only Static Slack: SITD

Let us consider the scenario in which each task $\tau_i$ is always executed with the worst case workload $WNC_i$. In this scenario, for each iteration, the available slack is constant and known: $t_{idle} = t_s$, where $t_s$ is computed by (2).

For one iteration, the total energy consumption of the task set can be expressed as follows:

$$E_{tot} = \sum_{i=1}^{n} E^{d}_{i} + \sum_{i=1}^{n} E^{l}_{i} + \sum_{i=1}^{n} x_i \cdot E_o + E^{idle}$$

where $\sum_{i=1}^{n} E^{d}_{i}$ and $\sum_{i=1}^{n} E^{l}_{i}$ are the total dynamic and leakage energy of all tasks. $\sum_{i=1}^{n} x_i \cdot E_o$ is the total energy overhead when the processor is switched to/from idle state, where $x_i$ is a binary variable indicating whether task $\tau_i$ is followed or not by an idle slot. $E^{idle}$ is the total energy consumption during the idle time $t_{idle}$.

The dynamic energy consumption of each task $\tau_i$, $E^{d}_{i}$, is further computed as

$$E^{d}_{i} = C_{eff_i} \cdot f_i \cdot V_{dd_i}^{2} \cdot WNET_i$$

where $V_{dd_i}$ is the supply voltage the task is executed at, with $f_i$ the corresponding frequency. $WNET_i$ represents the worst case execution time of task $\tau_i$. As the supply voltage and $WNET_i$ are constants, the total dynamic energy $\sum_{i=1}^{n} E^{d}_{i}$ is hence constant and independent from the distribution of idle time. The total energy consumption during idle time is

$$E^{idle} = P_{idle} \cdot t_{idle}$$

where $P_{idle}$ is the power consumption of the processor in the low power mode and $t_{idle} = t_s$. Similar to $\sum_{i=1}^{n} E^{d}_{i}$, $E^{idle}$ is also fixed and independent from ITD, as $t_s$ is constant with given supply voltages.

The leakage energy consumption of each task $\tau_i$ is a function of both temperature and supply voltage

$$E^{l}_{i} = \int_{0}^{WNET_i} P_{leak}(V_{dd_i}, T_i(t)) \, dt \qquad (3)$$

where $T_i(t)$ describes the temperature of the processor during execution of task $\tau_i$. With given supply voltages, $T_i(t)$ is influenced by the distribution of idle time slots, so the leakage energy consumption depends on the ITD.

We need to distribute the static slack to minimize the total leakage energy consumption and the energy overheads due to switching: $\sum_{i=1}^{n} (E^{l}_{i} + x_i \cdot E_o)$. With given supply voltages, and a fixed distribution of idle time slots, the same power pattern is periodically executed on the processor. As the task set is executed for a large number of iterations, the processor temperature is converging to a steady state dynamic temperature curve (SSDTC). Once the processor has reached steady state, the SSDTC will repeat periodically.

Our SITD problem can be formulated as follows: given is a set of tasks $\{\tau_1, \tau_2, \ldots, \tau_n\}$ as defined earlier in this section. The idle time slot length $t_i$ following each task $\tau_i$ and, implicitly, $x_i$ (the binary variable which represents whether task $\tau_i$ is followed by an idle time slot or not) are to be determined such that the objective function (4) is minimized with the constraints (5)-(7) to be satisfied.

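Problem Formulation 1: Static Idle Time Distribution

Minimize:

$$\sum_{i=1}^{n} (E^{l}_{i} + x_i \cdot E_o) \qquad (4)$$

subject to:

$$\sum_{i=1}^{n} t_i = t_s \qquad (5)$$

$$\sum_{k=1}^{i} WNET_k + \sum_{k=1}^{i-1} t_k \le dl_i, \quad 1 \le i \le n \qquad (6)$$

$$T_i(t) \le T_{max}, \quad 1 \le i \le n \qquad (7)$$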

$E^{l}_{i}$ in (4) represents the steady-state leakage energy consumption. The constraint in (5) requires that the sum of all idle slot lengths should be equal to the total available static slack $t_s$, where $t_s$ is calculated by (2). The constraint in (6) guarantees that the deadline of each task is satisfied. Finally, the constraint in (7) requires that the processor temperature throughout the execution of the task set should not exceed the maximal allowable working temperature of the chip $T_{max}$, where $T_i(t)$ describes the processor temperature during execution of task $\tau_i$.

B. ITD With Both Static and Dynamic Slack: DITD

The above problem formulation ignores the execution time variations of tasks at run-time and, implicitly, ignores the dynamic slack. To deal with execution time variation and perform dynamic slack distribution, the idle slot length following the termination of a task should be determined, at run-time, based on the actual time and processor temperature.

Our problem formulation for DITD is as follows: given is a set of periodic tasks $\{\tau_1, \tau_2, \ldots, \tau_n\}$ as defined earlier in this section. When task $\tau_i$ terminates at time $t_c$, the idle time slot $t_i$ following task $\tau_i$'s termination is determined such that (8) is minimized, where the quantity of interest is the total leakage energy consumption of the remaining tasks $\tau_{i+1}, \ldots, \tau_n$, to be executed within the current iteration.

Problem Formulation 2 Dynamic Idle Time Distribution

Minimize:

(8)

subject to:

(9)

(10)(11)

The leakage energy consumption of each remaining task is estimated corresponding to the case when the expected workload is executed. It is calculated according to (3) with the difference that the expected execution time is used instead of the worst case execution time as the upper limit for the integral. The constraint in (9) requires that the sum of all idle slot lengths should be equal to the total available slack, computed starting from $t_c$, the time the current task terminates. The total available slack is computed with the assumption that all the future tasks $\tau_{i+1}$ to $\tau_n$ are executed with their expected workload. The deadline of each task is guaranteed by the constraint in (10). Note that the worst case execution time is used in (10) in order to guarantee the deadline of each task in the worst case. The constraint in (11) requires, similarly to (7), the processor temperature during execution of the task set to be lower than the maximal allowable working temperature of the chip $T_{max}$.
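The two run-time quantities described above can be sketched as follows. This is an interpretation rather than the formulation in (9) and (10): the available slack is estimated with expected execution times, while the longest idle slot that is still safe is bounded with worst case execution times, assuming no further idle time is inserted later. The function names and the argument layout are hypothetical.

```python
def expected_slack(t_now, deadline_last, enet_remaining):
    """Slack available from t_now on, assuming every remaining task takes
    its expected execution time."""
    return deadline_last - t_now - sum(enet_remaining)


def max_safe_idle(t_now, wnet_remaining, deadlines_remaining):
    """Longest idle slot that can start at t_now such that every remaining
    task still meets its deadline when all of them run for their worst
    case execution time and no further idle time is inserted."""
    finish = t_now
    margin = float("inf")
    for wnet, deadline in zip(wnet_remaining, deadlines_remaining):
        finish += wnet
        margin = min(margin, deadline - finish)
    return max(0.0, margin)
```

An online policy in this spirit would pick the idle slot length between zero and `min(expected_slack(...), max_safe_idle(...))`; how the value inside that range is chosen is exactly the optimization problem the section formulates.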

V. TEMPERATURE ANALYSIS

A. Temperature Model

Thermal Circuit: In order to analyze the thermal behavior, we build an equivalent RC thermal circuit based on the physical parameters of the die and the package. Due to the fact that the application period can safely be considered significantly smaller than the RC time of the heat sink, which, usually, is in the order of minutes [27], the heat sink temperature stays constant after the state corresponding to the SSDTC is reached. For SSDTC estimation, we, hence, can ignore the thermal capacitance (not the thermal resistance!) of the heat sink and build the 2-RC thermal circuit shown in Fig. 3(a). $B_1$ and $B_2$ represent the temperature node for the die and the heat spreader respectively. $P(t)$ stands for the processor power consumption as a function of time. We obtain the values of $R_1$, $C_1$, $R_2$, and $C_2$ from an RC network similar to the one constructed in Hotspot [7]. $R_1$ is calculated as the sum of the thermal resistance of the die and the thermal interface material (TIM), and $C_1$ as the sum of the thermal capacitance of the die and the TIM. $R_2$ is the equivalent thermal resistance from the heat spreader to the ground through the heat sink, and $C_2$ is the equivalent thermal capacitance of the heat spreader layer.

When the application period is significantly smaller than the RC time of the heat spreader in the 2-RC thermal circuit, the heat spreader temperature stays constant after SSDTC is reached. In this case, we can simplify the 2-RC to a 1-RC thermal circuit [see Fig. 3(b)].

Fig. 3. Thermal circuit. (a) 2-RC thermal circuit. (b) 1-RC thermal circuit.

Temperature Equations: For the 2-RC thermal circuit in Fig. 3(a), we can describe the temperatures of $B_1$ and $B_2$ as follows:

(12)

(13)

where $T_{die}$ and $T_{sp}$ represent the temperatures at $B_1$ and $B_2$, respectively. The power consumption $P(t)$ is the sum of the dynamic and leakage power, which are dependent on the supply voltage and the die temperature.

If, within a time interval, the power consumption stays constant, the temperatures of $B_1$ and $B_2$ at the end of the interval can be expressed, by solving (12) and (13), as linear functions of the temperatures of $B_1$ and $B_2$ at the beginning of the time interval. The coefficients of these functions are constants determined by the length of the time interval, and by the values of $R_1$, $R_2$, $C_1$, and $C_2$

(14)(15)
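The behavior of such a 2-RC circuit under constant power can be illustrated numerically. The sketch below is not the closed-form solution referred to by (14) and (15); it assumes the usual heat balance for the two nodes (power enters the die node, heat flows from die to spreader through one resistance and from spreader to ambient through the other) and integrates it with small explicit Euler steps. All names and parameter values are hypothetical.

```python
def simulate_2rc(power, r1, c1, r2, c2, t_amb, t_die0, t_sp0, duration, dt=1e-4):
    """Integrate an assumed 2-RC thermal circuit under constant power.

    r1, c1: thermal resistance (K/W) and capacitance (J/K) of the die node.
    r2, c2: the same for the heat spreader node, which leaks to ambient.
    Returns (die temperature, spreader temperature) after `duration` seconds.
    """
    t_die, t_sp = t_die0, t_sp0
    for _ in range(int(round(duration / dt))):
        flow = (t_die - t_sp) / r1                # heat flow die -> spreader (W)
        d_die = (power - flow) / c1               # die node heat balance
        d_sp = (flow - (t_sp - t_amb) / r2) / c2  # spreader node heat balance
        t_die += dt * d_die
        t_sp += dt * d_sp
    return t_die, t_sp
```

For a sufficiently long interval the result approaches the steady state of this assumed circuit, with the spreader at ambient plus `r2 * power` and the die a further `r1 * power` above it, which is a convenient sanity check for any closed-form coefficients.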

B. SSDTC Estimation

As an input to the SSDTC calculation we have the voltage levels, calculated by the DVS algorithm, and a given idle time distribution, as illustrated in Fig. 4(a).

In Fig. 4(a), we divide the execution interval of each active state step into several sub-intervals. The total number of sub-intervals is denoted as $m$. Each sub-interval is short enough such that the temperature variation is small and the leakage power can be treated as constant inside the sub-interval.

$P_i$ is the power consumption for each sub-interval $i$. When the processor is in the active state during the $i$th sub-interval, $P_i$ is computed by (16), where $V_{dd_i}$ and $T_{die_i}$ are the supply voltage and processor temperature at the start of the $i$th sub-interval

$$P_i = C_{eff_i} \cdot f_i \cdot V_{dd_i}^{2} + (K_i \cdot T_{die_i} + B_i) \qquad (16)$$

$C_{eff_i} \cdot f_i \cdot V_{dd_i}^{2}$ represents the dynamic power consumption while $K_i \cdot T_{die_i} + B_i$ represents the leakage power consumption based on the piecewise linear leakage model discussed in Section II-A. When the processor is in idle state during the $i$th sub-interval, the power consumption is $P_i = P_{idle}$.

Fig. 4. Temperature analysis. (a) Voltage pattern. (b) Steady-state dynamic temperature curve.

As shown in Fig. 4(b), we construct the SSDTC by calculating the temperature values at the boundaries of all sub-intervals. The relationship between the start and end temperature of each sub-interval can be described by applying (14) and (15) to all sub-intervals. Thus, we can establish a linear system of equations as shown by (17)-(20), where $T_{die_i}$ and $T_{sp_i}$ are the temperatures at the beginning of the $i$th sub-interval

(17)(18)

(19)(20)

Due to periodicity, when dynamic steady state is reached, the processor and heat spreader temperature at the beginning of the period should be equal to the temperature at the end of the previous period

(21)

Solving the above linear system (17)-(21), we get the temperature values at all sub-interval boundaries and, hence, obtain the corresponding SSDTC. As this system is a tridiagonal linear system, it can be solved efficiently, e.g., through LU decomposition with only $O(m)$ operations, where $m$ is the number of sub-intervals [28]. It should be mentioned that, in fact, two SSDTCs are obtained, one reflecting the temperature of the chip, and the other based on that of the heat spreader.
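The idea of closing the temperature recurrence with a periodicity condition can be shown on a deliberately simplified case. The sketch below assumes a single thermal node (one resistance and one capacitance to ambient) instead of the two-node circuit, and it solves the periodic condition by composing the per-sub-interval affine maps rather than by LU decomposition. As in the text, leakage in each sub-interval is the linear function `k * T_start + b` evaluated at the start temperature. All names and values are hypothetical.

```python
import math


def ssdtc_single_node(intervals, r_th, c_th, t_amb):
    """Periodic steady-state temperatures for a single-node thermal model.

    intervals: list of (duration, p_dyn, k, b); the power in a sub-interval
    is p_dyn + k * T_start + b and is held constant inside the sub-interval.
    Returns [T_0, ..., T_m] with T_m equal to T_0 (periodicity).
    """
    maps = []
    for duration, p_dyn, k, b in intervals:
        a = math.exp(-duration / (r_th * c_th))
        # End temperature as an affine function of start temperature:
        # T_end = g * T_start + h
        g = a + (1.0 - a) * r_th * k
        h = (1.0 - a) * (t_amb + r_th * (p_dyn + b))
        maps.append((g, h))
    # Compose all maps over one period: T_m = big_g * T_0 + big_h
    big_g, big_h = 1.0, 0.0
    for g, h in maps:
        big_g, big_h = g * big_g, g * big_h + h
    if big_g >= 1.0:
        raise ValueError("no stable periodic solution (thermal runaway)")
    temps = [big_h / (1.0 - big_g)]  # periodicity: T_0 = T_m
    for g, h in maps:
        temps.append(g * temps[-1] + h)
    return temps
```

With two state variables per boundary, as in the 2-RC circuit, the same periodicity condition leads to the banded linear system described above instead of a scalar fixed point.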

C. Transient Temperature Curve Estimation

The temperature calculated in the previous section (SSDTC) corresponds to the dynamic steady state reached after a sufficient number of iterations have been executed. However, the same technique can be used to calculate any transient temperature curve (TTC), corresponding to an arbitrary time interval, as long as the length of the time interval is significantly smaller than the RC time of the heat sink (which is in the order of minutes). Under this assumption, as discussed earlier in this section, the thermal model in Fig. 3 can be used. The only difference relative to the SSDTC calculation is that (21) is no longer valid. To estimate the TTC, the temperatures of $B_1$ and $B_2$ at the beginning of the time interval are given as input. The temperature values at the remaining sub-interval boundaries are calculated by solving (17)-(20).

VI. ITD WITH ONLY STATIC SLACK (SITD)

In this section we discuss our solutions to the SITD problem, as formulated in Section IV-A, which only considers static slack. We first introduce our approach ignoring the overheads $t_o$ and $E_o$ in Section VI-A. This approach will be used in Section VI-B where a general SITD technique is presented.

A. SITD Without Overhead (SITDNOH)

Since, in this section, we ignore the overheads ($t_o = 0$ and $E_o = 0$), it results from (4) that the cost to be minimized is $\sum_{i=1}^{n} E^{l}_{i}$, which is the total leakage energy consumed during task execution.

Assuming that the execution interval of task $\tau_i$ is divided into several sub-intervals, the leakage energy consumption of $\tau_i$ is the sum of the leakage energy consumed in all its sub-intervals

(22)

where the terms of the sum depend on the processor SSDTC temperatures at the beginning and end of the $j$th sub-interval and on the length of this sub-interval. The model in (1) is used to compute the leakage power in each sub-interval.

Let us first assume that the chip (as well as the heat spreader) temperature at the termination of each task is known and is independent of the starting temperature of the task. Under this assumption, we can formulate SITDNOH as a convex nonlinear problem shown in (23)-(36), where the objective function to be minimized is the total leakage energy for all tasks. The optimization variables to be calculated are the idle slot lengths $t_i$.

Formulation 1 SITD with No Overheads Consideration

Minimize:

(23)

Subject to:(24)

(25)

(26)

(27)(28)

(29)

(30)

(31)(32)

(33)(34)(35)

(36)

Equation (25) requires the sum of the idle slot lengths to be equal to the total available idle time. Equation (26) guarantees that the deadline of each task is satisfied. As mentioned above, the processor and heat spreader temperatures at the end of each task are considered known and are assigned by (27) and (28), respectively, to given constants. The processor temperatures at the beginning and end of each sub-interval in the execution of a task are given by (29), similar to (17) and (19) in Section V-B. Equation (30) describes the same relationship for the heat spreader temperature. The processor and heat spreader temperatures at the start of a task depend on the finishing temperature of the previous task and on the idle slot placed after that previous task. If we assume that all idle slots are significantly shorter than the RC time of the heat spreader, then we can describe the processor temperature behavior during the idle slot by (31) and (33), based on the 1-RC thermal circuit described in Section V-A. The steady-state temperature used there is the temperature that the processor would reach if the idle-state power were consumed for a sufficiently long time; it is calculated according to (35). The corresponding thermal resistance is the sum of the two thermal resistances in Fig. 3(b). Under the same assumption as above, the heat spreader temperature stays constant during the idle slot, as shown in (32) and (34).² Equations (31) and (32) calculate the processor and heat spreader temperatures at the end of the idle slot following a task and, implicitly, the starting temperatures of the next task. Equations (33) and (34) compute the temperatures at the start of the first task, taking into consideration that this task starts after the idle period following the last task (the task set is executed periodically). Finally, the constraint in (36) requires that the processor temperatures during execution of the task set do not exceed the maximal allowable working temperature of the chip. The presented formulation is a convex nonlinear problem and can be solved efficiently in polynomial time [29].

² Idle periods are supposed to be short. If, exceptionally, they are not significantly shorter than the heat spreader RC time, we use the 2-RC circuit to model the temperature during the idle period in (31)–(34). This will not affect the convexity of the formulation.

SITDNOH Algorithm: The above formulation is based on the particular assumption that the temperature at the end of a task is known and fixed. However, in reality, this is not the case: the temperatures assigned in (27) and (28) at the termination of a task depend on the starting temperature of the task and, implicitly, on the distribution of the idle time. This turns the above formulation into a non-convex programming problem, which is very time consuming to solve. In order to solve the problem efficiently, we have developed an iterative heuristic, outlined in Fig. 5.

Fig. 5. SITDNOH heuristics.

The heuristic starts with an arbitrary initial ITD, for example, one in which the entire idle time is placed after the last task. Assuming this ITD and the given voltage levels, steady-state dynamic temperature analysis is performed, as described in Section V-B. Given the obtained SSDTC, the leakage energy consumption corresponding to the assumed ITD is calculated. From the SSDTC we can also extract the final processor and heat spreader temperatures for each task. Using these as the final temperatures in (27) and (28), we can calculate the idle time distribution using the convex optimization formulated in (23)–(35).

From the new ITD resulting from the optimization, we calculate a new SSDTC, which provides new temperatures at the end of each task. The new total leakage energy consumption, corresponding to the updated ITD, is also calculated. The process is continued assuming the new end temperatures in (27) and (28), and the convex optimization produces a new ITD.

The iterations stop when the temperature converges (i.e., when the end temperatures obtained in two successive iterations differ by less than a given threshold). However, it can happen that, after a certain point, additional iterations do not significantly improve the ITD. Therefore, even if convergence has not yet been reached, the optimization is stopped if the energy reduction achieved by the last iteration falls below a given threshold. Our experiments have shown that at most five iterations are needed with the threshold values we used.
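The control flow of this iteration can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the authors' implementation: `analyze` and `optimize` are hypothetical callbacks standing in for the SSDTC analysis of Section V-B and for the convex optimization with fixed end temperatures, and `eps_temp`, `eps_energy`, and `max_iter` are assumed names for the two stopping thresholds and an iteration cap. The toy callbacks at the end are not a thermal model; they exist only to make the example runnable.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical callback signatures (assumptions for illustration only):
#   analyze(itd)    -> (task end temperatures, total leakage energy)
#   optimize(temps) -> new ITD, computed with the end temperatures held fixed
Analyze = Callable[[Sequence[float]], Tuple[List[float], float]]
Optimize = Callable[[Sequence[float]], List[float]]


def sitdnoh(initial_itd: Sequence[float], analyze: Analyze, optimize: Optimize,
            eps_temp: float, eps_energy: float, max_iter: int = 20):
    """Alternate thermal analysis and optimization until a stop rule fires."""
    itd = list(initial_itd)
    temps, energy = analyze(itd)
    iterations = 0
    while iterations < max_iter:
        iterations += 1
        candidate = optimize(temps)
        new_temps, new_energy = analyze(candidate)
        temp_delta = max(abs(a - b) for a, b in zip(new_temps, temps))
        gain = energy - new_energy
        if gain >= 0.0:  # keep the candidate only if it is not worse
            itd, temps, energy = candidate, new_temps, new_energy
        # Stop on temperature convergence or on insignificant energy gain.
        if temp_delta < eps_temp or gain < eps_energy:
            break
    return itd, energy, iterations


# Toy stand-ins: two idle slots that share 10 time units.
def toy_analyze(itd):
    temps = [80.0 - 2.0 * x for x in itd]
    energy = 100.0 + sum((x - 5.0) ** 2 for x in itd)
    return temps, energy


def toy_optimize(temps):
    current = [(80.0 - t) / 2.0 for t in temps]
    return [(x + 5.0) / 2.0 for x in current]


itd, energy, iterations = sitdnoh([0.0, 10.0], toy_analyze, toy_optimize,
                                  eps_temp=1.0, eps_energy=0.01)
```

Keeping a candidate only when it does not increase the energy is a design choice of this sketch; the text itself only specifies the two stopping rules.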

B. SITD With Overhead (SITDOH)

The approach presented in Section VI-A is based on the assumption that the time and energy overheads for switching the processor to and from the idle state are zero, which is not the case in reality. If we consider the hypothetical case that the end temperature of each task is known, the problem can be formulated similarly to (23)–(35), with the main difference that the total energy to be minimized is given in (4). Based on this formulation, we could solve the SITDOH problem for the real case, when the end temperatures are not supposed to be known, similarly to the approach described in Fig. 5. However, the formulation with the objective function (4), due to its binary variable, is a mixed integer convex programming problem, which is very time consuming to solve. We, hence, propose an SITDOH heuristic based on the SITDNOH approach presented in Section VI-A.

Fig. 6. SITDOH heuristics. (a) Step 1. (b) Step 2.

Our SITDOH heuristic comprises two steps. In the first step, an optimization of the idle time distribution is performed by eliminating idle intervals whose lengths are smaller than a certain threshold limit. In the second step, the ITD is further refined in order to improve energy efficiency.

A lower bound on the length of an idle slot can be determined by considering the following two bounds.

1) No idle slot is allowed to be shorter than the total time needed to switch to/from the idle state.

2) The energy overhead due to switching should be compensated by the gain due to putting the processor into the idle state. The energy gain for an idle interval is computed by (37) from the leakage power that the processor would draw in the active state during the idle time interval; this power depends on the supply voltage and on the processor temperature as a function of time during the interval. Thus, in order for the overhead to be compensated, the energy gain has to exceed the energy overhead. As the energy gain depends on the temperature, the threshold length of an idle slot is not a given constant. Nevertheless, this length will always be larger than the bound obtained with the leakage power at the maximum temperature at which the processor is allowed to run.
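The two bounds can be combined into a single threshold. The sketch below assumes that the second bound takes the form of the switching energy overhead divided by the leakage power at the maximum allowed temperature; this form, and the names `t_o`, `e_o`, and `p_leak_max`, are assumptions made for illustration. The 0.4 ms and 0.5 mJ overheads are the values used in Section VIII, while the leakage powers are purely illustrative.

```python
def idle_slot_lower_bound(t_o: float, e_o: float, p_leak_max: float) -> float:
    """Lower bound on the length of a useful idle slot.

    t_o        -- time needed to switch to/from the idle state [s]
    e_o        -- energy overhead of the switching [J]
    p_leak_max -- leakage power at the maximum allowed temperature [W]
    """
    if p_leak_max <= 0.0:
        raise ValueError("leakage power must be positive")
    # A slot must be long enough for the switching itself (bound 1) and long
    # enough that even at maximum leakage the saved energy can cover e_o
    # (bound 2); both must hold, hence the maximum.
    return max(t_o, e_o / p_leak_max)


# With a hypothetical 2 W of leakage, the switching time dominates.
bound_hot = idle_slot_lower_bound(t_o=0.4e-3, e_o=0.5e-3, p_leak_max=2.0)
# With a hypothetical 0.5 W of leakage, the energy break-even dominates.
bound_cool = idle_slot_lower_bound(t_o=0.4e-3, e_o=0.5e-3, p_leak_max=0.5)
```

Because the real leakage power is at most the value at the maximum temperature, the real break-even length can only be longer than this bound, which is why a second refinement step is still needed.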

In conclusion, for the first step of the SITDOH heuristic [see Fig. 6(a)], we use the larger of the two bounds above as the threshold length.

The basic idea of the first step is that no idle slot is allowed to be shorter than this threshold. Thus, after running SITDNOH, the obtained ITD is checked slot by slot. If a slot length is smaller than the threshold, this slot will be removed. In order to achieve this, the particular constraint in (24) corresponding to that slot is changed such that the slot length is forced to zero. After all slots have been visited and (24) updated, SITDNOH is performed again. In the resulting ITD, all slots that were found too short in the previous iteration have disappeared, and the corresponding idle time has been redistributed among other tasks. The process is repeated until no slot shorter than the threshold is found.

After step 1, we can still be left with slots that are too short to be energy efficient. There are the following two reasons for this.

1) Because the processor is running at a temperature lower than the maximum allowed, it can happen that the real leakage power, and hence the energy gain of a slot, is smaller than the one considered in step 1.

2) Even if the energy gain of a slot exceeds the switching overhead, which means that an energy reduction due to the idle slot is obtained, energy efficiency can, possibly, be improved by eliminating the slot and distributing the corresponding idle time among other slots.

In the second step [see Fig. 6(b)], we start from the shortest idle slot and consider eliminating it [by forcing its length to zero in the corresponding constraint in (24)]. If the ITD obtained after applying SITDNOH is more energy efficient, the new ITD is accepted. The process is continued as long as, by eliminating a slot, the total energy consumption is reduced.
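The two steps can be sketched as a wrapper around SITDNOH. This is a minimal sketch under stated assumptions: `solve` is a hypothetical callback that runs SITDNOH with a given set of slots forced to zero and returns the slot lengths and the total energy, and `t_min` is an assumed name for the threshold length. The toy solver is not a thermal model; it only makes the example runnable.

```python
from typing import Callable, FrozenSet, List, Tuple

# Hypothetical signature: solve(forced_zero) -> (slot lengths, total energy)
Solver = Callable[[FrozenSet[int]], Tuple[List[float], float]]


def sitdoh(solve: Solver, t_min: float):
    forced: FrozenSet[int] = frozenset()
    slots, energy = solve(forced)

    # Step 1: remove every slot shorter than the threshold and re-optimize,
    # until no such slot remains.
    while True:
        too_short = {i for i, s in enumerate(slots) if 0.0 < s < t_min}
        if not too_short:
            break
        forced = forced | too_short
        slots, energy = solve(forced)

    # Step 2: try to eliminate the shortest remaining slot; keep going as
    # long as each elimination reduces the total energy.
    while True:
        active = [i for i, s in enumerate(slots) if s > 0.0]
        if not active:
            break
        shortest = min(active, key=lambda i: slots[i])
        trial_forced = forced | {shortest}
        trial_slots, trial_energy = solve(trial_forced)
        if trial_energy >= energy:
            break
        forced, slots, energy = trial_forced, trial_slots, trial_energy

    return slots, energy, forced


def toy_solver(forced):
    """Toy stand-in: 10 time units of idle time, a fixed cost per used slot."""
    weights, total_idle, e_o = [1.0, 2.0, 7.0], 10.0, 2.0
    free = [i for i in range(len(weights)) if i not in forced]
    if not free:
        return [0.0] * len(weights), 100.0  # no idling, so no gain at all
    wsum = sum(weights[i] for i in free)
    slots = [total_idle * weights[i] / wsum if i in free else 0.0
             for i in range(len(weights))]
    return slots, 100.0 - total_idle + e_o * len(free)


slots, energy, forced = sitdoh(toy_solver, t_min=1.5)
```

In the toy run, step 1 removes the first slot (length 1.0), step 2 removes the second, and the attempt to remove the last slot is rejected because it would forfeit all idle gain.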

VII. ITD WITH DYNAMIC AND STATIC SLACK (DITD)

The above SITD approach determines idle time settings assuming that tasks always execute their WNC. However, in order to exploit the dynamic slack, the length of the idle slot following a task has to be determined at run-time, based on the values of the current time and temperature after termination of that task. In principle, calculating the appropriate slot length implies the execution of a temperature aware ITD algorithm similar to the one described in Section VI-B [with the objective function and constraints in (8)–(10)]. Running this algorithm online, after the execution of each task, implies a time and energy overhead which is not acceptable.

To overcome the above problem, we have divided our DITD approach into an offline and an online phase. In the offline phase, idle time settings for all tasks are pre-computed, based on possible finishing times and finishing temperatures of the task. The results are stored in lookup tables (LUTs), one for each task. In Fig. 7, we show two such tables. They contain idle time settings for combinations of possible termination times and finishing temperatures of a task.

A. Online Phase

The online phase is illustrated in Fig. 7. Each time a task terminates, the length of the idle time slot following its termination has to be fixed; the online scheme chooses the appropriate setting from the LUT of that task, depending on the actual time and temperature sensor reading. If there is no exact entry in the LUT corresponding to the actual time/temperature, the entry corresponding to the immediately higher time and closest temperature value is selected. For example, in Fig. 7, a task finishes at time 1.35 ms with a temperature of 78 °C. To determine the appropriate idle time slot length, the LUT of this task is accessed. As there is no exact entry for 1.35 ms and 78 °C, the entry corresponding to termination time 1.5 ms and temperature 70 °C is chosen. Hence, the processor will be switched to the idle state for 0.5 ms before the next task starts.

Fig. 7. DITD online phase.

We should notice that, according to our temperature model presented in Section V, the state of the system is defined by both the die and the heat spreader temperatures. In our LUTs, however, we only consider the die temperature when deciding on the idle slack. This is due to the following reasons.

1) It is both impractical and potentially expensive to obtain, at run-time, temperature readings from the heat spreader.

2) The variations of the heat spreader temperature are small compared to those of the chip. This is due to the fact that the heat capacitance of the heat spreader is much larger than that of the chip.

3) Considering the heat spreader temperature as an additional dimension in the LUTs would dramatically increase the size of the tables without a significant contribution to energy efficiency.

Thus, when generating the LUTs, we will consider that, at the termination of a task, the heat spreader has a certain expected temperature. In Section VII-E we will show how this expected temperature is calculated.
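The online selection rule described in this subsection (immediately higher time, closest die temperature) can be sketched as follows. The table layout, the function name `lookup_idle_time`, and all table values except the 1.5 ms / 70 °C entry of 0.5 ms are hypothetical; falling back to the last row when the finishing time exceeds every entry is also an assumption of this sketch.

```python
from bisect import bisect_left
from typing import List, Sequence


def lookup_idle_time(times: Sequence[float], temps: Sequence[float],
                     table: List[List[float]],
                     t_now: float, temp_now: float) -> float:
    """Return the idle slot length stored for (t_now, temp_now).

    times -- sorted finishing times of the LUT rows [ms]
    temps -- die temperatures of the LUT columns [degrees C]
    table -- table[row][col] = idle slot length [ms]
    """
    # Row: exact match if present, otherwise the immediately higher time.
    row = min(bisect_left(times, t_now), len(times) - 1)
    # Column: closest temperature; readings outside the covered interval
    # therefore map to the lowest/highest temperature entry.
    col = min(range(len(temps)), key=lambda j: abs(temps[j] - temp_now))
    return table[row][col]


times = [1.3, 1.5, 1.7]          # hypothetical row values [ms]
temps = [50.0, 70.0, 90.0]       # hypothetical column values [degrees C]
table = [[0.7, 0.6, 0.5],
         [0.6, 0.5, 0.4],
         [0.4, 0.3, 0.2]]

idle = lookup_idle_time(times, temps, table, t_now=1.35, temp_now=78.0)
```

The lookup costs one binary search over the time rows and one scan over the temperature columns, which is consistent with the goal of keeping the online overhead small.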

B. Offline Phase

In the offline phase, one LUT is generated for each task. The LUT generation algorithm is illustrated in Fig. 8. The outermost loop iterates over the set of tasks and successively constructs the table for each task. The next loop generates the entries corresponding to the various possible finishing temperatures of the task. Finally, the innermost loop iterates, for each possible finishing temperature, over all considered termination times of the task.

Fig. 8. DITD offline phase.

The algorithm starts by computing the earliest and latest possible finishing times, as well as the lowest and highest possible finishing temperatures, for each task. With a given finishing time and finishing temperature of a task, the innermost loop performs the slack distribution step DITDOH, iteratively. We describe the DITDOH algorithm in Section VII-C. For successive iterations, the finishing time and temperature are increased by the time and temperature quanta, respectively. The calculation of these time and temperature bounds, as well as the determination of the granularities and number of entries along the time and temperature dimensions, are presented in Sections VII-D and VII-E, respectively.
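The three nested loops of the offline phase can be sketched as follows. This is a minimal sketch, assuming that the bounds of each task are already known; `TaskBounds`, `grid`, and `build_luts` are hypothetical names, `ditdoh` is a placeholder callback for the optimization of Section VII-C, and appending the upper bound as a final grid point is an assumption made so that every finishing time has an entry at or above it.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class TaskBounds:
    name: str
    eft: float        # earliest possible finishing time
    lft: float        # latest possible finishing time
    temp_low: float   # lowest possible finishing temperature
    temp_high: float  # highest possible finishing temperature


def grid(low: float, high: float, step: float) -> List[float]:
    """Points low, low+step, ... up to high; high is appended if missed."""
    count = math.floor((high - low) / step + 1e-9) + 1
    points = [low + k * step for k in range(count)]
    if points[-1] < high - 1e-9:
        points.append(high)
    return points


def build_luts(tasks: List[TaskBounds], time_quantum: float,
               temp_quantum: float,
               ditdoh: Callable[[TaskBounds, float, float], float]
               ) -> Dict[str, Dict[Tuple[float, float], float]]:
    luts = {}
    for task in tasks:                                    # outermost: tasks
        table = {}
        for temp in grid(task.temp_low, task.temp_high, temp_quantum):
            for finish in grid(task.eft, task.lft, time_quantum):
                # One optimization run per (time, temperature) entry.
                table[(finish, temp)] = ditdoh(task, finish, temp)
        luts[task.name] = table
    return luts


tasks = [TaskBounds("t1", eft=1.0, lft=2.0, temp_low=50.0, temp_high=90.0)]
luts = build_luts(tasks, time_quantum=0.5, temp_quantum=20.0,
                  ditdoh=lambda task, finish, temp: 0.0)  # dummy optimizer
```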

C. DITDOH Algorithm

When calculating the actual LUT entries for a task, the ITD algorithm DITDOH is performed to determine the length of the idle slot following the termination of the task, with given termination time and temperature, based on the problem formulation described in Section IV-B. DITDOH is similar to SITDOH outlined in Section VI-B. However, unlike the formulation used in SITDOH [see (23)–(36)], which is based on SSDTC estimation, the formulation used for DITDOH is based on the estimation of a transient temperature curve (TTC), described in Section V-C. Since we do not rely on the fact that successive iterations of the application are identical and that tasks always execute their worst case number of cycles, we do not calculate an SSDTC corresponding to the dynamic steady state; instead, we estimate a TTC.

The formulation used for DITDOH is shown in (38)–(53).

Formulation 2: DITD with No Overheads Consideration

Minimize: (38)

Subject to: (39)–(53)

As mentioned in Section IV-B, the energy is optimized for the case that the future tasks execute their expected workload, which, in reality, happens with a much higher probability than, e.g., the worst case (nevertheless, idle time slots are distributed such that, even in the worst case, deadlines are satisfied). The objective function (38) to be minimized is the total leakage energy of the further tasks to be executed in the current iteration. Equation (38) is similar to (23), with the following two differences.

1) It refers only to the remaining tasks.

2) The execution interval of a task, which is divided into sub-intervals, does not correspond to the worst case but to the expected case.

The optimization variables to be calculated are the idle slot lengths. Equation (40) requires that the sum of all idle slot lengths be equal to the total available idle time, which is computed from the current task's finishing time. The total available idle time is calculated based on the assumption that all future tasks are executed with their expected workload.

Equation (41) guarantees the deadline of the next task to be executed after the termination of the current task, in the worst case (i.e., when that task executes its WNC). In order to guarantee that all future tasks meet their deadlines in the worst case, (42) requires that the next task finishes before its latest finishing time, in the worst case. The latest finishing time (see Section VII-D) is the latest termination time of a task that still allows the future tasks following it to satisfy their deadlines even if their worst case workloads are executed. Thus, (41) and (42) guarantee not only that the deadline of the next task is satisfied in the worst case but also that it finishes in time for all the remaining tasks to be able to meet their deadlines in the worst case. Equation (43) enforces the deadlines of the remaining tasks, considering that they execute their expected workload. This means that the idle time following the current task is determined such that it guarantees deadlines to be satisfied in the worst case but is optimized for the situation that tasks execute their expected workloads.

Similar to (27) and (28), (44) and (45) specify the processor and heat spreader temperatures at the finishing of each task. Equation (46) computes the processor temperature at the beginning of the next task, similar to (33), from the chip temperature at the termination of the current task. Similarly, (47) computes the heat spreader temperature at the beginning of the next task from the expected heat spreader temperature at the termination of the current task, as described in Section VII-A. This expected temperature is pre-calculated, as will be explained in Section VII-E.

Equations (48)–(53) compute the TTC of the processor and heat spreader based on our TTC estimation method described in Section V-C, using the processor temperatures at the beginning and end of each sub-interval during the execution of a task. Finally, throughout the execution of the future tasks, the processor temperatures should not exceed the maximal allowable working temperature of the chip, as imposed by the constraint in (53). The above formulation is a convex nonlinear problem and can be solved efficiently in polynomial time [29].

D. Time Bounds and Granularity

In the first step of the algorithm in Fig. 8, the earliest and latest finishing times of each task are calculated. The earliest finishing time is calculated based on the situation that all tasks execute their best case execution time. The latest finishing time is calculated as the latest termination time of a task that still allows all subsequent tasks to satisfy their deadlines when they execute their worst case execution time.

With the time interval between the earliest and latest finishing times of a task, a straightforward approach to determine the number of entries along the time dimension would be to allocate the same number of entries for each task. However, the time interval sizes can differ very much among tasks, which should be taken into consideration when deciding on the number of time entries. Therefore, given a total number of entries along the time dimension, we allocate the time entries of each LUT in proportion to the size of the time interval of the corresponding task.

The corresponding granularity along the time dimension is the same for all tasks and is obtained by dividing the sum of all time interval sizes by the total number of entries along the time dimension.
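A minimal sketch of this allocation, assuming that entries are distributed in proportion to the length of each task's interval between earliest and latest finishing time, and that the shared granularity is the total interval length divided by the total number of entries. The function name `time_entries`, the rounding to the nearest integer, and the minimum of one entry per task are assumptions of the sketch, as are the example values.

```python
from typing import List, Sequence, Tuple


def time_entries(intervals: Sequence[Tuple[float, float]],
                 total_entries: int) -> Tuple[List[int], float]:
    """Split total_entries over tasks in proportion to (LFT - EFT).

    intervals -- one (earliest, latest) finishing time pair per task
    Returns the per-task entry counts and the common time granularity.
    """
    lengths = [latest - earliest for earliest, latest in intervals]
    total_length = sum(lengths)
    if total_length <= 0.0 or total_entries <= 0:
        raise ValueError("need a positive total interval and entry count")
    granularity = total_length / total_entries
    # Rounding is an assumption; every task keeps at least one entry.
    counts = [max(1, round(length / granularity)) for length in lengths]
    return counts, granularity


# Two hypothetical tasks with intervals of 1 ms and 3 ms share 8 entries.
counts, granularity = time_entries([(1.0, 2.0), (3.0, 6.0)], total_entries=8)
```

With a common granularity, a task with a three times larger interval receives three times as many entries, instead of a coarser table.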

E. Temperature Bounds and Granularity

The granularity along the temperature dimension is the same for all tasks and has been determined experimentally. Our experiments have shown that values around 15 °C are appropriate, in the sense that finer granularities will only marginally improve energy efficiency.

To determine the number of entries along the temperature dimension, we need to calculate the temperature interval at the termination of each task. In fact, it is not necessary to determine the bounds of the temperature interval exactly. A good estimation, such that, at run-time, temperature readings outside the determined interval will happen rarely, is sufficient. If a temperature reading exceeds the upper/lower bound of the interval, the idle time setting corresponding to the highest/lowest temperature value available in the LUT will be used. We have developed an estimation technique for the temperature interval which balances computation complexity and accuracy of the results.

In order to estimate the temperature bounds, we define the following two run-time scenarios.

• Worst Case Execution Scenario: the actual execution time of each task is always equal to its worst case execution time.

• Best Case Execution Scenario: the actual execution time of each task is always equal to its best case execution time.

In both scenarios, the processor will execute the corresponding periodic power pattern repeatedly and the processor temperature will eventually reach the corresponding steady-state dynamic temperature curve (one SSDTC for the worst case scenario and one for the best case scenario). From the corresponding SSDTC, we can obtain, for each task, its finishing temperature. We use the finishing temperature of a task in the worst case execution scenario as the upper bound of its finishing temperature; the finishing temperature of the task in the best case execution scenario will be used as the lower bound.

In order to obtain the worst case SSDTC, we first perform the SITDOH heuristic (see Fig. 6). Then, temperature analysis (see Section V) produces the temperature curve for the worst case scenario with the corresponding idle time distribution generated by SITDOH. The best case SSDTC is obtained in a similar way, by replacing the worst case execution times with the best case execution times in the constraint in (25).

With the upper and lower bounds obtained for each task, the number of entries along the temperature dimension for a task is obtained by dividing the size of its temperature interval by the granularity along the temperature dimension.

As mentioned in Section VII-A, when generating the LUTs, we consider that, at the termination of a task, the heat spreader has a certain expected temperature. In order to obtain these temperatures, we perform the same procedure as outlined above but, in this case, considering the expected execution time of each task. We obtain the temperature curve corresponding to the heat spreader (see Section V-B), from which we extract the expected temperature of the heat spreader at the termination of each task.
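The entry count along the temperature dimension and the handling of out-of-range readings described in this subsection can be sketched as follows. Rounding the count up, the function names, and the example bounds are assumptions; only the granularity of about 15 °C comes from the text.

```python
import math


def temperature_entries(temp_low: float, temp_high: float,
                        granularity: float) -> int:
    """Number of LUT entries along the temperature dimension for one task."""
    if granularity <= 0.0 or temp_high < temp_low:
        raise ValueError("invalid bounds or granularity")
    # Rounding up (an assumption) keeps the whole interval covered.
    return max(1, math.ceil((temp_high - temp_low) / granularity))


def clamp_reading(temp: float, temp_low: float, temp_high: float) -> float:
    """Map a sensor reading outside the interval to the nearest bound."""
    return min(max(temp, temp_low), temp_high)


# Hypothetical bounds taken from the best and worst case scenarios.
entries = temperature_entries(temp_low=55.0, temp_high=95.0, granularity=15.0)
too_hot = clamp_reading(100.0, 55.0, 95.0)
too_cold = clamp_reading(40.0, 55.0, 95.0)
```

Clamping reflects the rule that a reading beyond the estimated interval uses the setting of the highest or lowest temperature available in the LUT, which is why the bounds only need to be good estimates.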

VIII. EXPERIMENTAL RESULTS

A. Evaluation of the Thermal Model

Experimental Setup: We have evaluated our thermal model considering platforms with parameter settings based on values from [30], [31], and [32]. We consider die areas of 6 × 6, 8 × 8, and 10 × 10 mm². The heat spreader area is five times the die area, and the heat sink area is between 1.3 and 1.4 times the area of the heat spreader. The thicknesses of the die and heat spreader are 0.5 and 2 mm, respectively. The thickness of the heat sink is between 10 and 20 mm. The coefficients corresponding to the power model in Section II-A are based on [2] and [24]. For the temperature calculation (see Sections V-B and V-C), we have considered a piecewise linear leakage model with three segments, as recommended in [26].

Fig. 9. SSDTC estimation with our approach versus Hotspot.

Accuracy: We first performed a set of experiments to evaluate the accuracy of our temperature analysis approach proposed in Section V. We randomly generated 500 periodic voltage patterns corresponding to applications with periods in the range between 5 and 100 ms. For each application, considering the coefficients and platform parameters outlined above, we have computed the SSDTC using the approach proposed in Section V-B and by using Hotspot simulation. For each pair of temperature curves obtained, we calculated the maximum deviation as the largest temperature difference between any corresponding pair of points (in absolute value), as well as the average deviation. Fig. 9 illustrates the results for different application periods. For applications with a period of 50 ms, for example, there is no single case with a maximum deviation larger than 2.1 °C, and the average deviation is 0.8 °C. Over all 500 applications, the average and maximum deviations are 0.8 °C and 3.8 °C, respectively. We can observe that the deviation increases with the increasing period of the application. This is due to the fact that, with larger periods, accuracy can be slightly affected by neglecting the thermal capacitance of the heat sink (see Section V-A).

Computation Time: We have compared the computation time of our SSDTC generation approach with the time needed by Hotspot. Fig. 9 illustrates the average speedup as the ratio of the two execution times. The speedup is between 3000 for periods of 5 ms and 20 for 100 ms periods. An increasing period leads to a larger linear system that has to be solved for SSDTC estimation (see Section V-B), which explains the shape of the speedup curve in Fig. 9.

The accuracy and speedup of our approach are also dependent on the length of the sub-interval considered for the temperature analysis (see Section V-B and Fig. 4). For the experiments throughout this paper, the length of the sub-interval is 2 ms. This is based on the observation that reducing the length beyond this limit does not improve the accuracy significantly.
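The two accuracy metrics and the speedup used in this subsection can be computed as in the following sketch. It assumes that both temperature curves are sampled at the same time points and that the average deviation is the mean absolute difference; the function names and the sample values are hypothetical.

```python
from typing import Sequence, Tuple


def curve_deviation(curve_a: Sequence[float],
                    curve_b: Sequence[float]) -> Tuple[float, float]:
    """Maximum and average absolute difference between two sampled curves."""
    if len(curve_a) != len(curve_b) or not curve_a:
        raise ValueError("curves must be non-empty and equally sampled")
    diffs = [abs(a - b) for a, b in zip(curve_a, curve_b)]
    return max(diffs), sum(diffs) / len(diffs)


def speedup(reference_time: float, our_time: float) -> float:
    """Ratio of the reference execution time to our execution time."""
    return reference_time / our_time


# Hypothetical temperature samples [degrees C] from the two analyses.
estimated = [60.0, 70.0, 80.0]
simulated = [61.0, 70.0, 78.0]
max_dev, avg_dev = curve_deviation(estimated, simulated)
```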

B. Evaluation of the ITD Heuristics

We have used both generated test applications and a real-life example in our experiments to evaluate our DITD approach presented in Section VII. This, implicitly, also evaluates the SITD approach in Section VI.

Experimental Setup: We have randomly generated 100 test applications consisting of 30 to 100 tasks. The workload in the worst case (WNC) for each task is generated randomly within a given range of clock cycles, while the workload in the best case is generated within a second given range of clock cycles. To generate the expected workload of each task, the following steps are performed.

1) The value of the expected total dynamic idle time is given as an input; it is the total dynamic slack when all tasks execute their workload in the expected case.

2) The expected total dynamic idle time is divided into a number of sub-intervals of equal length.

3) The sub-intervals are allocated among all tasks based on a uniform distribution; as a result, each task is allocated a number of sub-intervals.

4) The expected workload of each task is, thus, determined from its WNC, the sub-intervals allocated to it, and the processor frequency at which the task is executed.
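The four steps above can be sketched as follows. The sketch assumes that the expected workload of a task is its WNC minus the cycles that fit into its allocated sub-intervals at the task's frequency, which is consistent with defining the dynamic slack as the time saved when tasks run their expected instead of their worst case workload; this expression, the function name, and all numeric values are assumptions. The sketch does not enforce the best case workload as a lower limit.

```python
import random
from typing import List, Sequence, Tuple


def expected_workloads(wnc: Sequence[float], freq: Sequence[float],
                       total_dyn_slack: float, n_sub: int,
                       rng: random.Random) -> Tuple[List[float], List[int]]:
    """Derive an expected number of cycles per task from a slack budget.

    wnc             -- worst case number of cycles per task
    freq            -- processor frequency per task [Hz]
    total_dyn_slack -- expected total dynamic idle time [s] (step 1)
    n_sub           -- number of equal sub-intervals (step 2)
    """
    delta = total_dyn_slack / n_sub
    counts = [0] * len(wnc)
    for _ in range(n_sub):
        # Step 3: each sub-interval goes to a uniformly chosen task.
        counts[rng.randrange(len(wnc))] += 1
    # Step 4 (assumed form): remove the cycles that fit into the
    # allocated sub-intervals at the task's frequency.
    enc = [w - c * delta * f for w, c, f in zip(wnc, counts, freq)]
    return enc, counts


wnc = [1.0e6, 2.0e6, 3.0e6]
freq = [1.0e9, 1.0e9, 1.0e9]
enc, counts = expected_workloads(wnc, freq, total_dyn_slack=1.0e-3,
                                 n_sub=10, rng=random.Random(0))
recovered_slack = sum((w - e) / f for w, e, f in zip(wnc, enc, freq))
```

By construction, the slack recovered from the generated workloads equals the slack budget given as input, independently of how the sub-intervals happen to be allocated.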

In order to evaluate our DITD technique, we have considered a straightforward approach (SFA) for comparison. This SFA scenario corresponds to the natural execution procedure for the case when no idle time distribution is performed. Following this approach, tasks are executed according to a static schedule generated based on the worst case execution time. According to this schedule, the static slack is placed at the end of the application, after the last task. At run-time, when the tasks execute less than their WNC and the generated dynamic slack is large enough, the processor is put in idle mode.

We have applied both the DITD and SFA approaches on the same test applications. When we simulate the execution of the test applications, the actual number of executed clock cycles of a task is generated using a random number generator according to the beta distribution [38]. The two parameters of the distribution are determined based on: 1) the expected workload and 2) a given standard deviation of the executed clock cycles of the task. The Hotspot system [7] is used to simulate the sensor readings which track the temperature behavior of the platform during the execution of a test application.

In our experiments, the granularity along the time and temperature dimensions for the LUTs is set to 1.5–2.0 ms and 15–20 °C, respectively. It is important to mention that in all our experiments we have accounted for the time and energy overhead imposed by the online phase of our DITD. Similarly, we have also taken into consideration the energy overhead due to the memory access. This overhead has been calculated based on the energy values given in [33] and [34]. The energy and time overheads due to power state switching are set to 0.5 mJ and 0.4 ms, respectively, according to [6].

After applying both the DITD and SFA approaches on a test application, we compute the corresponding leakage energy reduction due to our DITD approach compared to the SFA as the difference between the leakage energy consumed with the SFA and with the DITD approach, expressed as a percentage of the leakage energy consumed with the SFA.

Leakage Energy Reduction Versus Slack Time Ratios: We first performed experiments considering different combinations of static and dynamic idle time ratios. The static idle time ratio is computed relative to the deadline of the last task in execution order. The dynamic idle time ratio is calculated from the total dynamic slack when all tasks execute their workload in the expected case, as described earlier in this section. Fig. 10 shows the averaged leakage energy reduction over all test applications. The energy reduction achieved by DITD grows with the available amount of static and dynamic slack. For one of the evaluated ratio combinations, for example, leakage energy can be reduced by 20% by applying our DITD approach.

Fig. 10. Leakage energy reduction with switching overheads.

The DITD approach proposed in this paper achieves leakage energy reduction due to two main features: 1) it is temperature aware, which means that idle time is distributed such that the temperature is controlled in order to minimize leakage, and 2) it redistributes slack such that the number of idle slots which are too short to switch power state is minimized. The following question has to be answered: how much does the temperature awareness of our approach contribute to the energy reduction? In order to answer this question, we have repeated the above experiments considering a hypothetical scenario with zero switching overhead (0 mJ and 0 ms). The results are shown in Fig. 11. Under such a scenario, the processor can be switched to the low power state for the duration of the total idle time (regardless of the length of the individual idle slots). Thus, the energy gains obtained with DITD compared to SFA, as illustrated in Fig. 11, are exclusively due to the temperature awareness of the DITD approach.

Fig. 11. Leakage energy reduction with no switching overheads.

Fig. 12. Leakage energy reduction with different standard deviations.

From Figs. 10 and 11 one can also observe the efficiency of the ITD approach with only static slack (SITD, Section VI).

Fig. 13. Computation time.

The cases with no dynamic slack correspond, in fact, to those situations when only static slack is distributed. Obviously, in the cases with neither static nor dynamic slack, there is no slack to distribute and, thus, the energy reduction is zero.

Leakage Energy Reduction Versus Standard Deviation: As mentioned, for our experiments we have generated workloads for each task according to a beta distribution whose two parameters are determined based on the expected workload and the standard deviation of the executed workload. For the above experiments, a single standard deviation setting has been used for each task. As the standard deviation has an influence on the potential leakage reduction, we have repeated the above experiments, considering three different settings of the standard deviation. Fig. 12 shows the leakage reduction (in %) obtained by applying our DITD approach relative to the SFA, with different standard deviation settings. We have considered test applications with fixed static and dynamic ratios. As can be observed, the efficiency of the DITD approach increases as the standard deviation decreases. This is due to the fact that our DITD algorithm is targeted towards optimizing the energy consumption for the case that tasks execute the expected number of cycles (ENC). When the standard deviation is smaller, more of the actually executed numbers of clock cycles cluster around the ENC and, therefore, our DITD approach can achieve better leakage reduction.

Computation Time: We have also evaluated the computation time for the offline phase of our DITD approach. The results are given in Fig. 13.

MPEG2 Decoder: We have applied our DITD approach to a real-life application, namely an MPEG2 decoder, which consists of 34 tasks.³ We have considered a platform with the size of the chip, heat spreader, and heat sink of 8 × 8 mm², 18 × 18 mm², and 22 × 22 mm², respectively. The thicknesses of the chip, heat spreader, and heat sink are 0.5, 2, and 15 mm, respectively. The execution time distribution of the tasks has been obtained from simulations on the MPARM platform [35]. We considered the following two settings of the energy and time overheads: 1) 0.5 mJ and 0.4 ms and 2) 1.0 mJ and 0.8 ms. The leakage energy reduction by applying our DITD approach relative to the SFA approach is 32.5% and 40.8%, respectively.
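The beta-distributed workload generation used in these experiments can be sketched with the standard method of moments. The sketch assumes that the distribution is scaled to the interval between the best and worst case number of cycles and that its mean is the expected workload; this scaling, the function names, and the numeric values are assumptions, and `random.betavariate` is the Python standard library sampler.

```python
import random
from typing import Tuple


def beta_params(mean: float, std: float,
                low: float, high: float) -> Tuple[float, float]:
    """Shape parameters of a beta distribution on [low, high] that has the
    given mean and standard deviation (method of moments)."""
    span = high - low
    m = (mean - low) / span          # mean rescaled to [0, 1]
    v = (std / span) ** 2            # variance rescaled to [0, 1]
    if not 0.0 < m < 1.0 or v <= 0.0 or v >= m * (1.0 - m):
        raise ValueError("mean/std not achievable by a beta distribution")
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common


def sample_cycles(mean: float, std: float, low: float, high: float,
                  rng: random.Random) -> float:
    """Draw one executed workload between the best and worst case."""
    alpha, beta = beta_params(mean, std, low, high)
    return low + (high - low) * rng.betavariate(alpha, beta)


# Hypothetical task: 1000 to 2000 cycles, expected 1400, deviation 100.
rng = random.Random(1)
samples = [sample_cycles(1400.0, 100.0, 1000.0, 2000.0, rng)
           for _ in range(1000)]
alpha, beta = beta_params(0.5, 0.1, 0.0, 1.0)
```

A smaller standard deviation yields larger shape parameters and therefore samples that cluster more tightly around the expected workload, which is the effect examined in Fig. 12.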

³ http://ffmpeg.mplayerhq.hu/

IX. CONCLUSION

We first proposed a static temperature aware ITD approach for leakage energy optimization where only static slack is considered. In order to consider both static and dynamic slack, we then proposed a dynamic temperature aware ITD approach, which consists of an offline and an online step. The experiments have demonstrated that considerable energy reduction can be achieved by our temperature aware ITD approaches. In order to efficiently perform temperature analysis inside our optimization loop for idle time distribution, we have also proposed a fast and accurate system level temperature analysis approach.

REFERENCES[1] A. Andrei, P. Eles, and Z. Peng, Energy optimization of multipro-

cessor systems on chip by voltage selection, IEEE Trans. Very LargeScale Integr. (VLSI) Syst., vol. 15, no. 3, pp. 262275, Mar. 2007.

[2] Y. Liu, H. Yang, R. Dick, H. Wang, and L. Shang, Thermal vs energyoptimization for DVFS-enabled processors in embedded systems, inProc. ISQED, 2007, pp. 204209.

[3] T. Ishihara and H. Yasuura, Voltage scheduling problem for dy-namically variable voltage processors, in Proc. ISLPED, 1998, pp.197202.

[4] A. Andrei, P. Eles, Z. Peng, M. Schmitz, and B. M. Al-Hashimi,Quasi-static voltage scaling for energy minimization with timeconstraints, in Proc. DATE, 2005, pp. 514519.

[5] C. Xian, Y. H. Lu, and Z. Li, Dynamic voltage scaling for mul-titasking real-time systems with uncertain execution time, IEEETrans. Comput.-Aided Design Integr. Circuits Syst., vol. 27, no. 8, pp.14671478, Aug. 2008.

[6] R. Jejurikar, C. Pereira, and R. Gupta, Leakage aware dynamicvoltage scaling for realtime embedded systems, in Proc. DAC, 2004,pp. 275280.

[7] W. Huang, S. Ghosh, S. Velusamy, K. Sankaranarayanan, K. Skadron,and M. Stan, Hotspot: A compact thermal modeling methodology forearly-stage VLSI design, IEEE Trans. Very Large Scale Integr. (VLSI)Syst., vol. 14, no. 5, pp. 501513, May 2006.

[8] Y. Yang, Z. P. Gu, R. P. Dick, and L. Shang, Isac: Integrated spaceand time adaptive chip-package thermal analysis, IEEE Trans.Comput.-Aided Design Integr. Circuits Syst., vol. 26, no. 1, pp. 8699,Jan. 2007.

[9] S. Wang and R. Bettatin, Delay analysis in temp.-constrained hardreal-time systems with general task arrivals, in Proc. RTSS, 2006, pp.323334.

[10] R. Jayaseelan and T. Mitra, A hybrid local-global approach for multi-core thermal management, in Proc. ICCAD, 2008, pp. 618623.

[11] S. Zhang and K. S. Chatha, System-level thermal aware design ofapplications with uncertain execution time, in Proc. ICCAD, 2008,pp. 242249.

[12] R. Rao and S. Vrudhula, Fast and accurate prediction of the steady-state throughput of multicore processors under thermal constraints,IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 28, no.10, pp. 15591572, Oct. 2009.

[13] D. Brooks, R. P. Dick, R. Joseph, and L. Shang, Power, thermal,and reliability modeling in nanometer-scale microprocessors, IEEEMicro, vol. 27, no. 3, pp. 4962, May 2007.

[14] B. Nikhil, K. Tracy, and P. Kirk, Speed scaling to manage energy andtemperature, J. ACM, vol. 54, no. 1, pp. 139, 2007.

[15] T. Chantem, R. Dick, and X. Hu, Temperature-aware scheduling andassignment for hard real-time applications on mpsocs, in Proc. DATE,2008, pp. 288293.

[16] Y. Ge, P. Malani, and Q. Qiu, Distributed task migration for thermalmanagement inmany-core systems, inProc. DAC, 2010, pp. 579584.

[17] S. Zhang and K. S. Chatha, Approximation algorithm for thetemperature-aware scheduling problem, in Proc. ICCAD, 2007, pp.281288.

[18] S. Zhang andK. Chatha, Thermal aware task sequencing on embeddedprocessors, in Proc. DAC, 2010, pp. 585590.

[19] R. Rao and S. Vrudhula, Efficient online computation of core speedsto maximize the throughput of thermally constrained multi-core pro-cessors, in Proc. ICCAD, 2008, pp. 537542.

[20] M. Bao, A. Andrei, P. Eles, and Z. Peng, "Temperature-aware voltage selection for energy optimization," in Proc. DATE, 2008, pp. 1083–1086.

[21] L. Yuan, S. Leventhal, and G. Qu, "Temperature-aware leakage minimization technique for real-time systems," in Proc. ICCAD, 2006, pp. 761–764.

[22] C. Yang, J. Chen, L. Thiele, and T. Kuo, "Energy-efficient real-time task scheduling with temperature-dependent leakage," in Proc. DATE, 2010, pp. 9–14.

[23] M. Bao, A. Andrei, P. Eles, and Z. Peng, "Temperature-aware idle time distribution for energy optimization with dynamic voltage scaling," in Proc. DATE, 2010, pp. 21–26.

[24] S. Martin, K. Flautner, T. Mudge, and D. Blaauw, "Combined dynamic voltage scaling and adaptive body biasing for lower power microprocessors under dynamic workloads," in Proc. ICCAD, 2002, pp. 721–725.

[25] W. P. Liao, L. He, and K. M. Lepak, "Temperature and supply voltage aware performance and power modeling at micro-architecture level," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 7, pp. 1042–1053, Jul. 2005.

[26] Y. Liu, R. Dick, L. Shang, and H. Yang, "Accurate temperature-dependent integrated circuit leakage power estimation is easy," in Proc. DATE, 2007, pp. 1–6.

[27] R. Rao and S. Vrudhula, "Performance optimal processor throttling under thermal constraints," in Proc. CASES, 2007, pp. 257–266.

[28] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge, U.K.: Cambridge Univ. Press, 2007.

[29] Y. Nesterov and A. Nemirovskii, Interior-Point Polynomial Algorithms in Convex Programming. Philadelphia, PA: Society for Industrial and Applied Mathematics, 1994.

[30] IBM, "PowerPC 970MP thermal considerations," Appl. Note.

[31] Intel, "Intel Core 2 Duo Mobile Processors on 65-nm process for embedded applications: Thermal design guide."

[32] Intel, "Intel Core 2 Duo Mobile Processors on 45-nm process for embedded applications: Thermal design guide."

[33] S. Hsu et al., "A 4.5-GHz 130-nm 32-KB L0 cache with a leakage-tolerant self reverse-bias bitline scheme," IEEE J. Solid-State Circuits, vol. 38, no. 5, pp. 755–761, May 2003.

[34] A. Macii, E. Macii, and M. Poncino, "Improving the efficiency of memory partitioning by address clustering," in Proc. DATE, 2003, pp. 18–23.

[35] L. Benini, D. Bertozzi, D. Bruni, N. Drago, F. Fummi, and M. Poncino, "SystemC cosimulation and emulation of multiprocessor SoC designs," Computer, vol. 36, pp. 53–59, 2003.

Min Bao received the M.S. degree in computer engineering from the University of Electronic Science and Technology, China, in 2007. She is currently pursuing the Ph.D. degree at Linkoping University.

Her research interests include embedded systems, low-power and temperature-aware design, and software/hardware co-design.

Alexandru Andrei received the M.S. degree in computer engineering from Politehnica University, Timisoara, Romania, in 2001 and the Ph.D. degree from Linkoping University, Linkoping, Sweden, in 2007.

He is with Ericsson, Linkoping, Sweden. His research interests include embedded systems architectures and design, low-power design, real-time systems, and hardware/software codesign.

Petru Eles (M'99) is a Professor of embedded computer systems with the Department of Computer and Information Science (IDA), Linkoping University, Linkoping, Sweden. His current research interests include embedded systems, real-time systems, electronic design automation, cyber-physical systems, hardware/software codesign, low power system design, fault-tolerant systems, and design for test. He has published a large number of technical papers in these areas and co-authored several books.

Zebo Peng (M'91–SM'02) received the Ph.D. degree in computer science from Linkoping University, Linkoping, Sweden, in 1987.

Currently, he is a Professor of computer systems and Director of the Embedded Systems Laboratory, Linkoping University. His research interests include design and test of embedded systems, SoC testing, hardware/software co-design, and real-time systems. He has published over 250 technical papers and co-authored several books in these areas.