VLSI DESIGN
1998, Vol. 7, No. 3, pp. 225-242
Reprints available directly from the publisher
Photocopying permitted by license only

(C) 1997 OPA (Overseas Publishers Association) Amsterdam B.V.
Published under license under the Gordon and Breach Science Publishers imprint.

Printed in India.

Power Analysis of a 32-bit Embedded Microcontroller

VIVEK TIWARI a,* and MIKE TIEN-CHIEN LEE b

a Dept. of Electrical Engineering, Princeton University, Princeton, NJ 08544; b Fujitsu Laboratories of America, 77 Rio Robles, San Jose, CA 95134

A new approach for power analysis of microprocessors has recently been proposed [14]. The idea is to look at the power consumption in a microprocessor from the point of view of the actual software executing on the processor. The basic component of this approach is a measurement-based, instruction-level power analysis technique. The technique allows for the development of an instruction-level power model for the given processor, which can be used to evaluate software in terms of the power consumption, and for exploring the optimization of software for lower power. This paper describes the application of this technique for a comprehensive instruction-level power analysis of a commercial 32-bit RISC-based embedded microcontroller. The salient results of the analysis and the basic instruction-level power model are described. Interesting observations and insights based on the results are also presented. Such an instruction-level power analysis can provide cues as to what optimizations in the micro-architecture design of the processor would lead to the most effective power savings in actual software applications. Wherever the results indicate such optimizations, they have been discussed. Furthermore, ideas for low power software design, as suggested by the results, are described in this paper as well.

Keywords: Embedded software, embedded systems, low power design, low power software, power estimation, power optimization

1. INTRODUCTION

A very large fraction of the applications in all segments of the electronics industry are being implemented as embedded computer systems. The basic characteristic of these systems is the presence of both a hardware and a software component. The hardware component consists of application-specific circuits, while the software component consists of application-specific software running on dedicated microprocessors. The role of the software component is projected to grow in the future. A large number of embedded computing applications are power critical, i.e., power constraints form an important part of the design specification. In light of the growing role of the software component, it is imperative to consider the power consumption of this component when analyzing the total system power consumption.

*Corresponding author.

In spite of its importance, very little previous work exists for analyzing power consumption from the point of view of software. Some attempts in this direction are based on architectural-level analysis of microprocessors. The underlying idea is to assign power costs to architectural modules such as datapath execution units, control units, and memory elements. In [10, 15] the power cost of a module is given by the estimated average capacitance that would switch when the given module is activated. More sophisticated statistical power models are used in [5, 6]. Activity factors for the modules are then obtained from functional simulation over typical input streams. Power costs are assigned to individual modules in isolation from one another. Thus, these methods ignore the correlations between the activities of different modules during execution of real programs.

Since the above techniques work at higher levels of abstraction, the power estimates they provide are not very accurate. For greater accuracy, one has to use power analysis tools that work at lower levels of the design: physical, circuit, or switch level [8, 9, 4]. However, these tools are slow and impractical for analyzing the total power consumption of a microprocessor as it executes entire programs. These tools also require the availability of lower level circuit details of microprocessors, something that most embedded system designers do not have access to. This is also the reason why the power contribution of software and the potential for power reduction through software modification have either been overlooked or are not fully understood.

A recent work [14] overcomes these deficiencies by developing a methodology that provides a means for analyzing the power consumption of a given microprocessor as it executes a given program. The idea is to use a measurement-based analysis technique for developing and validating an instruction level power model for any given processor. Such a model can then be provided by the processor vendors for both off-the-shelf processors and embedded cores. This can then be used to evaluate embedded software, much as gate level power models have been used to evaluate logic designs. The ability to evaluate software in terms of the power metric helps in verifying if a design meets its specified power constraints. In addition, it can also be used to search the design space in software power optimization [13].

The initial work in this direction has been in the context of the Intel 486DX2, a general-purpose CISC architecture. This paper describes the application of this power analysis methodology for the Fujitsu SPARClite MB86934, a 32-bit RISC microcontroller [1-3] targeted for embedded applications. A comprehensive power analysis of this processor has been performed and an instruction level power model has been developed. The salient results of the analysis are described here. Interesting observations and insights based on the results are also presented. The successful application of the analysis methodology for two different processors provides validation for the general applicability of this methodology. This is reinforced by a recent work based on the application of this analysis technique for a specialized embedded DSP processor [7].

2. PROCESSOR OVERVIEW

The SPARClite MB86934 is a SPARC-based microprocessor optimized for use in embedded applications. A full description of the SPARClite family and of the MB86934 (referred to as the ’934 from here on) is available from other references [1-3]. However, some of the features that are relevant for the remainder of this paper are briefly mentioned below:

Technology: 0.5 micron, 3 level metal CMOS technology. There are three separate power pins for the on-chip phase locked loop (PLL), internal logic, and I/O, respectively. All power supply connections can be at 3.3 V.

On-chip floating point unit (FPU): A high-performance on-chip FPU executes single/double precision operations.

On-chip FIFOs: FPU instructions can get their operands from a 32-bit FPU register file, or 6 on-chip FIFOs, which are fed directly from main memory through DMA.

On-chip caches: An 8 K, 32-byte line, instruction cache, and a 2 K, 16-byte line, data cache. Both caches are 2-way set associative and employ a write-through policy and an LRU replacement algorithm. Cache entries can be locked and the caches can also be disabled.

Large integer register file: The integer register file consists of 136 32-bit registers, which are organized into 8 overlapping windows.

Software controlled power management: A software mechanism is provided to disable the clocks to various functional units in order to conserve power.

3. EXPERIMENTAL METHOD

The instruction level power analysis technique relies on the ability to measure the average current drawn by the processor. This is motivated by the formulas for the power and energy cost of a program. The average power, $P$, consumed by a microprocessor while running a certain program is given by $P = I \times V_{CC}$, where $I$ is the average current and $V_{CC}$ is the supply voltage. Since power is the rate at which energy is consumed, the energy, $E$, consumed by a program is given by $E = P \times T$, where $T$ is the execution time of the program. This in turn is given by $T = N \times \tau$, where $N$ is the number of clock cycles taken by the program and $\tau$ is the clock period.

For the experimental setup used in this study, $V_{CC}$ was 3.3 V and $\tau$ was 50 ns, corresponding to the 20 MHz system clock. Thus, if the average current for an instruction sequence is $I$ amperes, and the number of cycles it takes to execute is $N$, the energy cost of the sequence is given by $E = I \times V_{CC} \times N \times \tau$, which equals $(16.5 \times 10^{-8} \times I \times N)$ joules. Throughout the rest of the paper, in order to specify the energy cost of an instruction (or instruction sequence), the average current will be specified. The number of cycles will either be explicitly specified, or will be clear from the context.
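The energy relation above can be turned into a small helper. The following is a minimal sketch, not part of the original study: the function name and the example current are illustrative assumptions, while the supply voltage and clock period are the values stated for the experimental setup.

```python
import math

V_CC = 3.3    # supply voltage in volts, as stated for the experimental setup
TAU = 50e-9   # clock period in seconds (20 MHz system clock)

def energy_joules(avg_current_a, cycles, v_cc=V_CC, tau=TAU):
    """Energy of an instruction sequence: E = I * Vcc * N * tau.

    avg_current_a is the average current in amperes and cycles is the
    number of clock cycles the sequence takes (hypothetical helper).
    """
    return avg_current_a * v_cc * cycles * tau

# Illustrative input: a one-cycle instruction drawing 177 mA on average.
single = energy_joules(0.177, 1)
print(f"energy per instruction: {single:.3e} J")

# With I = 1 A and N = 1 the result reduces to the constant 16.5e-8 J.
print(math.isclose(energy_joules(1.0, 1), 16.5e-8))
```

Because the voltage and clock period are fixed for all experiments, comparing two instruction sequences only requires their average currents and cycle counts, which is why the paper reports costs as currents.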

3.1. Current Measurement

From the above discussion it is evident that to measure the energy cost of a program, the average current drawn by the CPU during the execution of the program has to be measured. The measurement method employed was based on the test and measurement capabilities of a commercial IC tester. The program under consideration was first simulated on a Verilog model of the CPU. This produces a trace file consisting of vectors that specify the exact logic values that would appear on the pins of the CPU for each half-cycle during the execution of the program. The tester then applies the voltage levels specified by the vectors on each input pin of the CPU. This recreates the same electrical environment that the CPU would see on a real board. The current drawn by the CPU is monitored by the tester using an internal digital ammeter. Now, the current drawn by the CPU varies over the execution of a program, and so the ammeter may not yield a steady visual reading. To overcome this, the method used in the case of the 486DX2 is applied [14]. The programs being considered are put in infinite loops. Thus, the resulting current waveforms are periodic. The ammeter averages current over a window of time (about 100 ms) for the purpose of analog to digital conversion. If the period of the current waveform is much smaller than this window, a stable reading is obtained.
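The stability argument can be illustrated numerically. The sketch below is an assumption-laden toy model, not the tester's actual behavior: the current is modeled as a square wave with made-up high and low levels, the ammeter as a plain average over a 100 ms window sampled every microsecond, and all function names are hypothetical.

```python
def window_average_ma(period_us, window_us, start_us, high_ma=200.0, low_ma=150.0):
    """Average a square-wave current over one ammeter window, 1 us per sample."""
    total = 0.0
    for t in range(start_us, start_us + window_us):
        # First half of each loop period draws high_ma, second half low_ma.
        total += high_ma if (t % period_us) < period_us // 2 else low_ma
    return total / window_us

def reading_spread_ma(period_us, window_us=100_000):
    """Spread of readings when the window starts at four different phases."""
    starts = [k * period_us // 4 for k in range(4)]
    readings = [window_average_ma(period_us, window_us, s) for s in starts]
    return max(readings) - min(readings)

short_loop = reading_spread_ma(700)     # loop period much shorter than the window
long_loop = reading_spread_ma(80_000)   # loop period comparable to the window
print(f"spread, 0.7 ms loop: {short_loop:.3f} mA")
print(f"spread, 80 ms loop:  {long_loop:.3f} mA")
```

In this toy model the reading barely depends on where the window starts when the loop period is far below 100 ms, whereas a loop period close to the window length makes the reading depend strongly on phase.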

3.2. Instruction Level Power Analysis

The above method makes it feasible to measure the power cost of a given program. By designing special programs and measuring their power cost, it is possible to obtain the basic information needed for an instruction level power analysis of the processor, based on the following hypothesis. Consider a program consisting of several instances of a certain instruction. Since the CPU is executing the same instruction over and over again, it seems intuitive that the entire activity in the CPU can be attributed to that instruction. The power cost of the CPU for the program can be considered as the basic power cost of the given instruction. In real programs there may be other effects involving more than one instruction that can impact the power cost, e.g., the effect of circuit state, pipeline stalls, and cache misses. Designing programs where these effects occur repeatedly similarly provides a way of assigning power costs to these effects.

This hypothesis has been validated for the Intel 486DX2. It has also been found to be applicable for the ’934, as the subsequent sections will show. The instruction level power model that has been developed for the two processors has the same basic components. The first of these is the set of base costs of instructions. The base cost of a given instruction is obtained by creating a program consisting of several instances of the instruction executing in a loop. The other component of the power model is the power cost of inter-instruction effects. The first of these is the effect of change of circuit state between consecutive instructions. During determination of the base costs, the same instruction is executed again and again. It can be expected that the change in circuit state between consecutive instructions will be less here than for the case in which consecutive instructions differ. The quantity circuit state overhead is introduced to account for this effect. This effect is illustrated in some of the later sections and is discussed in detail in Section 10. The overall instruction-level power model and its use in estimating the power consumption of programs is described in Section 11.

4. POWER ANALYSIS OF THE ’934

In the subsequent sections, the specifics of the instruction level power model for the ’934 are presented. Other results that highlight the characteristics of the power consumption, as it relates to instructions and software, are also reported. For the sake of clarity, the experiments are divided into several categories, each of which is treated in a separate section. The results include the power costs of the important instructions, and examples that illustrate the power model that is used for the estimation of power consumption of instruction sequences. The power costs of external memory accesses and the effect of the caches on the overall power cost are also explored. Results are also provided for the impact of software controlled power management on the power cost of instructions. The salient observations and interesting insights based on the results of each section are briefly discussed following the results. One of the benefits of an instruction-level analysis is that it provides cues as to what optimizations in the micro-architecture design would lead to the most effective power savings in actual software applications. Wherever the results indicate such optimizations, they have been discussed. Furthermore, ideas for low power software design, as suggested by the results, are also described.

The following observations are valid for all experiments reported in this paper. Repeated runs of an experiment at different times resulted in only a very small variation in the observed average current values. The variation was in the range ±1 mA. The current drawn by the power pin connected to the on-chip PLL was very small and below the measuring range of the tester. The current drawn by the power pins connected to the internal logic and I/O circuitry is denoted by I1 and I2, respectively. The symbol '&' used in the tables below for an instruction pair 'i & j' denotes an instruction sequence where instructions i and j are executed alternately. Instructions are specified using the standard SPARC assembly language syntax [2]. In particular, %xi refers to the i-th register of type x, [%xi] refers to the contents of the memory location that is addressed through register %xi, and the destination operand is always specified as the rightmost operand in an instruction.

5. INTEGER ALU INSTRUCTIONS: CACHES ENABLED

Table I shows the base costs for some integer instructions. The caches, prefetch, and write buffers are enabled. These modules can be enabled or disabled by writing into a specific system control register. The base costs are shown in terms of the I1 and I2 currents. All the instructions shown execute in one cycle, except entry 20, which executes in 2 cycles.

TABLE I  Sample integer ALU instructions: caches enabled

No.  Instruction            Register contents         I1 (mA)  I2 (mA)
1    or %g0, 0, %l0                                   177      21
2    or %g0, 0xfff, %l0                               174.5    21
3    or %g0, %i0, %l0       (%i0=0)                   177.5    21
4    or %g0, %i0, %l0       (%i0=0xfff)               173.5    21
5    add %i0, %o0, %l0      (%i0=0, %o0=0)            178      21
6    add %i0, %o0, %l0      (%i0=0, %o0=0xfff)        174      21
7    add %i0, %o0, %l0      (%i0=0xfff, %o0=0)        174      21
8    add %i0, %o0, %l0      (%i0=0xfff, %o0=0xfff)    173      21
9    add %o0, %i1, %l2      (%o0=0, %i1=0x555)        174.5    21
10   srl %i0, %o0, %l0      (%i0=0xfff, %o0=0x1)      179      21
11   srl %i0, 1, %l0        (%i0=0xfff)               176      21
12   srl %i1, %o5, %g3      (%i1=0x555, %o5=1)        174.5    21
13   or %g0, %r16, %i0      (%r16=0)                  178      21
14   orcc %i1, %o0, %l1     (%i1=0x555, %o0=0xaaa)    173      21
15   subx %g0, %r16, %i0    (%r16=0)                  172      21
16   xor %g0, %r16, %i0     (%r16=0)                  177.5    21
17   xor %g0, %r17, %i0     (%r17=0)                  176      21
18   andcc %g1, 0xaaa, %l0  (%g1=0x555)               179      21
19   sll %o4, 0x7, %o6      (%o4=0xf0)                173.5    21
20   umul %i0, 0x2, %o3     (%i0=0xaaa)               174.5    21
21   mul %g0, %r29, %r27    (%r29=0)                  177      21

Observations and Comments

Integer ALU instructions tend to have very similar costs, as shown in the above table. They vary in the range of 170-180 mA in terms of I1 current. The I2 current is mostly stable around 21 mA. The reason for the low I2 current is that the caches are enabled, and thus, after one iteration of the loop, the instructions are always available in the instruction cache, and there is no traffic on the I/O pins.

The I1 current shows a limited variation depending upon the actual value of the data operands used. Variation due to the use of different registers is not significant. Entries 1 and 2 show the cost for an OR instruction for two different immediate operand values. Entries 3 and 4 show the costs for the OR instruction when only register operands are used, but the content of one of the registers is different. Entries 5 to 9 show the costs for an ADD instruction for different combinations of the operands. There seems to be a correlation between the number of 1's in the binary representation of operands and the base cost: the more 1's, the lower the cost. The reason for the correlation has to do with the underlying circuit style used for implementing the datapath modules and busses. In any case, the overall range of variation is very limited, and thus the use of average base costs for instructions should suffice for program energy estimation purposes. This in fact is the only option in cases where the exact value of operands cannot be determined until runtime.
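The operand trend can be checked directly against the ADD entries of Table I. The following is a minimal sketch: the helper name is hypothetical, and the currents are the I1 values read from entries 5 to 8 of the table.

```python
# (value of %i0, value of %o0, I1 in mA) for Table I entries 5 to 8 (ADD).
add_entries = [
    (0x000, 0x000, 178.0),
    (0x000, 0xfff, 174.0),
    (0xfff, 0x000, 174.0),
    (0xfff, 0xfff, 173.0),
]

def ones(value):
    """Number of 1 bits in the binary representation of value."""
    return bin(value).count("1")

# Pair the total number of 1 bits across both operands with the measured current.
rows = sorted((ones(a) + ones(b), current) for a, b, current in add_entries)
for n_ones, current in rows:
    print(f"{n_ones:2d} ones -> {current} mA")

# The trend holds if the current never rises as the number of 1 bits grows.
currents = [current for _, current in rows]
non_increasing = all(x >= y for x, y in zip(currents, currents[1:]))
print("current non-increasing with number of 1 bits:", non_increasing)
```

The spread across these four entries is only 5 mA, which supports using an average base cost when operand values are unknown.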

An interesting observation following from the above results is that the cost of the ALU instructions does not seem to depend much on the ALU operation that is being performed. The cost of an OR, SHIFT, ADD, or MULTIPLY all seems to be about the same. It may well be the case that the differences in the circuit activity for these instructions are small relative to the circuit activity common to all instructions. Thus, these differences are not reflected in the comparisons of the overall current cost. Nevertheless, the almost complete lack of variation is somewhat counter-intuitive. For instance, it is expected that the logic for an OR should be much less than that for an ADD, thus leading to some variation in the overall current drawn for these instructions.

The reason for the similarity of the costs most likely has to do with the way ALUs are traditionally designed. All the different ALU sub-functions are fed by a common bank of inputs, and the outputs of the appropriate sub-function are selected by a multiplexor structure. Now, in any given cycle, the results of only one sub-function are needed. Thus, the circuit activity in the other sub-functions is a waste of power. The design can be modified for low power by extending the principles of automatic power management. If the inputs of the sub-functions that are not needed are prevented from switching, the power consumed in these sub-functions can be saved. This observation motivates the concept of guarded evaluation, which has been explored in detail in another reference [12].

6. INTEGER ALU INSTRUCTIONS: CACHES DISABLED

Table II shows some of the same instructions as the previous table. However, in this case, the on-chip caches have been disabled. Prefetch and write buffers are also disabled. The number of memory wait states is zero. Column 4 shows the I1 current in this case. The I2 current was 134 mA for all entries. Column 5 shows the I1 current for the case when the caches are enabled. The I2 current in this case was 21 mA.

Observations and Comments

Since the instruction cache is disabled, every instruction access goes to the external I/O pins. The I2 current is therefore higher than when caches are enabled. The I1 current (internal logic current) is also about 10 mA higher. Entries 3 and 6 show what happens when different instructions are executed together. This will be discussed in greater detail in Section 10.

In terms of overall current, disabling the instruction cache leads to a total CPU current increase of about 123 (= 10 + (134 - 21)) mA, i.e., about 62%. However, when the cache is disabled in the ’934, every instruction fetch takes two cycles, even for a zero wait state system. Thus, in terms of energy, disabling the instruction cache leads to at least about a 124% (= 2 x 62%) increase in the energy consumption. This points to two things:

Accessing the cache is much more energy efficient than accessing external memory. Thus, attempts to increase the cache hit rate through software modifications will be very beneficial. It is further indicated that attempts to increase the hit rate through architectural transformations may also help reduce the overall energy consumption.

In certain embedded applications, the designer may choose to disable the caches. This is usually done to improve the performance predictability for real-time systems. However, this will lead to a penalty in terms of the system energy consumption, and thus, the battery life. This fact has to be understood and weighed when deciding whether the caches should be disabled.

TABLE II  Integer ALU instructions: caches disabled vs. enabled

No.  Instruction          Register contents       I1 disabled (mA)  I1 enabled (mA)
1    or %g0, 0, %l0                               187.5             177
2    or %g0, 0xfff, %l0                           184               174.5
3    1 & 2                                        196               192
4    or %g0, %i0, %l0     (%i0=0)                 188               177.5
5    or %g0, %i0, %l0     (%i0=0xfff)             184               173.5
6    4 & 5                                        192               187.5
7    srl %i0, %o0, %l0    (%i0=0xfff, %o0=0x1)    188.5             179
8    srl %i0, 1, %l0      (%i0=0xfff)             184               176
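The cache-disabled comparison in this section can be reproduced with a few lines of arithmetic. This is a rough sketch under stated assumptions: it uses the currents for entry 1 of Table II together with the I2 values quoted in the text, and it treats the two-cycle instruction fetch as simply doubling the cycle count per instruction; the variable names are illustrative.

```python
# Currents in mA for "or %g0, 0, %l0" (entry 1 of Table II) and the quoted I2 values.
I1_ENABLED, I2_ENABLED = 177.0, 21.0
I1_DISABLED, I2_DISABLED = 187.5, 134.0

# Assumption of this sketch: one cycle per instruction with the cache enabled,
# two cycles per instruction with the cache disabled (zero wait states).
CYCLES_ENABLED, CYCLES_DISABLED = 1, 2

total_enabled = I1_ENABLED + I2_ENABLED      # total CPU current, caches enabled
total_disabled = I1_DISABLED + I2_DISABLED   # total CPU current, caches disabled

# Relative increase in average current.
current_increase = total_disabled / total_enabled - 1.0

# Energy per instruction scales with current times cycles (Vcc and tau are fixed).
energy_ratio = (total_disabled * CYCLES_DISABLED) / (total_enabled * CYCLES_ENABLED)

print(f"current increase: {current_increase:.1%}")
print(f"energy ratio (disabled / enabled): {energy_ratio:.2f}")
```

Under these assumptions the current rises by roughly 62% and the energy per instruction roughly triples, which is consistent with the text's statement that the energy increase is at least about 124%.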

7. LOAD AND STORE INSTRUCTIONS: CACHES ENABLED AND LOCKED

Table III shows the cost of some instructions that reference memory. Since the ’934 is a RISC, load-store machine, the only instructions that explicitly reference memory are the loads and the stores. The results are for the specific case when the caches are active and the entries in the data cache are locked. This implies that every data access is a cache hit. In addition, since the cache entries are locked, the store (write) instructions also do not go out to external memory. Note that the ’934 has a write-through cache, and thus in the normal case, each data write also goes out to the external bus. Since there is no traffic on the I/O pins, the I2 current is low. Each instruction also executes in one cycle in this case.

TABLE III  Load and store instructions: caches enabled and locked

No.  Instruction        Register contents         I1 (mA)  I2 (mA)
1    ld [0x0], %i0      (%i0=0)                   191.5    21
2    ld [0xffc], %i0    (%i0=0)                   187      21
3    ld [%l0], %i0      (%i0=0, %l0=0)            192      21
4    ld [%l0], %i0      (%i0=0xfff, %l0=0)        189.5    21
5    ld [%l0], %i0      (%i0=0xffffff, %l0=0)     187.5    21
6    ld [%l0], %i0      (%i0=0xffffffff, %l0=0)   185      21
7    ld [%l0], %i0      (%i0=0, %l0=0xffc)        191      21
8    ld [%l0], %i0      (%i0=0, %l0=0xfffffc)     188      21
9    ld [%l0], %i0      (%i0=0, %l0=0xfffffffc)   185      21
10   st %i0, [0x0]      (%i0=0)                   173      21
11   st %i0, [0xffc]    (%i0=0)                   169      21
12   st %i0, [%l0]      (%i0=0, %l0=0)            175      21
13   st %i0, [%l0]      (%i0=0xfff, %l0=0)        173.5    21
14   st %i0, [%l0]      (%i0=0, %l0=0xffc)        172      21
15   ldub [%l0], %i5    (%i5=0xaaa, %l0=0)        192.5    21
16   3 & 4                                        206      21
17   3 & 5                                        213      21
18   3 & 6                                        216      21
19   3 & 7                                        202.5    21
20   3 & 8                                        207      21
21   3 & 9                                        211      21
22   12 & 13                                      185      21
23   12 & 14                                      183      21

Observations and Comments

Entries 1 and 2 are direct loads and entries 10 and 11 are direct stores. The rest of the instructions utilize the indirect addressing modes. The results indicate that there is not much difference between these two addressing modes in terms of base current. Entries 3 to 6 show the variation in the cost of a load for a fixed address but differing data operands. Entries 3, 7, 8, and 9 show the variation for a fixed data operand but differing addresses. Entries 12 and 13, and 12 and 14, show the corresponding variation in the case of stores. The general trend points to a correlation between base cost and the number of 1's in the binary representation of the data operand and the memory address. This is similar to what was seen in Section 5: the more 1's, the lower the cost. The variation in the costs, though, is again limited. Entries 16 to 23 show what happens when different instructions execute together. The data and address registers used for each instruction in the pair were different, but the register contents were the same as shown in the individual instruction entries. Entry 16 is for the case when the instructions in entries 3 and 4 execute alternately. The current is higher than the average of the two base costs. This is due to the effect of circuit state overhead. 12 data operand bits flip between entries 3 and 4. Entries 17 and 18 show the results for greater numbers of data bit flips. Entries 19 to 21 show the results when the address bits flip between adjacent instructions. The results indicate a positive correlation between the number of bit flips and the increased effect of circuit state.

The results also lead to the following interesting observations:

A comparison between Tables I and III shows that cache accesses are not much more costly than register accesses. Cache reads are about 10 mA more costly, and cache writes are about the same cost as register accesses. Since both cache and register accesses take one cycle, the energy comparison shows the same relation. This observation is in stark contrast to what was observed in the case of the Intel 486DX2, where cache accesses were much more costly than register accesses. The reason for the similarity in the cost of cache accesses and register accesses in the ’934 is most likely the large size of its register file. The ’934 is a RISC, load-store architecture, and it is characteristic for this architectural style to use a large number of registers. The register file has 136 registers. In addition, it is multiported and windowed. In contrast, the 486DX2 has a simple register file with only 8 registers.

This observation illustrates an interesting CISC vs. RISC trade-off with regard to power. On one hand, the availability of a larger number of registers can help reduce the use of memory operands, leading to power reductions. But on the other hand, the larger register file causes each register access itself to be costlier.

The data also points to the fact that micro-architectural or circuit transformations to optimize the register file for low power will be very beneficial in terms of overall power reductions. The load-store design of the ’934 involves very heavy usage of the registers, and a lower power cost of accessing registers will translate into power reductions for all programs.

It should be noted that the use of memory operands does have a high cost even in the ’934, due to the possibility of cache misses. Also, if the cache is unlocked, stores will incur additional cost in terms of I2 current (as shown in the next section), and memory system current. Thus, the use of memory operands should certainly be avoided. This also points out that the cache locking feature should be exploited as far as possible, for applications where energy consumption is a design constraint.

TABLE IV  Store instructions: caches enabled and unlocked

No.  Instruction      Register contents         I1 (mA)  I2 (mA)
1    st %i0, [%l0]    (%i0=0, %l0=0)            198      148
2    st %i0, [%l0]    (%i0=0, %l0=0xffc)        191      115
3    st %i0, [%l0]    (%i0=0, %l0=0xfffffc)     185      71
4    st %i0, [%l0]    (%i0=0, %l0=0xfffffffc)   181.5    46
5    1 & 2                                      198      137
6    1 & 3                                      199      116
7    1 & 4                                      200      106
8    st %i0, [%l0]    (%i0=0xfff, %l0=0)        193      148
9    st %i0, [%l0]    (%i0=0xffffff, %l0=0)     191      150
10   st %i0, [%l0]    (%i0=0xffffffff, %l0=0)   189      150
11   1 & 8                                      203      173
12   1 & 9                                      207      193
13   1 & 10                                     211      206.5

8. STORE INSTRUCTIONS: CACHES ENABLED AND UNLOCKED

Table IV shows the costs of some store instructions when the caches are enabled but are unlocked. Since the data cache is write-through, all the stores also reference the external main memory. The number of memory wait states is zero. However, the design of the ’934 imposes an extra cycle for every bus transaction. During this cycle the bus is idle.

Observations and Comments

Most typical applications do not lock the data cache. Thus, the stores in these applications will go out to the external bus, leading to higher I2 current, as shown in the table. This table, therefore, reflects the more typical cost of memory writes. Note that there will also be an additional system energy penalty due to the current being drawn by the external memory.

Entries 1, 8, 9 and 10 show the variation in the cost of the stores for a fixed address but varying data. There is a minor decrease in both I1 and I2 current for increasing number of 1's. Entries 11 to 13 consider instruction sequences where different stores alternate. For example, entry 11 shows the cost for a sequence consisting of the instructions in entries 1 and 8 appearing in succession. The I1 and I2 currents are greater than the average of the current costs for the individual instructions. This is another illustration of the effect of circuit state overhead. The I1 overhead represents the effect of the circuit state in the internal logic circuits, while the I2 overhead represents the effect of switching on the data pins. Entry 11 involves 12 bit flips at the data lines, while entries 12 and 13 involve 24 and 32 bit flips, respectively. As expected, a greater number of bit flips results in greater current. The increase in current is greater in the case of I2 current. This too is expected, since the I/O pads typically involve larger capacitive loads.

Entries 1 to 4 show the variation in the cost of stores for a fixed data value but varying addresses. The I1 current decreases with an increase in the number of 1’s in the binary representation of the address. The I2 current also decreases, and far more drastically. For example, consider entries 1 and 4. Entry 1 has no 1’s in the address, and entry 4 has 30 1’s. The difference in the I2 current is 102 mA. This translates into about 3.3 mA for each occurrence of "0" in the binary representation of the address. Comparisons between the other entries also yield the same result. This observation seems strange, since if the same address is being put on the bus for every instruction, the address pins should not switch. However, the address pins do switch, for the following reason. Every memory transaction involves an extra cycle during which the bus is idle. During these bus idle cycles, the ’934 pulls up the address pins, i.e., the pins go to the logical value 1. Thus, even if back to back store instructions use the same address, there is an intervening cycle when the address pins are all 1’s. This means that the pins corresponding to the address bits that are 0 will switch each time. More 1’s in the address value means less switching, and thus lower I2 current.
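The relation between address bits and I2 current can be sketched as a simple linear estimate. This is only an illustration: the 3.3 mA figure is the approximate value quoted above, while the 32-bit address width, the strictly linear model, and the function names are assumptions made for the example.

```python
# Sketch: estimate the change in I2 caused by address bits that are 0.
# MA_PER_ZERO_BIT is the approximate figure quoted in the text; the linear
# model and the 32-bit address width are assumptions of this example.
ADDRESS_BITS = 32
MA_PER_ZERO_BIT = 3.3


def zero_bits(address: int) -> int:
    """Count the 0 bits in a 32-bit address."""
    masked = address & ((1 << ADDRESS_BITS) - 1)
    return ADDRESS_BITS - bin(masked).count("1")


def estimated_i2_delta_ma(addr_a: int, addr_b: int) -> float:
    """Estimated extra I2 (mA) when storing to addr_a instead of addr_b."""
    return MA_PER_ZERO_BIT * (zero_bits(addr_a) - zero_bits(addr_b))


if __name__ == "__main__":
    # Addresses of entries 1 and 4: the measured I2 difference is 102 mA.
    delta = estimated_i2_delta_ma(0x0, 0xFFFFFFFC)
    print(f"estimated I2 difference: {delta:.1f} mA")
```

For the two addresses of entries 1 and 4 the estimate is close to, but not identical with, the measured 102 mA, which is consistent with the "about 3.3 mA" wording.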

Entries 5 to 7 show another illustration of this effect. The I2 value for a pair of stores is about the same as the average of the I2 values of the individual stores. There is no circuit state overhead, because the intervening bus idle cycle, in which the address pins are pulled up, isolates the stores from each other.

The above results lead to the following observations:

• The occurrence of 0’s in the address values means greater current cost, on the order of 3.3 mA of I2 current for each occurrence of a 0. This suggests that if data and instructions are stored at the higher end of memory, the program energy cost may be reduced. The reason is that, on average, addresses will then have fewer 0’s in them. The power reduction can potentially be very significant.

• It should be noted that the higher current cost of 0’s in the address is a manifestation of the effect of circuit state (in other words, switching) on the address pins. Now, most real systems utilize wait states, since memory access times are usually slower than the CPU clock period. If the number of wait states increases, there will be a greater number of bus idle cycles. We know that during the bus idle cycles, the address pins are at a constant 1. Thus, more wait states means that on average, the address pins will switch less often, leading to a lesser impact on the overall system energy cost. This is in line with the observation in the case of the 486DX2, where it was noted that for real systems, switching on the external pins had limited impact on the overall energy cost of programs.

• Switching on the data pins also leads to an increase in the current cost, though this increase is limited. While attempts to reduce this cost through software modifications may be beneficial, the benefits are likely to be very modest. This is because of the difficulty in applying these methods in general, and also because the energy impact of the switching itself is limited. This will be more so in the case of real systems that have slow buses and utilize wait states.

• A sequence of back to back stores causes the write buffer to fill up, and leads to extra cycle penalties, referred to as write buffer stalls. The cycle penalties translate into energy penalties. If the memory system is slow and requires wait states, the number of write buffer stalls will increase. Software modifications to decrease these stalls will result in lower power. This can be done by scheduling instructions that don’t require memory transactions between consecutive store instructions. Specific experiments can also assign energy costs to the write buffer stall cycles, as has been done in the case of the 486DX2 [14].

9. FLOATING POINT INSTRUCTIONS

Table V shows the costs for some typical floating point instructions. The caches were enabled and unlocked in all experiments. The FPU is pipelined and Column 4 shows the throughput for each instruction. Column 5 shows the I1 current. The I2 current was 21 mA for all cases. The energy cost of an instruction is proportional to the product of the total current and the number shown in Column 4.
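As a rough illustration of this proportionality, the sketch below computes the current-cycle product and converts it to joules. The supply voltage and clock frequency used here are placeholder assumptions, not values taken from this section, and the function names are hypothetical.

```python
# Sketch: energy is proportional to (I1 + I2) * n.
# VCC_VOLTS and CLOCK_HZ are placeholder assumptions, not measured values.
VCC_VOLTS = 3.3   # assumed supply voltage
CLOCK_HZ = 20e6   # assumed clock frequency


def current_cycles(i1_ma: float, i2_ma: float, cycles: int) -> float:
    """Product of total current (mA) and cycle count; proportional to energy."""
    return (i1_ma + i2_ma) * cycles


def energy_joules(i1_ma, i2_ma, cycles, vcc=VCC_VOLTS, clock_hz=CLOCK_HZ):
    """Energy = Vcc * I * (cycles / f), with the current converted to amperes."""
    return vcc * (i1_ma + i2_ma) * 1e-3 * cycles / clock_hz


if __name__ == "__main__":
    fmuls = current_cycles(174.0, 21.0, 1)    # Table V, entry 21
    fdivs = current_cycles(167.5, 21.0, 13)   # Table V, entry 24
    print(f"fdivs costs about {fdivs / fmuls:.1f} times the energy of fmuls")
```

The ratio between two instructions does not depend on the assumed voltage or frequency, which is why the current-cycle product is sufficient for comparisons.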

Observations and Comments

The results indicate that most instructions that involve the FPU have similar cost. For example, consider entries 1 to 14, and 21 to 23, all of which take one cycle and don’t cause any FPU pipeline interlocks. The dependence of current on the value of operands is almost negligible, and this may have to do with the circuit design of the FPU. In addition, dependence of current on the type of FPU operation is also not exhibited. This may be due to the same reasons as discussed in Section 5. Instructions for loading values into floating point registers (entries 15 to 20) result in costs that are similar to those seen in the case of integer registers. The trend with respect to the current cost and the number of 1’s in the data operands is also similar.

TABLE V Floating point instructions: caches enabled

No.  Instruction           Register contents               n    I1 (mA)
1    fitos %f4, %f0        (%f4=0)                         1    177.5
2    fitos %f4, %f0        (%f4=0xfff)                     1    177.5
3    fmovs %f4, %f0        (%f4=0)                         1    175
4    fmovs %f4, %f0        (%f4=0xff)                      1    175
5    fmovs %f4, %f0        (%f4=0xffff)                    1    174
6    fmovs %f4, %f0        (%f4=0xfffff)                   1    175
7    fitod %f4, %f0        (%f4=0)                         1    178
8    fadds %f8, %f4, %f0   (%f8=0, %f4=0)                  1    175.5
9    fadds %f8, %f4, %f0   (%f8=0xffff, %f4=0xffff)        1    176
10   fadds %f8, %f4, %f0   (%f8=0x555555, %f4=0xaaaaaa)    1    177
11   fadds %f8, %f4, %f0   (%f8=0xffffff, %f4=0xffffff)    1    178
12   faddd %f8, %f4, %f0   (%f8=0, %f4=0)                  1    177
13   faddd %f8, %f4, %f0   (%f8=0xffff, %f4=0xffff)        1    177.5
14   faddd %f8, %f4, %f0   (%f8=0x555555, %f4=0xaaaaaa)    1    177.5
15   ld [0x0], %f8         (%f8=0)                              205
16   ld [0x0], %f8         (%f8=0x4b7ff)                        198
17   ld [0x0], %f8         (%f8=0x4b7fffff)                     193
18   ldd [0x0], %f8        (%f8=0,0)                            214
19   ldd [0x0], %f8        (%f8=0x416fffff, e0000000)           200
20   ldd [0x0], %f8        (%f8=0x4b7fffff, 4b7fffff)           192
21   fmuls %f8, %f4, %f0   (%f8=0, %f4=0)                  1    174
22   fmuls %f8, %f4, %f0   (%f8=0xfff, %f4=0xfff)          1    175
23   fmuls %f8, %f4, %f0   (%f8=0x555555, %f4=0xaaaaaa)    1    175
24   fdivs %f8, %f4, %f0   (%f8=0xaaaaaa, %f4=0x555555)    13   167.5
25   fdivs %f8, %f4, %f0   (%f8=0xffff, %f4=0xffff)        13   168
26   fdivs %f8, %f4, %f0   (%f8=0, %f4=1)                  13   167.5
27   26 & nop                                              13   181.5
28   26 & 4 nop’s                                          13   181.5
29   26 & 12 nop’s                                         13   182
30   26 & add                                              13   179
31   26 & 12 add’s                                         13   177.5
32   fsqrts %f4, %f0       (%f4=0xfe01)                    13   173
33   fsqrts %f4, %f0       (%f4=0xaaaaaa)                  13   173.5
34   fsqrts %f4, %f0       (%f4=0)                         13   174
35   34 & nop                                              13   184
36   34 & 4 nop’s                                          13   184.5
37   34 & 12 nop’s                                         13   185
38   34 & add                                              13   181
39   34 & 12 add’s                                         13   180.5

Entries 24 to 26 show floating point divide instructions, and entries 32 to 34 show square root instructions. The current variation for different operand values is negligible. These instructions take 13 cycles in a particular FPU pipeline stage. This leads to 12 pipeline interlocks, which means that an FPU instruction that immediately follows one of these instructions will have to wait for 12 cycles. However, the integer pipeline may not be held up in most cases, and can continue to execute. Entry 27 shows what happens when a NOP instruction (internally treated as an integer instruction) appears after a divide instruction: the execution of this instruction is hidden within the 12 interlock cycles of the divide. Entries 28 to 31 show other examples in which integer instructions follow a divide instruction, and entries 35 to 39 show the same for the square root instruction. These entries show that the current cost in this case isn’t much more than when no integer instructions are executed in the FPU interlock cycles. This leads to the following insights:

• When integer instructions are executed during the FPU interlock cycles, the current doesn’t increase much beyond what it is when the interlock cycles are completely idle. This suggests that during the FPU interlock cycles, switching activity doesn’t completely stop in the other parts of the CPU. If this activity is useless, then eliminating it can result in power reduction during the interlock cycles. This represents another opportunity where automatic power management or guarded evaluation may be useful.

• The results also show that for the current implementation of the ’934, it is beneficial to execute integer instructions during the FPU interlock cycles. The current cost is not much higher than when the integer instructions are executed independently. Therefore, the decrease in the number of execution cycles translates into actual energy reduction. Execution cycles are reduced, since the cycles required to execute the integer instructions are overlapped with, and thus hidden in, the FPU interlock cycles. Software optimizations to achieve this can therefore be considered as both performance and energy optimizations.
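The benefit of overlapping can be illustrated with a small current-cycle comparison based on entries 26 and 31 of Table V. This is a sketch under stated assumptions: the base I1 of the add (174.5 mA) is borrowed from Table VI, where it was measured with different operands, and each add is assumed to take one cycle when executed on its own.

```python
# Sketch: compare fdivs followed by 12 separate adds against the same adds
# overlapped with the FPU interlock cycles. Units are mA * cycles, which is
# proportional to energy.
I2_MA = 21.0                # I2 reported for all entries of Table V
FDIVS_I1_MA = 167.5         # Table V, entry 26 (13 cycles)
OVERLAPPED_I1_MA = 177.5    # Table V, entry 31: fdivs & 12 adds (13 cycles)
ADD_BASE_I1_MA = 174.5      # assumption: add base cost taken from Table VI
FDIVS_CYCLES = 13
NUM_ADDS = 12               # assumption: one cycle per add when run alone


def current_cycles(i1_ma: float, cycles: int, i2_ma: float = I2_MA) -> float:
    """Total current (mA) multiplied by the number of cycles."""
    return (i1_ma + i2_ma) * cycles


separate = (current_cycles(FDIVS_I1_MA, FDIVS_CYCLES)
            + current_cycles(ADD_BASE_I1_MA, NUM_ADDS))
overlapped = current_cycles(OVERLAPPED_I1_MA, FDIVS_CYCLES)
saving = 1.0 - overlapped / separate

if __name__ == "__main__":
    print(f"separate:   {separate:.1f} mA*cycles")
    print(f"overlapped: {overlapped:.1f} mA*cycles")
    print(f"estimated energy saving: {saving:.0%}")
```

Under these assumptions, almost all of the energy of the 12 integer instructions is hidden, because the overlapped sequence draws only slightly more current over the same 13 cycles.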

10. INTER-INSTRUCTION EFFECTS

The previous sections mainly focused on the base costs of instructions. The base cost of a given instruction is obtained in isolation from other instructions by repeatedly executing the same instruction. However, real programs consist of a sequence of different instructions. Several inter-instruction effects can occur in these mixed sequences. Base costs themselves are not adequate to model the energy cost of these effects. These inter-instruction effects are discussed below.

10.1. Effect of Circuit State Overhead and Instruction Reordering

The effect of circuit state is an inter-instruction effect that has been alluded to in the previous sections. The purpose of Table VI is to illustrate and quantify this effect. Base costs of several instructions as well as the costs for pairs of instructions are shown in the table. Only the I1 current is shown. The I2 current was 21 mA in all cases. For entries with pairs of instructions, Column 4 shows the current for the combined sequence, and Column 5 shows the circuit state overhead. This value is the difference between the actual current cost of the instruction pair and the average of the base costs of the individual instructions.
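The overhead definition above amounts to a one-line calculation, sketched here for single-cycle instructions that alternate in a pair. The function name is hypothetical, and the sample values are taken from Table VI.

```python
def circuit_state_overhead(pair_ma: float, base_a_ma: float, base_b_ma: float) -> float:
    """Measured current of an alternating pair minus the mean of the base costs."""
    return pair_ma - (base_a_ma + base_b_ma) / 2


if __name__ == "__main__":
    # Table VI, entries 1, 2 and 6: two `or` instructions, 1 opcode flip.
    print(circuit_state_overhead(178.0, 177.0, 174.0))
    # Table VI, entries 33, 34 and 35: fmuls paired with nop.
    print(circuit_state_overhead(202.0, 174.0, 176.0))
```

Both results agree with the overhead column of Table VI for these entries (2.5 mA and 27 mA).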

Observations and Comments

The existence of circuit state overhead can be attributed to the fact that each instruction executes in the context of the circuit state set by the previous instruction. The greater the change in the circuit state between instructions, the greater this overhead should be. While the change in the circuit state can result from any part of the processor, the common notion is that it is basically due to the change in the opcodes of adjacent instructions. Entries 6 to 9 in Table VI show this quantitatively with an example. With an increasing number of bit flips in the opcodes of adjacent instructions, the overhead increases. However, Table VI also shows that switching of the opcodes is not the only source of circuit state overhead. For example, entries 18 and 19, and entries 23 and 24, involve almost the same number of opcode flips. However, entries 23 and 24 have a much higher overhead cost.

Table VI, and some of the examples presented in Section 7, as well as in the next section, quantify several instances of circuit state overhead. These and several other examples indicate that the overhead varies between 0 and about 34 mA. The overhead between integer instructions is typically below 20 mA. The overhead between floating point and integer instructions is higher, typically in the range 25-34 mA. The most important aspect of this observation is that the range of variation in this overhead is small, compared to the overall CPU current cost.

A recent idea in the area of software design for low power is to reorder instructions to reduce the power cost of a program [11]. This can be seen as an attempt to reduce the average current cost of a program by minimizing the circuit state overhead. Our experiments based on actual energy measurements on the ’934, however, reveal that this technique does not translate into significant overall energy reduction. The reason is that the circuit state overhead is bounded in a small range and does not show very significant variation. Thus, different instruction schedules will not vary significantly in their current costs.

TABLE VI Effect of circuit state overhead

No.  Instruction            Register contents               I1 (mA)  Ovh. (mA)
1    or %g0, 0, %l0                                         177
2    or %g0, 0x001, %l0                                     174
3    or %g0, 0x00f, %l0                                     174
4    or %g0, 0xff, %l0                                      174.5
5    or %g0, 0xfff, %l0                                     174.5
6    1 & 2                  1 opcode flip                   178      2.5
7    1 & 3                  4 opcode flips                  184.5    9
8    1 & 4                  8 opcode flips                  191      15
9    1 & 5                  12 opcode flips                 192      16
10   or %g0, %i0, %l0       (%i0=0)                         177.5
11   or %g0, %i0, %l0       (%i0=0xfff)                     173.5
12   10 & 11                12 data flips                   187.5    12
13   or %g0, %r16, %i0      (%r16=0)                        177.5
14   or %g0, %r17, %i0      (%r17=0)                        176
15   or %g0, %r15, %i0      (%r15=0)                        175.5
16   or %g0, %r23, %i0      (%r23=0)                        176
17   13 & 14                1 opcode flip                   177.5
18   13 & 16                3 opcode flips                  180      3
19   13 & 15                5 opcode flips                  180.5    4
20   subx %g0, %r16, %i0    (%r16=0)                        172
21   xor %g0, %r16, %i0     (%r16=0)                        177.5
22   xor %g0, %r17, %i0     (%r16=0)                        176
23   20 & 21                4 opcode flips                  191.5    17
24   20 & 22                4 opcode flips                  192      18
25   or %i1, %o0, %l1       (%i1=0x555, %o0=0)              174.5
26   add %o0, %i1, %l2      (%i1=0x555, %o0=0)              174.5
27   or %o5, 0x555, %l4     (%o5=1)                         170
28   srl %i1, %o5, %g3      (%i1=0x555, %o5=1)              174.5
29   25 & 26                                                185      10.5
30   26 & 27                                                182      9.5
31   27 & 28                                                191      19
32   28 & 25                                                184.5    10.5
33   fmuls %f8, %f4, %f0    (%f8=0, %f4=0)                  174
34   nop                                                    176
35   33 & 34                                                202      27
36   andcc %g1, 0xaaa, %l0  (%g1=0x555)                     179
37   33 & 36                                                210      33.5
38   ld [0x555], %o5                                        192.5
39   sll %o4, 0x7, %o6      (%o4=0x707)                     173.5
40   38 & 39                                                202.5    19.5
41   or %g0, 0xff, %l0                                      176
42   33 & 41                                                207      32
43   fadd %f10, %f12, %f14  (%f10=0x123456, %f12=0xaaaaaa)  176
44   38 & 43                                                212      28

Table VII illustrates this with an example. Entries 1 to 7 show a set of instructions. Entries a to e show the current cost of different sequences consisting of these instructions. The order of the instructions in the sequences is shown in Column 2 and the I1 current cost is shown in Column 3 (I2 current was 21 mA in all cases). The maximum variation in the current is only 4 mA, or about 1.9%. Since all the sequences take the same number of cycles, the energy variation is also 1.9%.

Similar observations were also made in the case of the Intel 486DX2 [13]. It appears that this is characteristic of large, complex CPUs, where a major part of the circuit activity is common to all instructions, e.g., instruction fetch, pipeline control, clocks, etc. However, it may be the case that instruction reordering can result in significant variation in smaller processors (as seen in the case of a DSP [7]), and processors with complex power management features. This bears further investigation.


TABLE VII Example of instruction reordering

No.  Instruction             Register contents
1    fmuls %f8, %f4, %f0     (%f8=0, %f4=0)
2    andcc %g1, 0xaaa, %l0   (%g1=0x555)
3    faddd %f10, %f12, %f14  (%f10=0x123456, %f12=0xaaaaaa)
4    ld [0x555], %o5
5    sll %o4, 0x7, %o6       (%o4=0x707)
6    sub %i3, %i4, %i5       (%i3=0x7f, %i4=0x44)
7    or %g0, 0xff, %l0

No.  Sequence              I1 (mA)
a    1, 2, 3, 4, 5, 6, 7   206.5
b    1, 3, 5, 7, 2, 4, 6   203
c    1, 4, 7, 2, 5, 3, 6   205
d    2, 3, 7, 6, 1, 5, 4   207
e    5, 3, 1, 4, 6, 7, 2   202.5

10.2. Pipeline Stalls and Cache Misses

The ’934 is a pipelined processor. Resource constraints in the pipeline can lead to pipeline stalls that affect the energy cost of programs. Examples are prefetch buffer stalls and write buffer stalls. Base costs of instructions do not reflect the impact of these stalls. Therefore, current measurement experiments are conducted to isolate the power cost of these stalls. The basic idea is to write programs where these stalls occur repeatedly. This is illustrated in greater detail in the context of the Intel 486DX2 in [14]. The energy cost of each kind of stall is proportional to the product of the average current during the stall and the number of cycles involved in the stall, as given by the formula in Section 3.

Another inter-instruction effect that is not reflected in the base costs is the impact of cache misses. Base costs are determined in the context of cache hits in both the instruction and data caches. Cache misses are modelled separately as an energy overhead. This is analogous to what is done for estimating execution time, where a performance penalty is incurred for each cache miss. The results in Section 6 give an idea of the power cost of a cache miss. These costs can be further isolated by conducting experiments with programs where cache misses occur repeatedly. The energy penalty for a cache miss is proportional to the product of the average current during a cache miss and the number of cycles for which the CPU is stalled.

11. INSTRUCTION LEVEL POWER MODEL

The results of the previous section quantify the parameters of the instruction-level energy model of the ’934. It consists of base energy costs of individual instructions, and energy costs of effects that involve more than one instruction, e.g., effects of circuit state, pipeline, and write buffer stalls. This model forms the basis for estimating the energy cost of given programs. The following simple example illustrates the validity of this model.

11.1. Program Energy Estimation Example

TABLE VIII Program energy estimation example

No.  Instruction                Register contents     I1 (mA)  I2 (mA)
1    st %i0, [%l0]              (%i0=0xaaa, %l0=0)    176      21
2    orcc %i1, %o0, %l1         (%i1=0x555, %o0=0)    173      21
3    ldub [%l0], %i5            (%i5=0xaaa, %l0=0)    192.5    21
4    umul %i0, 0x2, %o3         (%i0=0xaaa)           174.5    21
5    1 & 2                                            203      21
6    2 & 3                                            196.5    21
7    3 & 4                                            192.5    21
8    4 & 1                                            187.5    21
9    Measured current                                 196      21
10   Estimate using base costs                        178.1    21
11   Overhead estimate                                16       0
12   Final estimate                                   194      21

Entries 1 to 4 in Table VIII show a program consisting of a sequence of four instructions. The caches were enabled but the entries were locked in. Columns 4 and 5 show the base current costs of instructions. Instruction 4 takes 2 cycles to execute while instructions 1 to 3 take 1 each. The measured current for the full sequence is also shown. The first step is to estimate the cost for the sequence using just the base costs. Multiplying the I1 currents by the number of cycles for each instruction, summing these up, and then dividing by the total number of cycles, gives an estimate for the I1 current for the sequence:

$$\frac{176.0 \cdot 1 + 173.0 \cdot 1 + 192.5 \cdot 1 + 174.5 \cdot 2}{5} = 178.1\ \text{mA}$$

The estimate for I2 is obviously 21 mA. The inter-instruction effects should now be accounted for, which in this example only include the effect of circuit state. This effect is modeled by considering the circuit state overhead between each pair of consecutive instructions. The measured cost for each pair is shown in entries 5 to 8. The circuit state overhead between instructions 2 and 3 is given by (196.5 - (173.0 + 192.5)/2) = 13.75 mA. Between 3 and 4 it is given by (192.5 - (192.5 + 2 × 174.5)/3) × 3/2 = 18.0 mA. The reason for multiplying by 3/2 is that for an alternating sequence of instructions 3 & 4, the overhead occurs twice during 3 cycles. In a similar way, the overheads between 1 & 2 and 4 & 1 are seen to be 28.5 mA and 18.75 mA, respectively. The average overhead is 16 mA for I1 and 0 mA for I2. When these overheads are added to the base cost estimates, we obtain 194 mA for I1 and 21 mA for I2. A comparison of entries 9 and 12 shows the close correspondence between the estimate predicted by the instruction-level power model and the actual measured value. Such a close correspondence was also obtained for other experiments involving other instruction sequences.
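The arithmetic of this example can be reproduced with a short script. It is a sketch rather than the authors' tool: the function names are hypothetical, and the step that spreads the summed overheads over the five cycles of the sequence is an inference that happens to reproduce the 16 mA figure.

```python
# Base I1 cost (mA) and cycle count of each instruction, from Table VIII.
BASE = {1: (176.0, 1), 2: (173.0, 1), 3: (192.5, 1), 4: (174.5, 2)}
# Measured I1 (mA) of each alternating pair, from Table VIII.
PAIR_MA = {(1, 2): 203.0, (2, 3): 196.5, (3, 4): 192.5, (4, 1): 187.5}


def base_estimate(base):
    """Cycle-weighted average of the base currents."""
    total_cycles = sum(cycles for _, cycles in base.values())
    weighted = sum(ma * cycles for ma, cycles in base.values())
    return weighted / total_cycles


def pair_overhead(pair, base, pair_ma):
    """Circuit state overhead per transition for an alternating pair."""
    (ma_a, cyc_a), (ma_b, cyc_b) = base[pair[0]], base[pair[1]]
    period = cyc_a + cyc_b
    average_base = (ma_a * cyc_a + ma_b * cyc_b) / period
    # Two transitions occur per period, hence the period / 2 factor.
    return (pair_ma[pair] - average_base) * period / 2


def final_estimate(base, pair_ma):
    """Base estimate plus the overheads averaged over all cycles (assumed)."""
    total_cycles = sum(cycles for _, cycles in base.values())
    overhead = sum(pair_overhead(p, base, pair_ma) for p in pair_ma)
    return base_estimate(base) + overhead / total_cycles


if __name__ == "__main__":
    print(f"base estimate:  {base_estimate(BASE):.1f} mA")
    for pair in PAIR_MA:
        print(f"overhead {pair}: {pair_overhead(pair, BASE, PAIR_MA):.2f} mA")
    print(f"final estimate: {final_estimate(BASE, PAIR_MA):.1f} mA")
```

The final value rounds to the 194 mA shown in entry 12, against a measured 196 mA.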

The overall instruction level power model for the ’934 is summarized below. Given a program, P, its overall energy cost, $E_P$, is given by:

$$E_P = \sum_i (B_i \times N_i) + \sum_{i,j} (O_{i,j} \times N_{i,j}) + \sum_k E_k \qquad (1)$$

where for each instruction $i$, $B_i$ is the base cost and $N_i$ is the number of times it will be executed, and for each pair of consecutive instructions $(i, j)$, $O_{i,j}$ is the circuit state overhead and $N_{i,j}$ is the number of times the pair will be executed. $E_k$ is the energy contribution of the other inter-instruction effects, $k$ (stalls and cache misses), that would occur during the execution of the program.

The base cost values ($B_i$) are obtained as shown in the previous sections. The circuit state overhead values ($O_{i,j}$) for all possible instruction pairs are also obtained as shown in Section 10. However, given that the circuit state overhead varies in a limited range in the case of the ’934, a constant value could be used for all instruction pairs. This is a more efficient and yet fairly accurate way of modelling this effect. A value of 18 mA has been found to be suitable.

The other parameters in the above formula vary from program to program. The execution counts $N_i$ and $N_{i,j}$ depend on the execution path of the program. This is dynamic, run-time information. In certain cases it can be determined statically, but in general it is best obtained from a program profiler. For estimating $E_k$, the number of times pipeline stalls and cache misses occur has to be determined. This is again dynamic information that can be statically predicted only in certain cases. In general, this information is obtained from a program profiler and cache simulator.

The basic energy model developed in the previous sections and described above is remarkably similar to the model for the Intel 486DX2. Therefore, the software power/energy estimation framework that was developed for the 486DX2 [14] can be directly applied to the ’934.
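The model can be sketched as a small function that takes an execution trace, as a profiler might produce, and evaluates the three sums. The function name, the trace representation, and the rule that a repeated instruction carries no circuit state overhead (since base costs are measured by repeating the same instruction) are assumptions of this sketch. All costs must be supplied in the same energy unit.

```python
from collections import Counter


def program_energy(trace, base_cost, pair_overhead=None,
                   constant_overhead=0.0, other_effects=()):
    """Evaluate E_P = sum(B_i * N_i) + sum(O_ij * N_ij) + sum(E_k).

    trace             -- instruction names in execution order
    base_cost         -- dict: instruction name -> base cost B_i
    pair_overhead     -- optional dict: (i, j) -> overhead O_ij
    constant_overhead -- fallback overhead for pairs missing from pair_overhead
    other_effects     -- iterable of E_k terms (stalls, cache misses)
    """
    pair_overhead = pair_overhead or {}
    n_i = Counter(trace)                     # execution counts N_i
    n_ij = Counter(zip(trace, trace[1:]))    # consecutive-pair counts N_ij
    base = sum(base_cost[i] * n for i, n in n_i.items())
    overhead = 0.0
    for (i, j), n in n_ij.items():
        if i == j:
            continue  # assumption: no overhead between identical instructions
        overhead += pair_overhead.get((i, j), constant_overhead) * n
    return base + overhead + sum(other_effects)


if __name__ == "__main__":
    # Hypothetical costs in arbitrary energy units, for illustration only.
    costs = {"a": 10.0, "b": 20.0}
    print(program_energy(["a", "b", "b", "a"], costs,
                         constant_overhead=1.0, other_effects=[5.0]))
```

Passing only `constant_overhead` corresponds to the simplification of using a single value for all instruction pairs, while `pair_overhead` allows measured per-pair values to override it.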

12. SOFTWARE CONTROLLED PARTIAL POWER DOWN

One of the unique features of the ’934 is a facility for powering down certain CPU modules that are not needed. This is achieved by setting the appropriate bits in a system control register known as the Power-Down Register. The modules that can be powered down are the SDRAM interface (SDI), the DMA module, the floating-point unit (FPU) and the floating-point FIFOs. Bits in the Power-Down Register that are set to 1 cause the clock input of the corresponding module to be disabled. Clearing the bits re-enables the clock input.

Table IX shows the results of experiments that study the effect of powering down specific modules, or combinations of modules. The I1 currents for two different instructions, an OR and a LD, are shown in Columns 3 and 5, respectively. The percent reduction in current for each entry is shown in Columns 4 and 6. Entry 1 shows the results when nothing is powered down. Powering down different modules leads to different power savings, the maximum being the case when all four modules, SDI, DMA, FIFO, and FPU, are powered down.

Observations and Comments

• As can clearly be seen, powering down unneeded modules can lead to significant power savings. However, powering down and later powering up a module through software involves the execution of a certain number of control instructions. These instructions themselves consume energy. Thus, powering down modules will lead to energy savings only if the modules are powered down long enough to compensate for the overhead involved in powering them down and then up.

• The previous observation also indicates that automatic power management of modules will be more effective in saving power. The overhead of powering up and down will be negligible if it is controlled by logic internal to the CPU. In addition, the temporal resolution of the power management strategy can then be much finer; it can even be performed on a cycle by cycle basis.
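The break-even condition in the first observation can be sketched as follows. The saved current comes from Table IX (FPU powered down, OR instruction), but the cost of the control instructions is a purely hypothetical figure chosen for illustration, as is the function name.

```python
import math


def break_even_cycles(saved_ma: float, control_overhead_ma_cycles: float) -> int:
    """Cycles a module must stay powered down before the saving repays the
    energy spent on the power-down and power-up control instructions.

    Both arguments use current-based units (mA and mA * cycles), which are
    proportional to power and energy at a fixed supply voltage and clock.
    """
    if saved_ma <= 0:
        raise ValueError("saved_ma must be positive")
    return math.ceil(control_overhead_ma_cycles / saved_ma)


if __name__ == "__main__":
    saved = 177.0 - 155.5      # Table IX: I1 for OR, nothing vs. FPU powered down
    overhead = 10 * 198.0      # hypothetical: 10 control cycles at 198 mA total
    print(break_even_cycles(saved, overhead))
```

The shorter the idle period of a module relative to this break-even point, the less attractive software controlled power down becomes.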

TABLE IX Software controlled partial power down

                            or %i0, 0, %l0      ld [%l0], %i0
No.  Powered-down units     I1 (mA)  % saved    I1 (mA)  % saved
1    None                   177      0.0        192.5    0.0
2    SDI                    164      7.6        188.5    2.1
3    DMA                    164      7.6        188      2.3
4    FPU                    155.5    12.1       180      6.5
5    FIFO                   155.5    12.1       180      6.5
6    DMA, FPU               151.5    14.4       174      9.6
7    DMA, FIFO              151.5    14.4       175      9.1
8    FIFO, FPU              142      19.8       166      13.8
9    DMA, FIFO, FPU         138      22.0       162.5    15.6
10   SDI, DMA, FIFO, FPU    133      25.0       157      18.4


• An interesting observation leading from the above table is that the power saving achieved by powering down a combination of modules doesn’t necessarily equal the sum of the individual power reductions. For example, for the OR instruction, the power savings in entry 10 are not equal to the sum of the savings shown in entries 2 to 5. This is because circuit activity in different modules might be correlated for this instruction. Powering down one module also eliminates some activity in another module. Thus, powering down these modules together results in different savings than what is expected from the savings achieved by powering them down individually.

This actually illustrates the fact that power estimation/analysis methods have to account for correlations between the activities in various modules for each instruction. A power estimation method based on summing up typical power consumptions of separate modules, while disregarding correlations, can be very inaccurate. Thus, the exact correlations have to be known in order to effectively use such a method. The alternative is to use methods which estimate/analyze the power consumption of the entire CPU as a whole. The measurement based analysis method described in this paper is such a method. It implicitly accounts for all correlations between internal modules, since it is based on measurements made at the boundaries of the processor.
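The non-additivity can be checked directly against the OR column of Table IX. The short script below is only an illustration of that comparison; the variable names are hypothetical.

```python
# I1 (mA) for the OR instruction, from Table IX.
I1_NONE_MA = 177.0
I1_SINGLE_MA = {"SDI": 164.0, "DMA": 164.0, "FPU": 155.5, "FIFO": 155.5}
I1_ALL_FOUR_MA = 133.0

# Saving obtained by naively adding the four individual reductions.
summed_saving = sum(I1_NONE_MA - ma for ma in I1_SINGLE_MA.values())
# Saving actually measured with all four modules powered down together.
measured_saving = I1_NONE_MA - I1_ALL_FOUR_MA

if __name__ == "__main__":
    print(f"sum of individual savings: {summed_saving:.1f} mA")
    print(f"measured combined saving:  {measured_saving:.1f} mA")
```

A module-by-module summation would therefore overestimate the combined saving for this instruction by a wide margin.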

13. CONCLUSIONS

This paper describes the application of a new power analysis technique for analyzing the power consumption of the Fujitsu SPARClite MB86934, a RISC processor. This technique had earlier been applied to the Intel 486DX2, a CISC processor. The successful application of this technique for both these processors points to its general applicability for other processors. This study reveals that the basic instruction-level power model of the ’934 is very similar to that of the 486DX2. This power model can be used to effectively evaluate the power cost of software, without requiring knowledge of the proprietary lower level details of the processor. The results of the analysis also provide valuable information about the power consumption in the ’934. Besides suggesting several ideas for the design of power efficient software, this information reveals other avenues for power reduction in the processor’s design.

Acknowledgment

The authors would like to thank D. Maheshwari, H. Kotcherlakota, M. Somasundaram, A. Watanabe, and B. McKeever of Fujitsu Microelectronics Inc. for their help with the experimental setup and for technical discussions.

References

[1] Feigel, C. and Enfield, M., Fujitsu extends SPARClite family. Microprocessor Report, June 1994.

[2] Fujitsu Microelectronics Inc., SPARClite Embedded Processor User’s Manual, 1993.

[3] Fujitsu Microelectronics Inc., SPARClite Embedded Processor User’s Manual: MB86934 Addendum, 1994.

[4] Huang, C. X., Zhang, B., Deng, A. C. and Swirski, B., The design and implementation of PowerMill. In Proceedings of the International Symposium on Low Power Design, pp. 105-110, Dana Point, CA, April 1995.

[5] Landman, P. and Rabaey, J., Black-box capacitance models for architectural power analysis. In Proceedings of the International Workshop on Low Power Design, pp. 165-170, Napa, CA, April 1994.

[6] Landman, P. and Rabaey, J., Activity-sensitive architectural power analysis for the control path. In Proceedings of the International Symposium on Low Power Design, pp. 93-98, Dana Point, CA, April 1995.

[7] Lee, T. C., Tiwari, V., Malik, S. and Fujita, M., Power analysis and low-power scheduling techniques for embedded DSP software. In Proceedings of the International Symposium on System Synthesis, Cannes, France, Sept. 1995.

[8] Nagel, L. W., SPICE2: A computer program to simulate semiconductor circuits. Technical Report ERL-M520, University of California, Berkeley, 1975.

[9] Salz, A. and Horowitz, M., IRSIM: An incremental MOS switch-level simulator. In Proceedings of the Design Automation Conference, pp. 173-178, 1989.

[10] Sato, T., Nagamatsu, M. and Tago, H., Power and performance simulator: ESP and its application for 100MIPS/W class RISC design. In Proceedings of 1994 IEEE Symposium on Low Power Electronics, pp. 46-47, San Diego, CA, Oct. 1994.

[11] Su, C. L., Tsui, C. Y. and Despain, A. M., Low power architecture design and compilation techniques for high-performance processors. In IEEE COMPCON, Feb. 1994.

[12] Tiwari, V., Malik, S. and Ashar, P., Guarded evaluation: Pushing power management to logic synthesis/design. In Proceedings of the International Symposium on Low Power Design, pp. 221-226, Dana Point, CA, April 1995.

[13] Tiwari, V., Malik, S. and Wolfe, A., Compilation techniques for low energy: An overview. In Proceedings of 1994 IEEE Symposium on Low Power Electronics, pp. 38-39, San Diego, CA, Oct. 1994.

[14] Tiwari, V., Malik, S. and Wolfe, A., Power analysis of embedded software: A first step towards software power minimization. IEEE Transactions on VLSI Systems, 2(4), 437-445, December 1994.

[15] Ong, P. W. and Yan, R. H., Power-conscious software design - a framework for modeling software on hardware. In Proceedings of 1994 Symposium on Low Power Electronics, pp. 36-37, San Diego, CA, Oct. 1994.

Authors’ Biographies

Vivek Tiwari received the B.Tech. degree in Computer Science and Engineering from the Indian Institute of Technology, New Delhi, India in 1991. Currently he is working towards the Ph.D. degree in the Department of Electrical Engineering, Princeton University.

His research interests are in the areas of Computer Aided Design of VLSI and embedded systems and in microprocessor architecture. The focus of his current research is on tools and techniques for power estimation and low power design. He has held summer positions at NEC Research Labs (1993), Intel Corporation (1994), Fujitsu Labs of America (1994), and IBM T. J. Watson Research Center (1995), where he worked on the above topics.

He received the IBM Graduate Fellowship Award in 1993, 1994, and 1995, and a Best Paper Award at ASP-DAC ’95.

Mike Tien-Chien Lee received his B.S. degree in Computer Science from National Taiwan University in 1987, and the M.S. degree and the Ph.D. degree in electrical engineering from Princeton University, in 1991 and 1993, respectively.

He is currently a researcher at Fujitsu Laboratories of America, Santa Clara, CA. His research interests include low-power design, embedded system design, high-level synthesis, and test synthesis.

He has served in the program committees of IEEE Pacific Northwest Test Workshop and IEEE International Test Synthesis Workshop. He is also a consulting researcher at Center of Reliable Computing, Stanford University. He received a Best Paper Award at ASP-DAC ’95.
