Quantifying the Energy Efficiency of Coordinated Micro-Architectural
Adaptation for Multimedia Workloads
Shrirang Yardi and Michael S. Hsiao
The Bradley Department of Electrical and Computer Engineering
Virginia Tech, Blacksburg, VA 24061. USA.
{yardi,mhsiao}@vt.edu
Abstract—Adaptive micro-architectures aim to achieve greater energy efficiency by dynamically allocating computing resources to match the workload performance. The decisions of when to adapt (temporal dimension) and what to adapt (spatial dimension) are taken by a control algorithm based on an analysis of the power/performance tradeoffs in both dimensions. We perform a rigorous analysis to quantify the energy efficiency limits of fine-grained temporal and coordinated spatial adaptation of multiple architectural resources by casting the control algorithm as a constrained optimization problem. Our study indicates that coordinated adaptation can potentially improve energy efficiency by up to 60% as compared to static architectures and by up to 33% over algorithms that adapt resources in isolation. We also analyze synergistic application of coarse and fine grained adaptation and find modest improvements of up to 18% over optimized dynamic voltage/frequency scaling. Finally, we analyze several previous control algorithms to understand the underlying reasons for their inefficiency.
I. INTRODUCTION
As transistor densities increase rapidly with each new
process technology and supply voltage decreases relatively
slowly, microprocessor power consumption has become a
critical operational constraint. Researchers have mainly used
two approaches to reduce microprocessor power. The first is
intelligent hardware design with static power-saving techniques
(e.g., clock/power gating of unused components).
The second is to dynamically allocate just enough resources
to match the performance requirements of the application.
These adaptive approaches aim to achieve greater energy-
efficiency by exploiting the variability or execution slack
which arises due to the diverse execution characteristics of
different applications running on static hardware. Examples
of such methods include adaptation of micro-architectural
structures [1] and system-level adaptation such as dynamic
voltage/frequency scaling (DVS) [16], among others.
Adaptive techniques typically exploit two types of exe-
cution slack to save energy: temporal slack which can be
exploited by slowing down the processor and resource slack
which can be exploited by re-sizing or de-activating parts of
the processor. The key to adaptation is the control algorithm
that decides when to adapt and what to adapt with the goal of
achieving energy-efficient operation [7]. Ideally, to maximize
energy efficiency, we would like to adapt frequently (tempo-
rally fine-grained) over an adaptive space of many resources
(spatially coordinated). This scenario of performing fine-
grained temporal and coordinated spatial adaptation is a
complex multi-dimensional optimization problem. To realize
the full potential of such adaptation, it is important to
perform a rigorous assessment of its benefits and costs.
This paper performs a detailed, off-line, quantitative anal-
ysis of the energy savings when adapting multiple resources
within a high-performance general-purpose microprocessor
running multimedia workloads. In this context, our goals
are to perform a comprehensive exploration of the adaptive
design space, quantify the potential efficiency benefits of
fine-grained and coordinated adaptation and identify the
limitations of existing techniques. If significant gains are
found, this can motivate further analysis and design of more
efficient adaptive hardware substrates and control algorithms.
A. Motivation
The following factors have motivated our study:
1) A considerable amount of research has been devoted to
the design of control algorithms for micro-architectural
adaptation (see Albonesi et al. [1] for a survey).
However, due to the challenging multi-dimensionality
of the problem, prior techniques are largely ad-hoc
and have often constrained their analysis in either the
temporal or spatial dimensions. Temporal constraints
limit micro-architectural responsiveness to workload
heterogeneity and spatial constraints fail to account
for interactions between adaptive structures. Only a
rigorous and comprehensive exploration of the adaptive
design space can provide an accurate idea of the
potential efficiency benefits.
2) We focus on multimedia applications because, unlike
throughput-oriented workloads (such as SPEC), these
applications present a unique set of issues that warrant
their detailed study. First, these applications represent
a large (and sometimes the only) share of the workload
on increasingly power-hungry mobile devices.
Second, these applications have markedly different
execution characteristics than throughput workloads so
that several multimedia-specific adaptation techniques
have been proposed [6], [7], [14]. It is important to
analyze such application-specific control algorithms
to determine the underlying reasons for their energy
(in)efficiency. However, our analysis framework is also
we account for structural adaptation overheads but assume
zero overhead for performing per-frame DVS.
III. MODELING STRUCTURAL ADAPTATION
A. Problem Formulation
Our approach for modeling fine-grained structural adaptations
is based on previous work by Hughes et al. [6]
and is described as follows. Within a frame, each epoch,
$i$, can be run with a different architectural configuration,
$C_j$, where $j \in \mathcal{A}rch$, the set of all possible configurations.
Each configuration has two attributes: a reward, which is the
energy saved by using $C_j$ instead of Base, and a cost, which
is the performance degradation due to $C_j$ for that epoch.
The goal is to determine a single configuration, $C_j$, for each
epoch, $i$, such that these configurations together result in
the most energy saved while consuming no more than the
available temporal slack for the frame. We characterize the
reward in terms of energy-per-instruction (EPI) saved and
the cost in terms of the number of additional cycles needed
to execute each epoch, both relative to Base. Formally, we can state the
problem as:
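\begin{align}
\text{maximize}\quad & \sum_{i \in \mathcal{N}} \sum_{j \in \mathcal{A}rch} E_{ij} \cdot C_{ij} \quad \text{subject to:} \tag{1} \\
& \sum_{i \in \mathcal{N}} \sum_{j \in \mathcal{A}rch} S_{ij} \cdot \left( C_{ij} \cdot A(j) \right) \le S_{frame}, \tag{2} \\
& \forall i \in \mathcal{N}:\ \sum_{j \in \mathcal{A}rch} C_{ij} = 1, \tag{3} \\
& C_{ij} = \begin{cases} 1 & \text{if config } j \text{ is selected for interval } i \\ 0 & \text{otherwise} \end{cases} \tag{4}
\end{align}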
Above, $E_{ij}$ and $S_{ij}$ are the energy-per-instruction (EPI) saved
and the cycles required when using configuration $j$ vs. Base
for epoch $i$. $S_{frame}$ is the available slack for the frame and $\mathcal{N}$
is the total number of epochs in the frame. $A$ is a map from
the value of $C_{ij}$ to the actual configuration to be used. Eqn.
3 guarantees that exactly one configuration is selected for
each epoch by using the decision variable $C_{ij}$. The products
in Eqns. 1 and 2 define the complete energy-performance
tradeoff space for the configurations in $\mathcal{A}rch$. The optimal
solution is a vector $C^* = (C^*_1, \ldots, C^*_{\mathcal{N}})$ of configurations, one
per epoch, that provides the maximum energy savings. This
problem is an instance of the well-known multiple-choice
knapsack problem (MCKP) and is NP-hard [10]. Note that,
since the temporal slack and number of instructions vary
from frame to frame, we have to define one such problem
for each frame. We term this problem OPT:FG.
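To make the structure of OPT:FG concrete, the following is a minimal sketch of a textbook dynamic program for the multiple-choice knapsack problem. It is not the solver used in this study, which the text does not specify. It assumes that the costs $S_{ij}$ have been quantized to non-negative integer units, which makes the program pseudo-polynomial in the slack budget. The function name `solve_mckp` and the toy numbers are hypothetical.

```python
def solve_mckp(rewards, costs, budget):
    """Choose exactly one configuration per epoch (Eqn. 3) so that total
    reward is maximal (Eqn. 1) and total cost stays within budget (Eqn. 2).

    rewards[i][j]: EPI saved by configuration j in epoch i (E_ij).
    costs[i][j]:   extra cycles vs. Base, as non-negative integers (S_ij).
                   Integer units are an assumption of this sketch.
    budget:        available slack for the frame (S_frame), same units.
    Returns (best_reward, picks) or (None, []) if no selection fits.
    """
    neg_inf = float("-inf")
    # best[b] = max reward over the epochs processed so far with total cost b
    best = [neg_inf] * (budget + 1)
    best[0] = 0.0
    choice = []  # choice[i][b] = (configuration, previous cost) for backtracking
    for e_row, s_row in zip(rewards, costs):
        nxt = [neg_inf] * (budget + 1)
        pick = [None] * (budget + 1)
        for b in range(budget + 1):
            if best[b] == neg_inf:
                continue  # cost b is not reachable
            for j, (e, s) in enumerate(zip(e_row, s_row)):
                nb = b + s
                if nb <= budget and best[b] + e > nxt[nb]:
                    nxt[nb] = best[b] + e
                    pick[nb] = (j, b)
        best = nxt
        choice.append(pick)
    end = max(range(budget + 1), key=lambda b: best[b])
    if best[end] == neg_inf:
        return None, []
    # Walk back from the best final cost to recover one configuration per epoch.
    picks = []
    b = end
    for pick in reversed(choice):
        j, b = pick[b]
        picks.append(j)
    picks.reverse()
    return best[end], picks


# Toy instance with hypothetical numbers; configuration 0 plays the role of
# Base (zero reward, zero cost), so a feasible selection always exists.
rewards = [[0, 4, 6], [0, 3, 5], [0, 3, 7]]
costs = [[0, 1, 3], [0, 2, 4], [0, 1, 5]]
total, picks = solve_mckp(rewards, costs, budget=5)
print(total, picks)  # 10.0 [1, 1, 1]
```

In the toy instance, taking the moderate configuration in every epoch saves more than spending most of the slack on a single aggressive configuration.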
B. Solving OPT:FG
To solve the optimization problem, we need the values of
$E_{ij}$ and $S_{ij}$ for all frames, all configurations and all epochs.
We obtain these values using cycle-accurate instruction-level
simulation as follows.
We reconfigure the instruction window (IW), load store
queue (LSQ), number of integer ALUs, the number of FPUs
(floating-point units) and the issue width, giving $|\mathcal{A}rch| = 25920$
configurations. To reduce the number of simulations
and to maintain a balanced design, we adapt IW and LSQ
together and the ALUs and the issue width together. More
details about the different adaptive units are provided in
Section V. With these constraints, we need to perform 360
simulations for each frame to obtain the values for $E_{ij}$ and
$S_{ij}$ required to solve the problem. For each application, we
profile several frames for all the configurations.
An intuitive idea about the solution can be given as follows.
For each epoch, the most energy-efficient configuration
is the one that offers the best tradeoff between EPI saved
and the cycles used. In other words, since each configuration
uses some part of the available temporal slack, $C^*$ provides
the best way to “distribute” the slack across the frame
by exploiting synergistic interactions between the adaptive
resources. Finally, to obtain the actual dynamic energy, we
simulate each frame using its optimal configurations.
IV. INTEGRATED STRUCTURAL AND SYSTEM
ADAPTATION
In the context of soft real-time systems, DVS has long
been applied as an effective frame-level technique [15],
where the processor voltage/frequency are scaled to save
energy while guaranteeing that the deadline is met. One
of our goals is to understand the interaction between these
adaptations and quantify the potential efficiency benefits
by applying them synergistically. As a simple example of
interaction between the two algorithms, an aggressive DVS
setting may allow the fine-grained algorithm to exercise a
wider range of configurations and conversely, a less ag-
gressive setting may leave little potential to exploit intra-
frame variability. This section describes our formulation to
determine the optimal way to apply these adaptations.
A. Problem Formulation
The objective is to select a single frequency/voltage for
the frame and a single configuration for each epoch within
a frame such that, together, they maximize the EPI savings
while consuming no more than the available slack for the
frame. Eqns. 5-11 formally state the problem. For Eqns. 5-11,
$\mathcal{A}rch$, $\mathcal{N}$, $C_{ij}$ and $A$ have the same definitions as for
OPT:FG. $\mathcal{V}$ is the set of all possible voltage values (possibly
unbounded for a system supporting continuous DVS). $D_k$ is
a binary variable that is set to 1 if voltage $V(D_k)$ is selected
for the frame, where $V$ maps $k$ to a unique voltage/frequency
pair. Eqns. 7 and 8 guarantee that a single voltage value
is selected for the entire frame and a single configuration
is selected for each epoch. Eqn. 9 shows that, for all $k \in \mathcal{V}$,
$S_{frame}$ depends on both $V(D_k)$ and on the configuration set
$C^* = (C^*_1, \ldots, C^*_{\mathcal{N}})$ selected for the frame. We denote this
problem as OPT:CG+FG.
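\begin{align}
\text{maximize}\quad & \sum_{k \in \mathcal{V}} \sum_{i \in \mathcal{N}} \sum_{j \in \mathcal{A}rch} E_{kij} \cdot D_k \cdot C_{ij} \quad \text{subject to:} \tag{5} \\
& \forall k \in \mathcal{V}:\ \sum_{i=1}^{\mathcal{N}} \sum_{j=1}^{|\mathcal{A}rch|} S_{ij} C_{ij} \le S_{frame,k} \tag{6} \\
& \forall k \in \mathcal{V}:\ \sum_{k=1}^{|\mathcal{V}|} D_k = 1, \tag{7} \\
& \forall i \in \mathcal{N}:\ \sum_{j=1}^{|\mathcal{A}rch|} C_{ij} = 1 \tag{8}
\end{align}
where,
\begin{align}
& \forall k \in \mathcal{V}:\ S_{frame,k} = D_k \cdot F\Big( V(D_k),\ \sum_{i=1}^{|\mathcal{N}|} \sum_{j=1}^{|\mathcal{A}rch|} C_{ij} \cdot A(j) \Big), \tag{9} \\
& \forall k \in \mathcal{V}:\ D_k = \begin{cases} 1 & \text{if voltage } k \text{ is selected for the frame} \\ 0 & \text{otherwise} \end{cases} \tag{10} \\
& C_{ij} = \begin{cases} 1 & \text{if config } j \text{ is selected for interval } i \\ 0 & \text{otherwise} \end{cases} \tag{11}
\end{align}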
The solution to this problem provides, for each frame, (1)
a single voltage/frequency value, $V_{CG}$, and (2) the optimal
configuration set $C^*_{FG}$, which together save the most energy while
consuming no more than the available slack for the frame.
Intuitively, by selecting the best voltage and configuration
set, the solution provides the best “split” of the available
slack between the two control algorithms.
B. Solving OPT:CG+FG
Since $S_{frame}$ now depends on both voltage and the configuration
set, OPT:CG+FG is a mixed-integer, non-linear
problem (MI-NLP) and is infeasible even for industrial
solvers. One naive heuristic to solve it would be to discretize
$[0, V]$, effectively decoupling the voltage and configuration
selection. This is similar to solving OPT:FG repeatedly
with $E_{ij}$ values scaled for each discrete voltage value. We
wish to avoid such decoupling in order to consider the interaction
between these adaptations, and use the following heuristic to
accomplish this.
We use the amount of temporal slack as a knob to control
the relative aggressiveness (and hence energy efficiency) of
the CG and FG parts of OPT:CG+FG as follows. For the
candidate frame, let $T_{base}$ be the execution time for Base
and $S_{max}$ be the maximum available temporal slack. Consider
the case when only structural adaptation is performed for
some slack $S_{FG} \le S_{max}$. This is accomplished by solving
OPT:FG with $S_{frame} = S_{FG}$ to obtain the minimum energy
configuration set, $C^*_{FG}$. Let $T_{FG}$ be the required execution
time and $IPC_{FG}$ be the average IPC. It follows that $T_{FG} = T_{base} + S_{FG}$.
Next, consider the case that DVS is applied in addition
to structural adaptation to consume the remaining slack,
$S_{CG} = S_{max} - S_{FG}$. It follows that $T_{CG} = T_{FG} + S_{CG}$, i.e.,
$T_{CG} = T_{base} + (S_{FG} + S_{CG})$. The minimum frequency required
to consume $S_{CG}$ is then given by $f_{CG} = \frac{ICount}{T_{CG} \times IPC_{Base}}$
[7].
The goal of OPT:CG+FG then is to determine the best
“split” of $S_{max}$ into $S_{FG}$ and $S_{CG}$ such that energy savings for
the frame are maximized. We discretize the interval $[0, S_{max}]$ into several candidate splits: we use values of 1% to 100%
of $S_{max}$ in steps of 1%. For each split, we calculate $S_{FG}$ and $S_{CG}$, solve OPT:FG to obtain $C^*_{FG}$, determine $T_{FG}$ and
$IPC_{FG}$ and determine $f_{CG}$. Finally, we simulate the frame
using these values to obtain the split that gives the best energy
savings. In summary,
for each frame do
    T_base = frame execution time on Base
    S_max = deadline - T_base
    ICount = instruction count for this frame
    for split = 0.01 to 1 in steps of 0.01 do
        Solve OPT:FG with S_frame = split to get C*_FG, IPC_FG
        T_CG = T_FG + (S_max × (1 − split))
        f_CG = ICount / (T_CG × IPC_FG)
        EPI_split = EPI at f_CG, V_CG with C*_FG
    end
    Lowest EPI_split gives best V_CG, C*_FG
end
Algorithm 1: Slack splitting heuristic to solve OPT:CG+FG
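The loop of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation. The callables `solve_opt_fg` and `epi_at` are hypothetical stand-ins for the OPT:FG solver and for the simulation that reports EPI. The sketch also reads `split` as the fraction of $S_{max}$ given to structural adaptation, so that OPT:FG is solved with $S_{frame} = split \times S_{max}$. The toy models at the bottom are invented purely so that the example runs.

```python
def split_slack(t_base, deadline, icount, solve_opt_fg, epi_at, steps=100):
    """Try each split of the maximum slack between structural adaptation (FG)
    and DVS (CG), and keep the split with the lowest EPI.

    solve_opt_fg(s_fg) -> (configs, t_fg, ipc_fg)   hypothetical OPT:FG solver
    epi_at(f_cg, configs) -> float                  hypothetical EPI evaluation
    Returns (epi, split, f_cg, configs) for the best split.
    """
    s_max = deadline - t_base
    best = None
    for n in range(1, steps + 1):
        split = n / steps                      # 1% to 100% of S_max
        configs, t_fg, ipc_fg = solve_opt_fg(split * s_max)
        t_cg = t_fg + s_max * (1.0 - split)    # DVS consumes the remaining slack
        f_cg = icount / (t_cg * ipc_fg)        # minimum frequency, as in Algorithm 1
        epi = epi_at(f_cg, configs)
        if best is None or epi < best[0]:
            best = (epi, split, f_cg, configs)
    return best


# Toy stand-ins (all numbers and models below are assumptions of this sketch).
F_BASE = 1.0e9                                  # base clock in Hz
T_BASE, DEADLINE, ICOUNT = 0.020, 0.030, 2.0e7  # seconds, seconds, instructions

def toy_solve_opt_fg(s_fg):
    # Pretend structural adaptation uses exactly the slack it is given.
    t_fg = T_BASE + s_fg
    ipc_fg = ICOUNT / (t_fg * F_BASE)
    return s_fg, t_fg, ipc_fg

def toy_epi(f_cg, configs):
    # Pretend EPI scales with the square of frequency and that the chosen
    # configurations give no structural savings at all.
    return (f_cg / F_BASE) ** 2

epi, split, f_cg, _ = split_slack(T_BASE, DEADLINE, ICOUNT, toy_solve_opt_fg, toy_epi)
```

With these toy models structural adaptation saves nothing, so the search gives almost all of the slack to DVS. A model in which the configurations do save energy shifts the best split toward structural adaptation, which is the interaction the heuristic is meant to capture.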
The main advantage of the slack splitting approach over
the naive heuristic is that it allows a wider choice in selection
of voltage values which makes the solution closer to the
theoretical optimum. A discrete voltage would limit the
voltage choices and consequently the potential benefits.
V. SIMULATION SETUP
We use the execution-driven SimpleScalar (v3.0d) simulator
[3] for performance evaluation and the Wattch [2] tool to
track dynamic energy consumption. The base, non-adaptive
architecture is an aggressive 8-wide out-of-order superscalar
processor (parameters summarized in Table I).
Adaptive Structures Modeled: We assume a centralized
instruction window but with a separate register file. The
window is implemented as a circular FIFO without collapsing
and is split into 8-entry segments [13]. We clock-gate
the empty and ready entries in the wake-up logic [5]. We
assume that the issue width of the core is the sum of all
active functional units [14]. When a functional unit is de-
activated, we also deactivate the corresponding parts of the
instruction selection logic, result bus and wake-up ports of
the instruction window.
Adaptation Overheads: To evaluate the best possible per-
formance of each adaptation algorithm, our study does not
model the adaptation overheads for DVS. For structural
adaptations, the delay overhead due to small additional
always resulted in worse performance than exploring the
cross-product of design points. The last three columns list
the fraction of deadlines missed for each algorithm for the
default deadlines. The deadlines missed for CG and CG+FG
were less than 3% in all cases.
This data highlights the large design effort required for
prevalent threshold-based approaches. Even after application-specific
tuning, their behavior is unpredictable and, as we see
in the next sub-section, their energy benefits are limited.
VII. RESULTS FOR STRUCTURAL ADAPTATION
This section presents results for structural adaptation using
the default deadlines. We first summarize the potential energy
benefits of different algorithms. We then quantify the sources
of inefficiency of previous algorithms based on the manner in
which they consume the resource and temporal slack. We find
that the efficiency of OPT:FG is a result of judicious temporal
slack distribution and a comprehensive use of configuration
options.
A. Potential Energy Savings
Table IV summarizes the energy savings for each bench-
mark, expressed as percentage energy savings over Base for
each algorithm, averaged over all frames. For reference, we
also list the savings of OPT:FG relative to each algorithm,
which we term the energy efficiency gap. This data illustrates
the significant energy benefits of exploiting intra-frame
variability, with mean potential savings of up to 60%. Energy
saved is proportional to the amount of intra-frame variability:
benchmarks with lower variability such as MPEG2-dec and
MP3-dec show modest savings (up to 47%), whereas those
with high variability, such as MPEG2-enc and Mesa, show
significant savings (up to 85%). In general, algorithms that
adapt structures together perform well, with CG+FG showing
savings within 13% of OPT:FG. Further tuning can likely
improve these savings. However, notice that Manual (12%
savings) performs worse than even IW (15% savings). This,
coupled with the high miss ratio of Manual (Table II), shows
that it is difficult to guarantee performance even if thresholds
are extensively hand-tuned for individual applications.
Table IV also quantifies the amount of temporal slack
consumed by each algorithm as the slowdown over Base,
averaged across all frames. In general, structural adaptation
is unable to consume large amounts of temporal slack, which
indicates that most potential savings result from exploiting
intra-frame resource slack. This has significant implications
for coarse-grained algorithms. These algorithms can exploit
almost the entire temporal slack to save as much energy as
possible (detailed in Section VIII).
In what follows, we use the solution of OPT:FG to
quantify the underlying sources of inefficiency when con-
straining adaptation in the spatial and temporal dimensions.
We find that the net energy efficiency of OPT:FG results
from: (1) using all of the available configuration space and
reconfiguring in moderate to large step sizes between
neighboring intervals, and (2) a strategic distribution of the
available temporal slack within the frame.
[Figure 1: four panels (MPEG2-enc, Mesa-Texgen, H263-enc, H264-dec). x-axis: IW, FU, MAN, CG, CG+FG, OPT; y-axis: Relative Magnitude of Parameter Change.]
Fig. 1. Magnitude of Configuration Changes
B. Configurations Used
Figure 1 plots the magnitude of change in parameter values
across intervals in terms of step size [11]. The step size is the
difference in configuration parameters between successive
intervals expressed as a fraction of the total number of con-
figurations. For example, we have 15 choices for instruction
window size (16 to 128 entries in steps of 8). If window
size changes from 32 to 64 entries in successive intervals,
TABLE IV
ENERGY SAVINGS (%), ENERGY EFFICIENCY GAP (%) AND SLOWDOWN FOR DEFAULT DEADLINES
App. | Savings (% Base Energy): IW, FU, MAN, CG, CG+FG, OPT:FG | OPT:FG Savings Relative to: IW, FU, MAN, CG, CG+FG | Slowdown (× Base Execution Time): MAN, CG, CG+FG, OPT:FG
that constrain spatial adaptivity resulting in bottlenecks.
Finally, we observe that fine-grained temporal adaptivity is
better suited to localize energy costs by expending power
during epochs that actually need it, thus reducing waste and
increasing the net efficiency.
To make this study more extensive, recent advances in
statistical inference based techniques [4], [9], [11] can be
leveraged. These techniques perform efficient design space
exploration by using linear and/or non-linear predictive models
to infer processor power/performance using fewer detailed
simulations. It will also be interesting to analyze adaptation
in the presence of SMT and/or CMP configurations that are
becoming common even for mobile devices.
IX. CONCLUSION
We have presented a detailed analysis of fine-grained
temporal and coordinated spatial micro-architectural adap-
tation by casting adaptation as a combinatorial optimization
problem. We also analyze the problem of integrating coarse-
grained adaptation with architectural adaptation using a novel
optimization model. Solutions to these models have allowed
an oracle-based assessment of the potential energy efficiency
benefits and an insight into the behavior of ideal control al-
gorithms. The solutions reveal significant efficiency benefits
resulting from a judicious use of available temporal slack and
comprehensive use of the adaptive space. A comparison with
several previous algorithms has demonstrated the impracticability
of threshold-based algorithms and the loss in efficiency
from constraining adaptation in the temporal dimension, the spatial
dimension, or both. Although our problem formulations are
conceptually simple, the analysis is much more complex due
to the high computational cost and multi-dimensionality of
the problem. Given the significant potential benefits, our next
step is to analyze control algorithm implementation options
in terms of their complexity and effectiveness.
REFERENCES
[1] D. H. Albonesi et al. Dynamically tuning processor resources with adaptive processing. IEEE Computer, 36(12):49–58, 2003.
[2] D. Brooks et al. Wattch: a framework for architectural-level power analysis and optimizations. In ISCA, pages 83–94, 2000.
[3] D. Burger, T. M. Austin, and S. Bennett. Evaluating Future Microprocessors: The SimpleScalar Tool Set. Tech. Report CS-TR-1996-1308.
[4] S. Eyerman, L. Eeckhout, and K. D. Bosschere. Efficient design space exploration of high performance embedded out-of-order processors. In DATE, pages 351–356, 2006.
[5] D. Folegnani and A. Gonzalez. Energy-effective issue logic. In International Symposium on Computer Architecture (ISCA), pages 230–239, 2001.
[6] C. J. Hughes and S. V. Adve. A formal approach to frequent energy adaptations for multimedia applications. In ISCA, pages 138–149, 2004.
[7] C. J. Hughes, J. Srinivasan, and S. V. Adve. Saving energy with architectural and frequency adaptations for multimedia applications. In MICRO, pages 250–261, 2001.
[8] Intel Corporation. Intel Pentium M Processor Datasheet.
[9] E. Ipek et al. Efficiently exploring architectural design spaces via predictive modeling. In ASPLOS, 2006.
[10] H. Kellerer, U. Pferschy, and D. Pisinger. Knapsack Problems. Springer, 2004.
[11] B. C. Lee and D. Brooks. Efficiency trends and limits from comprehensive microarchitectural adaptivity. In ASPLOS, 2008.
[12] C. Lee, M. Potkonjak, and W. H. Mangione-Smith. Mediabench: A tool for evaluating and synthesizing multimedia and communications systems. In MICRO, pages 330–335, 1997.
[13] D. Ponomarev, G. Kucuk, and K. Ghose. Reducing power requirements of instruction scheduling through dynamic allocation of multiple datapath resources. In MICRO, pages 90–101, 2001.
[14] R. Sasanka, C. J. Hughes, and S. V. Adve. Joint local and global hardware adaptations for energy. In ASPLOS, pages 144–155, 2002.
[15] O. S. Unsal and I. Koren. System-Level Power-Aware Design Techniques in Real-Time Systems. Proc. of IEEE, 91(7), Jul 2003.
[16] M. Weiser, B. B. Welch, A. J. Demers, and S. Shenker. Scheduling for reduced cpu energy. In OSDI, pages 13–23, 1994.
[17] Y.-K. Chen et al. Media Applications on Hyper-Threading Technology. Intel Technology Journal, 6(1), February 2003.