Memory Power Management via Dynamic Voltage/Frequency Scaling
Howard David (Intel) Eugene Gorbatov (Intel) Ulf R. Hanebutte (Intel)
Chris Fallin (CMU) Onur Mutlu (CMU)
Memory Power is Significant
• Power consumption is a primary concern in modern servers
• Many prior works target the CPU, the whole system, or the cluster level
• But memory power is largely unaddressed
• In our server system*, memory is 19% of system power (avg)
  – Some works note up to 40% of total system power
• Goal: can we reduce this figure?
[Figure: System power vs. memory power (W, 0–400) for each SPEC CPU2006 benchmark, lbm through hmmer]
*Dual 4-core Intel Xeon®, 48GB DDR3 (12 DIMMs), SPEC CPU2006, all cores active. Measured AC power, analytically modeled memory power.
Existing Solution: Memory Sleep States?
• Most memory energy-efficiency work uses sleep states
  – Shut down DRAM devices when no memory requests are active
• But even low-memory-bandwidth workloads keep memory awake
  – Idle periods between requests diminish in multicore workloads
  – CPU-bound workloads/phases are rarely completely cache-resident
[Figure: Sleep State Residency — time spent in sleep states (0–8%) for each SPEC CPU2006 benchmark]
Memory Bandwidth Varies Widely
• Workload memory bandwidth requirements vary widely
• The memory system is provisioned for peak capacity → often underutilized
[Figure: Memory Bandwidth for SPEC CPU2006 — per-channel bandwidth (GB/s, 0–8) for each benchmark]
Memory Power Can Be Scaled Down
• DDR can operate at multiple frequencies → reduced power
  – Lower frequency directly reduces switching power
  – Lower frequency allows for lower voltage
  – Comparable to CPU DVFS
• Frequency scaling increases latency → reduced performance
  – The memory storage array is asynchronous
  – But bus transfer time depends on frequency
  – When bus bandwidth is the bottleneck, performance suffers
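As a rough guide (this is the standard CMOS switching-power relation, not a DRAM-specific model from this talk), dynamic power scales as

$$P_{\text{dyn}} = \alpha \, C \, V_{dd}^{2} \, f$$

so reducing frequency alone saves power linearly, and the lower voltage that a lower frequency permits adds quadratic savings on top. Measured system-level sensitivity to each knob: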
CPU voltage/frequency ↓ 15% → system power ↓ 9.9%
Memory frequency ↓ 40% → system power ↓ 7.6%
Observations So Far
• Memory power is a significant portion of total power
  – 19% (avg) in our system; up to 40% noted in other works
• Sleep state residency is low in many workloads
  – Multicore workloads reduce idle periods
  – CPU-bound applications send requests frequently enough to keep memory devices awake
• Memory bandwidth demand is very low in some workloads
• Memory power is reduced by frequency scaling
  – And voltage scaling can give further reductions
DVFS for Memory
• Key idea: observe memory bandwidth utilization, then adjust memory frequency/voltage to reduce power with minimal performance loss
  → Dynamic Voltage/Frequency Scaling (DVFS) for memory
• Goal in this work:
  – Implement DVFS in the memory system
  – Develop a simple control algorithm that exploits opportunities for reduced memory frequency/voltage by observing behavior
  – Evaluate the proposed algorithm on a real system
Outline
• Motivation
• Background and Characterization
  – DRAM Operation
  – DRAM Power
  – Frequency and Voltage Scaling
• Performance Effects of Frequency Scaling
• Frequency Control Algorithm
• Evaluation and Conclusions
DRAM Operation
• Main memory consists of DIMMs of DRAM devices
• Each DIMM is attached to a memory bus (channel)
• Multiple DIMMs can connect to one channel
[Figure: eight ×8 DRAM devices on a DIMM sharing a 64-bit memory bus to the memory controller]
Inside a DRAM Device
[Figure: DRAM device internals — a bank (row decoder, array, sense amps, column decoder), I/O circuitry (receivers, drivers, registers, write FIFO), and on-die termination (ODT)]
• Banks
  – Independent arrays
  – Asynchronous: independent of memory bus speed
• I/O circuitry
  – Runs at bus speed
  – Clock sync/distribution
  – Bus drivers and receivers
  – Buffering/queueing
• On-die termination (ODT)
  – Required by bus electrical characteristics for reliable operation
  – Resistive element that dissipates power when the bus is active
Effect of Frequency Scaling on Power
Reducing the memory bus frequency:
• Does not affect bank power:
  – Constant energy per operation
  – Depends only on utilized memory bandwidth
• Decreases I/O power:
  – Dynamic power in the bus interface and clock circuitry drops due to less frequent switching
• Increases termination power:
  – The same data takes longer to transfer, so bus utilization increases
• The tradeoff between I/O and termination results in a net power reduction at lower frequencies
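A toy sketch of this tradeoff (illustrative only — the coefficients below are made up, not the paper's calibrated analytical model):

```python
def memory_power_watts(freq_mhz: float, bandwidth_gbps: float) -> float:
    """Toy model of DRAM power vs. bus frequency (hypothetical coefficients)."""
    # Bank power: constant energy per operation, so it depends only on
    # the utilized bandwidth, not on the bus frequency.
    bank = 0.5 * bandwidth_gbps

    # I/O power: dynamic switching power scales with bus frequency.
    io = 0.004 * freq_mhz

    # Termination power: scales with bus utilization; at a lower
    # frequency the same data occupies the bus for longer.
    peak_gbps = 0.008 * freq_mhz          # e.g. 1333 MHz -> ~10.7 GB/s
    utilization = min(bandwidth_gbps / peak_gbps, 1.0)
    termination = 3.0 * utilization

    return bank + io + termination

# For a light workload, the I/O savings outweigh the termination increase:
for f in (1333, 1066, 800):
    print(f"{f} MHz: {memory_power_watts(f, bandwidth_gbps=1.0):.2f} W")
```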
Effects of Voltage Scaling on Power
• Voltage scaling further reduces power because all parts of the memory devices draw less current at lower voltage
• Voltage reduction is possible because stable operation requires a lower voltage at a lower frequency:
[Figure: Minimum stable voltage for 8 DIMMs in a real system — DIMM voltage (V, 1.0–1.6) at 1333, 1066, and 800 MHz, alongside the Vdd used in the power model]
How Much Memory Bandwidth is Needed?
[Figure: Memory Bandwidth for SPEC CPU2006 — per-channel bandwidth (GB/s, 0–7) for each benchmark]
Performance Impact of Static Frequency Scaling
• Performance impact is proportional to bandwidth demand
• Many workloads tolerate a lower frequency with minimal performance drop
[Figure: Performance loss under static frequency scaling (%, 0–80) for each benchmark, for 1333→800 and 1333→1066 MHz]
[Figure: The same data zoomed to a 0–8% performance drop — most benchmarks lose only a few percent at 1333→800 and 1333→1066 MHz]
Memory Latency Under Load
• At low load, most time is spent in array access and bus transfer
  → small constant offset between the latency curves at different bus frequencies
• As load increases, queueing delay begins to dominate
  → bus frequency significantly affects latency
[Figure: Memory latency (ns, 60–180) as a function of utilized channel bandwidth (MB/s, 0–8000) at 800, 1067, and 1333 MHz]
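A standard queueing intuition (an illustrative M/M/1-style approximation, not a model from this talk) captures both regimes:

$$L(\rho) \;\approx\; T_{\text{array}} + T_{\text{bus}} + T_{\text{bus}} \, \frac{\rho}{1-\rho}, \qquad \rho = \frac{BW_{\text{utilized}}}{BW_{\text{peak}}}$$

At low utilization the curves differ only by the small constant $T_{\text{bus}}$ offset; as $\rho \to 1$ the queueing term dominates, and a lower bus frequency reduces $BW_{\text{peak}}$, so the same demand pushes $\rho$ (and hence latency) up much sooner.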
Control Algorithm: Demand-Based Switching

After each epoch of length T_epoch:
    measure per-channel bandwidth BW
    if BW < T_800:       switch to 800 MHz
    else if BW < T_1066: switch to 1066 MHz
    else:                switch to 1333 MHz
[Figure: The same latency-vs-bandwidth curves with the switching thresholds T_800 and T_1066 marked]
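A minimal runnable sketch of this policy (thresholds taken from the BW(0.5, 1) variant described later; the bandwidth measurement itself is a hypothetical hook — a real system would read a hardware counter each epoch):

```python
T_EPOCH_MS = 10       # epoch length used in the evaluation
T_800_GBPS = 0.5      # below this demand, drop to 800 MHz
T_1066_GBPS = 1.0     # below this demand, drop to 1066 MHz (BW(0.5, 1))

def next_frequency_mhz(bw_gbps: float) -> int:
    """Pick the lowest frequency the measured bandwidth demand allows."""
    if bw_gbps < T_800_GBPS:
        return 800
    elif bw_gbps < T_1066_GBPS:
        return 1066
    else:
        return 1333

# Example: a light phase stays at 800 MHz; a memory-intensive phase
# snaps back to full speed within one epoch.
for bw in (0.3, 0.7, 4.0):
    print(f"{bw} GB/s -> {next_frequency_mhz(bw)} MHz")
```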
Implementing V/F Switching
• Halt memory operations
  – Pause requests
  – Put DRAM in self-refresh
  – Stop the DIMM clock
• Transition voltage/frequency
  – Begin voltage ramp
  – Relock the memory controller PLL at the new frequency
  – Restart the DIMM clock
  – Wait for DIMM PLLs to relock
• Begin memory operations
  – Take DRAM out of self-refresh
  – Resume requests

✓ Memory frequency is already adjustable statically
✓ Voltage regulators for CPU DVFS can work for memory DVFS
✓ The full transition takes ~20 µs
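A hypothetical, heavily simplified driver-level sketch of this sequence (every hardware call here is a stub; on a real platform these steps are memory-controller and voltage-regulator register writes):

```python
import time

def _hw(step: str) -> None:
    # Stub standing in for a hardware register write.
    print("hw:", step)

def switch_memory_frequency(target_mhz: int) -> None:
    """Simplified V/F switching sequence; the real transition takes ~20 us."""
    # 1. Halt memory operations
    _hw("pause memory requests")
    _hw("put DRAM in self-refresh")   # DRAM retains its contents on its own
    _hw("stop DIMM clock")
    # 2. Transition voltage/frequency
    _hw(f"begin voltage ramp for {target_mhz} MHz")
    _hw(f"relock memory controller PLL at {target_mhz} MHz")
    _hw("restart DIMM clock")
    _hw("wait for DIMM PLLs to relock")
    time.sleep(20e-6)                 # stand-in for the ~20 us total latency
    # 3. Resume memory operations
    _hw("take DRAM out of self-refresh")
    _hw("resume memory requests")

switch_memory_frequency(800)
```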
Evaluation Methodology
• Real-system evaluation
  – Dual 4-core Intel Xeon®, 3 memory channels/socket
  – 48 GB of DDR3 (12 DIMMs, 4 GB dual-rank, 1333 MHz)
• Emulating memory frequency for performance
  – Altered memory controller timing registers (tRC, tB2BCAS)
  – Gives performance equivalent to slower memory frequencies
• Modeling power reduction
  – Measure the baseline system (AC power meter, 1 s samples)
  – Compute reductions with an analytical model (see paper)
Evaluation Methodology (cont.)
• Workloads
  – SPEC CPU2006: CPU-intensive workloads
  – All cores run a copy of the benchmark
• Parameters
  – T_epoch = 10 ms
  – Two variants of the algorithm with different switching thresholds:
    BW(0.5, 1): T_800 = 0.5 GB/s, T_1066 = 1 GB/s
    BW(0.5, 2): T_800 = 0.5 GB/s, T_1066 = 2 GB/s → more aggressive frequency/voltage scaling
Performance Impact of Memory DVFS
• Minimal performance degradation: 0.2% (avg), 1.7% (max)
• Experimental error is ~1%
[Figure: Performance degradation (%, −1 to 4) for each benchmark and on average, for BW(0.5, 1) and BW(0.5, 2)]
Memory Frequency Distribution
• The frequency distribution shifts toward higher memory frequencies for more memory-intensive benchmarks
[Figure: Fraction of time (0–100%) spent at 1333, 1066, and 800 MHz for each benchmark]
Memory Power Reduction
• Memory power is reduced by 10.4% (avg), 20.5% (max)
[Figure: Memory power reduction (%, 0–25) for each benchmark and on average, for BW(0.5, 1) and BW(0.5, 2)]
System Power Reduction
• As a result, system power is reduced by 1.9% (avg), 3.5% (max)
[Figure: System power reduction (%, 0–4) for each benchmark and on average, for BW(0.5, 1) and BW(0.5, 2)]
System Energy Reduction
• System energy is reduced by 2.4% (avg), 5.1% (max)
[Figure: System energy reduction (%, −1 to 6) for each benchmark and on average, for BW(0.5, 1) and BW(0.5, 2)]
Related Work
• MemScale [Deng11], concurrent work (ASPLOS 2011)
  – Also proposes memory DVFS
  – Uses an application performance-impact model to decide voltage and frequency; this requires system-specific modeling, which our bandwidth-based approach avoids
  – Simulation-based evaluation; our work is a real-system proof of concept
• Memory sleep states
  – Creating opportunity with data placement [Lebeck00, Pandey06], OS scheduling [Delaluz02], or the VM subsystem [Huang05]; making better decisions with better models [Hur08, Fan01]
• Power limiting/shifting
  – RAPL [David10] uses memory throttling for thermal limits; CPU throttling for memory traffic [Lin07, Lin08]; power shifting across the system [Felter05]
Conclusions
• Memory power is a significant component of system power
  – 19% on average in our evaluation system; up to 40% in other work
• Workloads often keep memory active but underutilized
  – Channel bandwidth demands are highly variable
  – Use of memory sleep states is often limited
• Scaling memory frequency/voltage can reduce memory power with minimal system performance impact
  – 10.4% average memory power reduction
  – Yields 2.4% average system energy reduction
• Greater reductions are possible with a wider frequency/voltage range and better control algorithms
Why Real-System Evaluation?
• Advantages:
  – Captures all effects of altered memory performance: system/kernel code, interactions with I/O and peripherals, etc.
  – Able to run full-length benchmarks (SPEC CPU2006) rather than short instruction traces
  – No concerns about architectural simulation fidelity
• Disadvantages:
  – More limited room for novel algorithms and detailed measurements
  – Inherent experimental error due to background-task noise, real power measurements, and nondeterministic timing effects
• For a proof of concept, we chose to run on a real system in order to capture all potential side effects of altering memory frequency
CPU-Bound Applications in a DRAM-Rich System
• We evaluate CPU-bound workloads with 12 DIMMs: what about smaller memory configurations, or I/O-bound workloads?
• 12 DIMMs (48 GB): are we magnifying the problem?
  – Large servers can have this much memory, especially for database or enterprise applications
  – Memory can be up to 40% of system power [1, 2], and reducing its power is an academically interesting problem in general
• CPU-bound workloads: will it matter in real life?
  – Many workloads have CPU-bound phases (e.g., database scans or business logic in server workloads)
  – Focusing on CPU-bound workloads isolates the problem of varying memory bandwidth demand while memory cannot enter sleep states, and our solution applies to any compute phase of a workload

[1] L. A. Barroso and U. Hölzle. "The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines." Synthesis Lectures on Computer Architecture, Morgan & Claypool, 2009.
[2] C. Lefurgy et al. "Energy Management for Commercial Servers." IEEE Computer, pp. 39–48, December 2003.
Combining Memory & CPU DVFS?
• Our evaluation did not incorporate CPU DVFS:
  – We need to understand the effect of a single knob (memory DVFS) first
  – Combining it with CPU DVFS might produce second-order effects that would need to be accounted for
• Nevertheless, memory DVFS is effective by itself, and mostly orthogonal to CPU DVFS:
  – Each knob reduces power in a different component
  – Our memory DVFS algorithm has negligible performance impact → negligible impact on CPU DVFS
  – CPU DVFS will only further reduce bandwidth demands relative to our evaluations → no negative impact on memory DVFS
Why is this Autonomic Computing?
• Power management in general is autonomic: a system observes its own needs and adjusts its behavior accordingly
  → Much previous work comes from the architecture community, but crossover in ideas and approaches could be beneficial
• This work exposes a new knob for control algorithms to turn, provides a simple model for the power/energy effects of that knob, and observes the opportunity to apply it in a simple way
• Exposes future work on:
  – More advanced control algorithms
  – Coordinated energy efficiency across the rest of the system
  – Coordinated energy efficiency across a cluster/datacenter, integrated with memory DVFS, CPU DVFS, etc.