Power Control for Data Centers
Ming Chen
ECE 692 Topic Presentation, Oct. 8th, 2009
2
Why power control in Data Centers?
Power is one of the most important computing resources.
Facility over-utilized: dangerous
− System failure and overheating
− Power must stay below the capacity.
Facility under-utilized: wastes the cost of power facilities
− Economically amortize the investment.
− Provision to fully utilize the power facility.
3
SHIP: Scalable Hierarchical Power Control for Large-Scale Data Centers
Xiaorui Wang, Ming Chen (University of Tennessee, Knoxville, TN)
Charles Lefurgy, Tom W. Keller (IBM Research, Austin, TX)
4
Introduction
Power overload may cause system failures.
− Power provisioning CANNOT guarantee freedom from overload.
− Over-provisioning may cause unnecessary expenses.
Power control for an entire data center is necessary.
Data centers are expanding to meet new business requirements.
− Cost-prohibitive to expand the power facility
− Upgrades of power/cooling systems lag far behind.
− Example: NSA data center
5
Challenges
Scalability: One centralized controller for thousands of servers?
Coordination: if multiple controllers are designed, how do they interact with each other?
Stability and accuracy: workload is time-varying and unpredictable.
Performance: how to allocate power budgets among different servers, racks, etc.?
6
State of the Art
Reduce power by improving energy efficiency: [Lefurgy], [Nathuji], [Zeng], [Lu], [Brooks], [Horvath], [Chen]
− Do NOT enforce a power budget.
Power control for a single server [Lefurgy], [Skadron], [Minerick] or a single rack [Wang], [Ranganathan], [Femal]
− Cannot be directly applied to data centers.
No “Power” Struggles presents a multi-level power manager [Raghavendra].
− NOT designed based on the power supply hierarchy
− NO rigorous overall stability analysis
− Only simulation results, for 180 servers
7
What is This Paper About?
SHIP: a highly Scalable Hierarchical Power control architecture for large-scale data centers
− Scalability: decompose the power control for a data center into three levels.
− Coordination: the hierarchy follows the power distribution system in data centers.
− Stability and accuracy: theoretically guaranteed by Model Predictive Control (MPC) theory.
− Performance: differentiate power budgets based on performance demands, i.e., utilization.
8
Power Distribution Hierarchy
A simplified example of a three-level data center:
− Data center level
− PDU level
− Rack level
Thousands of servers in total
9
Control Architecture

[Figure: control architecture. At the rack level, each server has a Utilization Monitor (UM) and a Frequency Modulator (FM), and each rack has a Power Monitor (PM) and a Rack Power Controller (RPC). A PDU-level Power Monitor and PDU Power Controller sit above the racks, and a data-center-level monitor and controller sit above the PDUs.]

− Rack level (HPCA08 paper): controlled variable = total power of the rack; manipulated variable = CPU frequency of each server
− PDU level (this paper): controlled variable = total power of the PDU; manipulated variable = power budget of each rack
− Data center level (this paper): controlled variable = total power of the data center; manipulated variable = power budget of each PDU
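To make the three-level architecture concrete, here is a minimal Python sketch (not the authors' implementation; all class and variable names are hypothetical) of how budgets could flow top-down through the hierarchy. The even split is only a placeholder for the MPC-based allocation described on the following slides.

# Minimal sketch of top-down budget propagation (hypothetical names; the even
# split stands in for the MPC allocation described on later slides).

class RackController:
    def __init__(self, servers):
        self.servers = servers                      # list of server IDs

    def control(self, rack_budget_w):
        # The rack level manipulates per-server CPU frequency; here we only
        # compute each server's share of the rack budget.
        share = rack_budget_w / len(self.servers)
        return {s: share for s in self.servers}

class PDUController:
    def __init__(self, racks):
        self.racks = racks                          # list of RackController

    def control(self, pdu_budget_w):
        share = pdu_budget_w / len(self.racks)      # placeholder allocation
        return [r.control(share) for r in self.racks]

class DataCenterController:
    def __init__(self, pdus):
        self.pdus = pdus                            # list of PDUController

    def control(self, dc_budget_w):
        share = dc_budget_w / len(self.pdus)        # placeholder allocation
        return [p.control(share) for p in self.pdus]

# Example: 2 PDUs x 2 racks x 2 servers sharing an 8,000 W data center budget.
dc = DataCenterController([PDUController([RackController(["s1", "s2"]),
                                          RackController(["s3", "s4"])])
                           for _ in range(2)])
print(dc.control(8000.0))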
10
PDU-level Power Model
System model:
    pp(k+1) = pp(k) + Σ_{i=1}^{N} pr_i(k)
where
    pp(k): the total power of the PDU
    pr_i(k): the power change of rack i
    br_i(k): the change of power budget for rack i

Uncertainties:
    pr_i(k) = g_i · br_i(k), where g_i is the power change ratio.

Actual model:
    pp(k+1) = pp(k) + [g_1 ... g_N] [br_1(k) ... br_N(k)]^T
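A tiny numerical sketch of the model above (the gains, power, and budget changes are hypothetical, not values from the paper):

import numpy as np

# pp(k+1) = pp(k) + sum_i g_i * br_i(k), where g_i is the (unknown) power
# change ratio of rack i and br_i(k) is the budget change given to rack i.

N = 3                                    # racks under this PDU (hypothetical)
g = np.array([1.2, 0.8, 1.0])            # actual power change ratios, unknown to the controller
pp = 900.0                               # current PDU power (W)
br = np.array([-20.0, 10.0, -5.0])       # budget changes decided by the controller (W)

pp_next = pp + g @ br                    # next-step PDU power
print(pp_next)                           # 900 + (1.2*-20 + 0.8*10 + 1.0*-5) = 879.0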
11
Model Predictive Control (MPC)
Design steps:
− Design a dynamic model for the controlled system.
− Design the controller.
− Analyze the stability and accuracy.
Control objective:
    min_{br_j(k), 1 ≤ j ≤ N}  (pp(k+1) − P_s)²
subject to:
    P_{min,j} ≤ b_j(k) + br_j(k) ≤ P_{max,j},  1 ≤ j ≤ N
    pp(k+1) ≤ P_s
where P_s is the PDU-level power set point, b_j(k) is the current power budget of rack j, and P_{min,j}, P_{max,j} bound the budget of rack j.
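As an illustration, the following is a one-step, single-PDU simplification of this objective solved as a bounded least-squares problem with SciPy; the actual controller solves the multi-horizon MPC problem shown on the next slide, and all numbers below are hypothetical.

import numpy as np
from scipy.optimize import lsq_linear

# Choose budget changes br_j(k) so that the predicted PDU power
# pp(k) + sum_j br_j(k) tracks the set point P_s, while each rack's new
# budget b_j(k) + br_j(k) stays within [P_min_j, P_max_j].

pp_k   = 950.0                           # measured PDU power (W)
P_s    = 900.0                           # PDU power set point (W)
b_k    = np.array([320.0, 330.0, 300.0]) # current rack budgets (W)
P_min  = np.array([200.0, 200.0, 200.0]) # per-rack budget lower bounds (W)
P_max  = np.array([360.0, 360.0, 360.0]) # per-rack budget upper bounds (W)

# Minimize (pp_k + sum_j br_j - P_s)^2  <=>  least squares with A = [1 1 1].
A = np.ones((1, 3))
target = np.array([P_s - pp_k])
res = lsq_linear(A, target, bounds=(P_min - b_k, P_max - b_k))
br = res.x                               # budget changes for the three racks
print(br, pp_k + br.sum())               # the PDU power reaches the 900 W set point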
MPC Controller Design
12
[Block diagram: the System Model, Cost Function, Constraints, and Reference Trajectory feed a Least Squares Solver. Inputs are the power budget P_s and the measured power pp(k); the output is the vector of budget changes [br_1(k) ... br_N(k)]. An ideal reference trajectory is used to track the budget.]

Cost function (tracking error + control penalty):

    V(k) = Σ_{i=1}^{P} || pp(k+i|k) − ref(k+i|k) ||²_{Q(i)}
         + Σ_{i=0}^{M−1} || br(k+i|k) + Δbr(k+i|k) − P_max ||²_{R(i)}

The first (tracking-error) term penalizes the predicted PDU power pp(k+i|k) deviating from the reference trajectory ref(k+i|k) that converges to the budget P_s; the second (control-penalty) term drives each rack's budget toward its estimated maximum power consumption P_max.
13
Stability
Local stability
− g_i is assumed to be 1 at design time, but is unknown a priori.
− The system remains stable as long as 0 < g_i < 14.8, i.e., the actual power change can be up to 14.8 times the allocated budget change.
Global stability
− Decouple controllers at different levels by running them on different time scales.
− The period of the upper-level control loop must be longer than the settling time of the lower-level loop.
− This condition is sufficient but not necessary.
14
System Implementation

Physical testbed
− 10 Linux servers
− Power meter (Wattsup): error 1.5%, sampling period 1 sec
− Workload: HPL, SPEC
− Controllers: period 5 s for the rack level, 30 s for the PDU level

Simulator (C++)
− Simulates large-scale data centers at three levels.
− Utilization trace files from 5,415 servers in real data centers
− Power model based on experiments on real servers
15
Precise Power Control (Testbed)
[Figure: PDU power and per-rack power (Racks 1-3) with their budgets over 2,400 s on the testbed.]
Power can be precisely controlled at the budget; the budget is reached within 4 control periods.
The power of each rack is controlled at its budget; budgets are proportional to P_max (the estimated maximum power consumption).
Tested for many power set points (see the paper for more results).
16
Power Differentiation (Testbed)

[Figure: per-rack power budgets over time; the racks run at 100%, 80%, and 50% CPU utilization. Budgets are first allocated proportionally to estimated max consumptions, then differentiated by utilization.]

Capability to differentiate budgets based on workload to improve performance
− Utilization is used as the optimization weight.
− Other possible differentiation metrics: response time, throughput
17
Simulation for Large-scale Data Centers
[Figure (left): total data center power vs. time (control periods) for 6 PDUs and 270 racks driven by real data traces, tracking a 750 kW set point.]

Randomly generate 3 data centers driven by real data traces.

[Figure (right): controlled data center power vs. power set point (600-780 kW) for the 3 data centers.]
18
Budget Differentiation for PDUs
[Figure (left): CPU utilization (%) of PDU1-PDU6 over 14 control periods.]
[Figure (right): difference between the allocated budget and the estimated maximum power consumption (kW) for PDU1-PDU6 over the same periods.]
Power differentiation in large-scale data centers
− Minimize the difference between each budget and the estimated max power consumption.
− Utilization is used as the weight.
− The order of the differences is consistent with the utilization order (e.g., PDU5 vs. PDU2 in the figures).
19
Scalability of SHIP

[Figure: execution time of the MPC controller vs. the number of servers (up to 3,000). The centralized controller's execution time grows from 0.09 s (50 servers) and 0.39 s (100 servers) through 65.9 s, 452.1 s, and 3,223.6 s to 10,997.5 s at the largest scales; the maximum scale of the centralized controller is marked.]

Overhead of SHIP (centralized controller vs. SHIP):
− Level: one level vs. multiple levels
− Computation overhead: large vs. small
− Communication overhead: long vs. short
− Scalability: NO vs. YES
20
Conclusion

SHIP: a highly Scalable HIerarchical Power control architecture for large-scale data centers
− Three levels: rack, PDU, and data center
− MIMO controllers based on optimal control theory (MPC)
− Theoretically guaranteed stability and accuracy
− Discussion of coordination among controllers

Experiments on a physical testbed and a simulator
− Precise power control
− Budget differentiation
− Scalable to large-scale data centers
21
Power Provisioning for a Warehouse-sized Computer
Xiaobo Fan, Wolf-Dietrich Weber, Luiz Andre Barroso

Acknowledgments: the organization and contents of some slides are based on Xiaobo Fan's slides (PDF).
22
Introduction
Strong economic incentives to fully utilize power facilities
− The investment is best amortized.
− Upgrades possible without any new power facility investment

[Figure: cost picture. Power facilities cost $10-$20/watt, amortized over roughly 10-18 years, while electricity costs less than $0.8/watt-year; typical facility utilization is only about 0.5-0.85.]

Over-subscription runs the risk of outages or costly SLA violations.
Problem: power provisioning given the budget.
23
Reasons for Facility Under-utilization
Staged deployment
− New facilities are rarely fully populated.
Fragmentation
Conservative machine power ratings (nameplate)
Statistical effects
− Larger machine population, lower probability of simultaneous peaks
Variable load
24
What is This Paper About?
Investigate the over-subscription potential to increase power facility utilization.
− A lightweight and accurate model for estimating power
− Long-term characterization of the simultaneous power usage of a large number of machines
Study of techniques for saving energy as well as peak power.
− Power capping (physical testbed)
− DVS (simulation)
− Reduced idle power (simulation)
25
Data Center Power Distribution
[Figure: typical data center power distribution. The main supply and a backup generator feed, via a transformer and ATS, a switchboard (~1,000 kW), which feeds UPS units; STS/PDUs (~200 kW) feed panels (~50 kW), which feed individual rack circuits (~2.5 kW).]

− Rack level: ~2.5 kW circuit, 40-80 servers
− PDU level: 20-40 racks
− Data center level: 5-10 PDUs
26
Power Estimation Model
A model is fitted for each family of machines; the greater interest is in estimating power for groups of machines.
Direct measurements are not always available.
Input: CPU utilization u
Models:
− Linear: P_idle + (P_busy − P_idle) · u
− Empirical: P_idle + (P_busy − P_idle) · (2u − u^r)
− Measure and derive <P_idle, P_busy, r> for each machine family.
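A small Python sketch of these two models (the parameter values in the example are hypothetical, not from the paper):

# Sketch of the two CPU-utilization-based power models.
# P_idle, P_busy, and r are measured/fitted per machine family.

def power_linear(u, p_idle, p_busy):
    """Linear model: P_idle + (P_busy - P_idle) * u, with u in [0, 1]."""
    return p_idle + (p_busy - p_idle) * u

def power_empirical(u, p_idle, p_busy, r):
    """Empirical model: P_idle + (P_busy - P_idle) * (2u - u^r)."""
    return p_idle + (p_busy - p_idle) * (2 * u - u ** r)

# Example: a hypothetical machine family with 160 W idle, 250 W busy, r = 1.4
print(power_linear(0.5, 160, 250))          # 205.0 W
print(power_empirical(0.5, 160, 250, 1.4))  # ~216 W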
27
Model Validation
PDU-level validation example (800 machines)
− Almost constant offset between the model and the measurements
− Loads not accounted for in the model: networking equipment
Relative error is below 1%:
    error = (1/n) Σ_i |E(M_i) − c_i| / c_i
where E(M_i) is the model estimate and c_i is the measured power for sample i.
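For illustration, the relative-error metric above computed on a few hypothetical samples:

import numpy as np

# Mean of |estimate - measured| / measured over n samples (hypothetical values).
estimated = np.array([480.0, 510.0, 495.0])   # model estimates E(M_i), kW
measured  = np.array([485.0, 512.0, 500.0])   # measured PDU power c_i, kW

error = np.mean(np.abs(estimated - measured) / measured)
print(f"{error:.2%}")                          # ~0.81%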
28
Analysis Setup
Data center setup
− Pick more than 5,000 servers for each workload.
− Rack: 40 machines; PDU: 800 machines; Cluster: 5,000+ machines
Monitoring period: 6 months, sampled every 10 minutes
Distribution of power usage
− Aggregate power at each time interval at different levels.
− Normalized to the aggregated peak power
Workload          Description
Websearch         Online serving correlated with time of day; computation-intensive
Webmail           Disk I/O intensive
Mapreduce         Offline batch jobs; less correlation between activity and time of day
Real data center  Machines picked randomly from the data centers
29
Webmail
[Figure: CDFs of normalized power for Webmail at rack, PDU, and cluster level; peak power is 92% at the rack level, 88% at the PDU level, and 86% at the cluster level, and the cluster-level range is roughly 72%-86% (rack-level values reach down to about 65%).]

Higher level, narrower range
− More difficult to improve facility utilization at lower levels.
Peak lowers as more machines are aggregated.
− 16% more machines can be deployed within the same power budget.
30
Websearch
[Figure: CDFs of normalized power for Websearch at rack, PDU, and cluster level; peak power is 98% at the rack level and 93% at the cluster level, with ranges of roughly 45%-98% (rack) and 52%-93% (cluster).]

Higher level, narrower range
− More difficult to improve facility utilization at lower levels.
Peak lowers as more machines are aggregated.
− 7% more machines can be deployed within the same power budget.
31
Real Data Centers
Clusters have a much narrower dynamic range than racks.
Clusters peak at 72% of aggregate peak power.
− 39% more machines can be deployed.
Mapreduce shows similar results.
32
Summary of Characterization
Workload Avg power Power range Machine increase
Websearch 68% 52%-93% 7%
Webmail 78% 72%-86% 16%
Mapreduce 70% 54%-90% 11%
Real data center 60% 51%-72% 39%
− Average power: utilization of the power facilities
− Dynamic range: difficulty of improving facility utilization
− Peak power: potential for deployment over-subscription
33
Power Capping

[Figure: power CDF and power-vs-time sketches showing that the time spent in power capping is only the tail of the CDF, while the corresponding saving in peak power is substantial.]

Small fraction of time in power capping, substantial saving in peak power.
Provides a safety valve when the workload behaves unexpectedly.
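A small numpy sketch of this intuition using a synthetic power trace (not the paper's data): because high power levels occupy only the tail of the CDF, a cap well below the observed peak is active only a small fraction of the time.

import numpy as np

rng = np.random.default_rng(0)
power = 600 + 40 * rng.standard_normal(10_000)   # synthetic trace, mostly near 600 kW
power += 120 * (rng.random(10_000) > 0.99)       # rare load spikes

cap = np.percentile(power, 98)                   # choose a cap below the observed peak
time_capped = np.mean(power > cap)               # fraction of time the cap is active (~2%)
peak_saving = (power.max() - cap) / power.max()  # peak power reduction enabled by the cap

print(f"cap = {cap:.0f} kW, time capped = {time_capped:.1%}, peak reduced by {peak_saving:.1%}")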
34
Results for Power Capping
− For workloads with loose SLAs or low priority; Websearch and Webmail are excluded.
− Capping is enforced by de-scheduling tasks or by DVFS.
CPU Voltage/Frequency Scaling

Motivation
− A large portion of dynamic power is consumed by the CPU.
− DVS is widely available in modern CPUs.

[Figure: CPU power vs. utilization, with the DVS trigger threshold marked.]

Method
− Oracle-style policy; utilization thresholds of 5%, 20%, and 50%
− Simulation; CPU power is halved when DVS is triggered.
35
36
Results for DVS
− Energy savings are larger than peak power reductions.
− The biggest savings are at the data center level.
− Benefits vary with the workload.
37
Lower Idle Power
Motivation
− Idle power is high (more than 50% of peak, about 0.6 of peak in the figure).
− Most of the time is spent at non-peak activity levels.
What if idle power were 10% of peak, keeping peak power unchanged? (Simulation)

[Figure: CPU power vs. utilization, with idle power lowered from about 0.6 of peak to 0.1 of peak.]
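A quick what-if sketch using the linear power model from the estimation slide (the utilization samples and peak power are hypothetical):

# Average power with idle at 60% of peak vs. a hypothetical 10% of peak,
# keeping peak power unchanged.

def avg_power(utilizations, p_peak, idle_fraction):
    p_idle = idle_fraction * p_peak
    return sum(p_idle + (p_peak - p_idle) * u for u in utilizations) / len(utilizations)

# Hypothetical utilization samples for a mostly lightly loaded machine.
util = [0.1, 0.2, 0.15, 0.3, 0.25, 0.1, 0.2]
p_peak = 250.0  # W, hypothetical

print(avg_power(util, p_peak, 0.6))   # ~169 W
print(avg_power(util, p_peak, 0.1))   # ~67 W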
38
Conclusions
Power provisioning is important to amortize facility investment.
Load variation and statistical effects lead to facility under-utilization.
Over-subscribing deployment is more attractive at the cluster level than at the rack level.
Three simple strategies to improve facility utilization: power capping, DVS, and lower idle power
39
Comparison of the Two Papers
SHIP vs. Power Provisioning
Target: the power capacity of data centers (both papers).
Goal: SHIP controls power to the budget to avoid facility over-utilization; Power Provisioning gives provisioning guidelines to avoid facility under-utilization.
Methodology: MIMO optimal control vs. statistical analysis.
Solutions: a complete control-based solution vs. strategies suggested based on real data analysis.
Experiments: physical testbed and simulation based on real trace files vs. detailed analysis of real trace files and simulations.
40
Critiques
Paper 1
− The workload is not typical of real data centers.
− The power model could also take CPU utilization into account.
− No convincing baseline is compared against.
Paper 2
− Power provisioning vs. performance violations
− The power model is workload-sensitive.
− Estimation accuracy at the rack level?
− Quantitative analysis of idle power and peak power reduction