Efficient Provisioning of Bursty Scientific Workloads on the
Cloud Using Adaptive Elasticity Control
Ahmed Ali-Eldin, Johan Tordsson, and Erik Elmroth
Department of Computing Science, Umeå University, Sweden
www.cloudresearch.se
Maria Kihl
Lund Center for Control of Complex Engineering Systems, Lund University, Sweden
http://www.lccc.lth.se/
Context
eSSENCE is a strategic collaborative research programme in e-science
between Uppsala University, Lund University and Umeå University.
eSSENCE strives to create a research environment that enables a strong
interplay between e-science research, e-infrastructures, e-education,
industry and society.
Motivation & Problem definition
The cloud elasticity problem
- How much capacity to (de)allocate to a cloud service (and when)?
  - Bursty and unknown workload
  - Increase ability to meet SLAs
  - Reduce resource usage
- One of the limitations identified by Truong et al. [1] to the wide
  adoption of the cloud paradigm for scientific computing
Problem Description
- Prediction of load/signal/future is not a new problem
- Studied extensively within many disciplines
  - Time series analysis
  - Econometrics
  - Control theory
  - Stock markets
  - Biology, etc.
- Multiple solutions proposed to the prediction problem
  - Neural networks
  - Fuzzy logic
  - Adaptive control
  - Regression
  - Kriging models
- However, the solution must be suitable for our problem
Requirements
- Vary capacity allocated to a service
  - According to current and future load
- Fulfill QoS requirements to meet SLAs
  - Without costly over-provisioning
- Robustness
  - Avoid oscillations or behavioral changes
- Scalability
  - Tens of thousands of servers + even more VMs
- Adaptive to changing workloads
  - PID controllers are reliable for certain load patterns, but unstable
    once the load or system dynamics change
- Fast
  - Limited look-ahead control is accurate but too slow
  - Can take 30 min to control 15 servers and 60 VMs
- Simplicity
  - Key to adoption
Our approach: Adaptive Hybrid control
- Closed loop control
- Adaptive control: P-controller
  - Adjust error signal by gain parameter
  - Error signal is the difference between current and desired output
  - Change signal adjustments with load dynamics
- Hybrid control, a controller that combines
  - Reactive control (step controller)
  - Proactive control (proportional, P-controller)
Initial model and assumptions
- Service with homogeneous requests
- Short requests that take one time unit (or less) to serve
- VM startup time is negligible
- Delayed requests are dropped
- VM capacity constant
- Infrastructure modeled as G/G/N queue
  - N (#VMs) varies over time
- Perfect load balancing assumed
A. Ali-Eldin, J. Tordsson, and E. Elmroth. An adaptive hybrid
elasticity controller for cloud infrastructures. In NOMS 2012,
IEEE/IFIP Network Operations and Management Symposium. IEEE,
2012.
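The assumptions of the initial model can be illustrated with a small sketch of a single time unit. This is only an illustration of the stated assumptions (constant VM capacity, perfect load balancing, delayed requests dropped), not the simulator used in the cited paper; the function name and its parameters are hypothetical.

```python
def serve_time_unit(arrivals, n_vms, vm_capacity):
    """Serve one time unit of homogeneous requests on n_vms identical VMs.

    Assumptions taken from the model: every request needs at most one time
    unit, load balancing is perfect, and requests that cannot be served
    within the time unit are dropped rather than queued.
    """
    total_capacity = n_vms * vm_capacity   # constant capacity per VM
    served = min(arrivals, total_capacity)
    dropped = arrivals - served            # delayed requests are dropped
    return served, dropped


# 1200 requests arrive while 2 VMs of capacity 500 are running.
print(serve_time_unit(1200, 2, 500))
```

Because N varies over time, a simulation would call such a function once per time unit with the number of VMs chosen by the controller for that unit.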
Our approach (cont.)
Adaptive control (cont.)
- How to estimate change in workload?

    F = C * P

  where F is the estimated load change, C is the average capacity in the
  last time window, and P is the gain parameter
- Window size changes dynamically
  - Smaller upon prediction errors
  - A tolerance level decides how often the window is resized
- Two gain parameter alternatives studied
  1. Periodical rate of change:
     P = Load change / avg. rate in last time window
     Denoted P_1 henceforth
  2. Ratio of load change over average system rate:
     P = Load change / avg. rate over all time
     Denoted P_2 henceforth
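The relation F = C * P and the two gain alternatives can be written out as a minimal sketch. The function names, the representation of rates as plain lists, and the handling of a zero average rate are assumptions made for illustration; the slides do not specify these details.

```python
from statistics import mean


def gain(load_change, rates):
    """Gain P = load change / average rate over the given samples.

    Passing only the rates of the last time window gives P_1; passing the
    rates observed over all time gives P_2. Returning 0.0 for a zero
    average rate is an assumption, not something stated in the slides.
    """
    avg_rate = mean(rates)
    return load_change / avg_rate if avg_rate else 0.0


def estimated_load_change(avg_capacity, p):
    """F = C * P, with C the average capacity in the last time window."""
    return avg_capacity * p


all_rates = [50, 50, 100, 100]   # hypothetical rate samples over all time
window = all_rates[-2:]          # hypothetical last time window
p1 = gain(50, window)            # P_1 uses the last window only
p2 = gain(50, all_rates)         # P_2 uses the full history
print(estimated_load_change(10, p1), estimated_load_change(10, p2))
```

With a history that contains lower rates than the last window, P_2 comes out larger than P_1 for the same load change, so the two alternatives react with different strength to the same burst.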
Hybrid control (cont.)
- All in all, 9 approaches for scale up (U) and scale down (D),
  reactively (R) and/or proactively (P)
  - UR combined with: DR, DP, DRP
  - UP combined with: DR, DP, DRP
  - URP combined with: DR, DP, DRP
- Notation in the following: URP-DP
  - Scale up: reactive + proactive
  - Scale down: proactive
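The nine combinations follow from pairing each of the three scale-up modes with each of the three scale-down modes. A short sketch that enumerates them in the notation used here (the variable names are illustrative):

```python
from itertools import product

SCALE_UP = ["UR", "UP", "URP"]      # reactive, proactive, both
SCALE_DOWN = ["DR", "DP", "DRP"]    # reactive, proactive, both

# Every controller is one scale-up mode paired with one scale-down mode.
combinations = [f"{up}-{down}" for up, down in product(SCALE_UP, SCALE_DOWN)]

print(len(combinations))
print(combinations)
```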
Performance Evaluation
- Simulation-based evaluations
- 3 aspects studied
  - Best combination of reactive and proactive controllers
  - Controller stability w.r.t. workload size
  - Comparison with state-of-the-art controller
    - Regression control [Iqbal et al., FGCS 2011]
Performance metrics
- Over-provisioning: VMs allocated but not needed
- Under-provisioning: VMs needed, but failed to allocate (SLA violation)
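The two metrics can be computed from per-time-unit counts of needed and allocated VMs, as in the sketch below. Reporting them as percentages of the total number of needed VMs is an assumption made for this example; the slides give percentages but do not state the normalization.

```python
def provisioning_metrics(needed, allocated):
    """Return (over-provisioning %, under-provisioning %).

    needed[i] and allocated[i] are VM counts for time unit i.
    Normalizing by the total number of needed VMs is an assumption.
    """
    over = sum(max(a - n, 0) for n, a in zip(needed, allocated))
    under = sum(max(n - a, 0) for n, a in zip(needed, allocated))
    total_needed = sum(needed)
    if total_needed == 0:
        return 0.0, 0.0
    return 100 * over / total_needed, 100 * under / total_needed


# Hypothetical 4 time units: one over-provisioned, one under-provisioned.
print(provisioning_metrics([10, 10, 10, 10], [12, 10, 8, 10]))
```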
Studied workload
- FIFA98 traces
  - ~3 months of Web server traces (bursty)
  - Grouped requests per second of arrival
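Grouping requests per second of arrival amounts to counting the arrivals that fall into each one-second bucket. A minimal sketch, assuming the trace has already been parsed into a list of arrival timestamps in seconds (the actual trace format is not described here, and the function name is hypothetical):

```python
from collections import Counter


def requests_per_second(timestamps):
    """Count requests per whole second of arrival.

    timestamps: arrival times in seconds (int or float), assumed to be
    already extracted from the trace files.
    """
    return Counter(int(t) for t in timestamps)


load = requests_per_second([0.1, 0.7, 1.2, 3.0, 3.5, 3.9])
print(sorted(load.items()))
```

Seconds without any arrival are simply absent from the result, so a consumer that needs a dense time series has to fill them in with zero.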
Best controller combination
- Scaled FIFA traces x 50
  - Reasonable Internet growth from 1998 to today
- Assume that 1 VM handles 500 requests
  - Reasonable for DB-backend Web servers
- Studied (for the sake of completeness) all 9 combinations of reactive +
  proactive controller
- Some make no sense & indeed performed poorly:
  - Reactive scale down causes oscillations and a lot of
    under-provisioning (SLA violations)
  - Pure proactive scale up tends to skew and cause under-provisioning
- Other approaches more promising:
  - Reactive scale up
    - Fast reaction to load increases, no skew
  - Proactive scale down
    - Keep VMs for a while (just in case) once they are allocated
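The setup above can be made concrete with a rough sketch: the number of VMs needed follows from the scaled load and the per-VM capacity, and a UR-DP style step scales up immediately but releases VMs only gradually. The step function is a simplified illustration and not the authors' controller; in particular, `predicted_decrease` stands in for whatever the proactive part estimates, and all names are hypothetical.

```python
import math

SCALE = 50          # FIFA traces multiplied by 50 (from the slide)
VM_CAPACITY = 500   # requests one VM handles per time unit (from the slide)


def vms_needed(original_load):
    """VMs required for one time unit of the scaled trace."""
    return math.ceil(original_load * SCALE / VM_CAPACITY)


def ur_dp_step(allocated, needed, predicted_decrease):
    """One simplified UR-DP step (illustrative assumption).

    Scale up reactively: jump straight to the needed number of VMs.
    Scale down proactively: release at most predicted_decrease VMs,
    never going below what is currently needed.
    """
    if needed > allocated:
        return needed
    return max(needed, allocated - max(0, predicted_decrease))


print(vms_needed(100), ur_dp_step(10, 12, 0), ur_dp_step(10, 6, 2))
```

Releasing only part of the surplus per step is what keeps VMs around "just in case", at the price of some over-provisioning.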
Best combination (cont.)
- Baseline: UR-DR
  - 1.63% under-provisioning
  - 1.4% over-provisioning
- UR-DP_1
  - 0.41% under-provisioning (1.63% for UR-DR)
  - 9.44% over-provisioning (1.4% for UR-DR)
- UR-DP_2
  - 0.18% under-provisioning (1.63% for UR-DR)
  - 14.33% over-provisioning (1.4% for UR-DR)
Stability w.r.t. workload size
- Multiplied FIFA traces by X = 10, 20, ..., 60
- Assume that 1 VM handles 10*X requests/s
- Studied UR-DR, UR-DP_1, UR-DP_2
[Figures: under-provisioning and over-provisioning]
Conclusions:
- Reactive stable (no surprise)
- Proactive controller prediction quality varies with workload
- Error in over-provisioning grows slower than workload size
Comparison with regression
- Regression-based control:
  - Scale up: reactively
  - Scale down: regression
  - 2nd order regression based on full workload history
- Evaluation on selected (nasty) part of FIFA trace
  - UR-DR: 2.99% under-provisioning, 19.57% over-provisioning
  - UR-D_Regression: 2.24% under-provisioning, 47% over-provisioning
  - UR-DP_1: 1.51% under-provisioning, 32.24% over-provisioning
  - UR-DP_2: 1.07% under-provisioning, 39.75% over-provisioning
- Controller performance (execution time)
  - Regression: 0.98 s on average, up to 6.5 s observed
  - Our approach: 0.6 ms on average
Conclusions
- P-control promising approach to cloud elasticity
  - Accurate predictions
  - Rapid: controller execution time in ms
  - Robust: copes with changes in workload dynamics
- No one-size-fits-all controller
  - Tradeoff between over- and under-provisioning
  - Costs for SLA violation (under-provisioning) and resource wastage
    (over-provisioning) decide which strategy to use
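How the cost tradeoff selects a strategy can be shown with a small worked example that uses the under- and over-provisioning percentages reported for the scaled FIFA trace. The linear cost model and the weights are hypothetical and serve only to illustrate the point; they are not part of the evaluation.

```python
# (under-provisioning %, over-provisioning %) as reported for the
# FIFA traces scaled by 50.
RESULTS = {
    "UR-DR":   (1.63, 1.40),
    "UR-DP_1": (0.41, 9.44),
    "UR-DP_2": (0.18, 14.33),
}


def cheapest_strategy(cost_under, cost_over):
    """Pick the controller with the lowest weighted cost.

    cost_under and cost_over are hypothetical prices per percentage point
    of SLA violation and of wasted resources, respectively.
    """
    def cost(name):
        under, over = RESULTS[name]
        return cost_under * under + cost_over * over
    return min(RESULTS, key=cost)


for weight in (1, 10, 100):
    print(weight, cheapest_strategy(cost_under=weight, cost_over=1))
```

As SLA violations become more expensive relative to wasted resources, the preferred controller moves from the reactive baseline to the more conservative proactive scale-down variants.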