ScienceCloud 2012 presentation

Efficient Provisioning of Bursty Scientific Workloads on the Cloud Using Adaptive Elasticity Control

Ahmed Ali-Eldin, Johan Tordsson, and Erik Elmroth
Department of Computing Science
Umeå University, Sweden
www.cloudresearch.se

Maria Kihl
Lund Center for Control of Complex Engineering Systems
Lund University, Sweden
http://www.lccc.lth.se/

Context
eSSENCE is a strategic collaborative research programme in e-science between Uppsala University, Lund University and Umeå University. eSSENCE strives to create a research environment that enables a strong interplay between e-science research, e-infrastructures, e-education, industry and society.

Motivation & Problem definition
The cloud elasticity problem
- How much capacity to (de)allocate to a cloud service (and when)?
  - Bursty and unknown workload
  - Increase ability to meet SLAs
  - Reduce resource usage
- One of the limitations identified by Truong et al. [1] to the wide adoption of the cloud paradigm for scientific computing.

Problem Description
- Prediction of load/signal/future is not a new problem
- Studied extensively within many disciplines
  - Time series analysis
  - Econometrics
  - Control theory
  - Stock markets
  - Biology, etc.
- Multiple solutions proposed to the prediction problem
  - Neural networks
  - Fuzzy logic
  - Adaptive control
  - Regression
  - Kriging models
- However, the solution must be suitable for our problem

Requirements
- Vary capacity allocated to a service
  - According to current and future load
  - Fulfill QoS requirements to meet SLAs
  - Without costly over-provisioning
- Robustness
  - Avoid oscillations or behavioral changes
- Scalability
  - Tens of thousands of servers + even more VMs
- Adaptive to changing workloads
  - PID controllers reliable for certain load patterns, but unstable once the load or system dynamics change
- Fast
  - Limited look-ahead control accurate but too slow
  - Can take 30 min to control 15 servers and 60 VMs
- Simplicity
  - Key to adoption

Our approach: Adaptive Hybrid control
- Closed loop control
- Adaptive control: P-controller
  - Adjust error signal by gain parameter
  - Error signal is the difference between current and desired output
  - Change signal adjustments with load dynamics
- Hybrid control, a controller that combines
  - Reactive control (step controller)
  - Proactive control (proportional, P-controller)
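The proportional (P) controller named on this slide scales an error signal by a gain. A minimal sketch, assuming the error is taken as desired minus current output; the function name and the numbers are illustrative and are not taken from the authors' implementation:

```python
def p_control(desired: float, current: float, gain: float) -> float:
    """Return the control signal of a proportional controller."""
    error = desired - current  # error signal: gap between desired and current output
    return gain * error        # adjustment proportional to the error


# Hypothetical numbers: capacity for 1200 req/s is desired,
# capacity for 1000 req/s is currently provided, and the gain is 0.5.
adjustment = p_control(desired=1200.0, current=1000.0, gain=0.5)
print(adjustment)  # 100.0
```

In the adaptive variant described on the slide, the gain is not a constant but is recomputed as the load dynamics change.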

Initial model and assumptions
- Service with homogeneous requests
- Short requests that take one time unit (or less) to serve
- VM startup time is negligible
- Delayed requests are dropped
- VM capacity constant
- Infrastructure modeled as G/G/N queue
  - N (#VMs) varies over time
- Perfect load balancing assumed

A. Ali-Eldin, J. Tordsson, and E. Elmroth. An adaptive hybrid elasticity controller for cloud infrastructures. In NOMS 2012, IEEE/IFIP Network Operations and Management Symposium. IEEE, 2012.

Model and assumptions
Assumptions:
- Homogeneous requests
- Short requests that take one time unit (or less)
- Machine startup time is negligible
- Delayed requests are dropped
- Constant machine capacity
- Infrastructure modeled as G/G/N queue
  - N (#VMs) varies over time
- Perfect load balancing assumed

Our approach (cont.)
Adaptive control (cont.)
- How to estimate change in workload?

F = C * P

Two gain parameter alternatives studied:
1. Periodical rate of change:
   P = Load change / avg. rate in last time window
   Denoted P_1 henceforth
2. Ratio of load change over average system rate:
   P = Load change / avg. rate over all time
   Denoted P_2 henceforth
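The two gain alternatives differ only in the history they average over, and either one can be plugged into F = C * P. A small worked sketch, assuming F is the estimated load change and C the average capacity in the last time window; the sample rates, the fixed window of three samples, and all names are hypothetical:

```python
def gain(load_change: float, rates: list[float]) -> float:
    """P = load change / average rate over the given history."""
    return load_change / (sum(rates) / len(rates))


def estimated_change(avg_capacity: float, p: float) -> float:
    """F = C * P."""
    return avg_capacity * p


# Hypothetical per-time-unit rates; the "last time window" is the last 3 samples.
rates = [100.0, 100.0, 200.0, 200.0, 200.0]
load_change = 50.0   # assumed observed change in load
c = 200.0            # assumed average capacity in the last time window

p1 = gain(load_change, rates[-3:])  # P_1: window average 200 -> 0.25
p2 = gain(load_change, rates)       # P_2: all-time average 160 -> 0.3125

print(estimated_change(c, p1))  # 50.0
print(estimated_change(c, p2))  # 62.5
```

The slides describe a window whose size changes dynamically; the fixed window here is a simplification.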

In F = C * P:
- F: estimated load change
- C: average capacity in last time window
  - Window size changes dynamically
  - Smaller upon prediction errors
  - A tolerance level decides how often the window is resized
- P: gain parameter

Hybrid control (cont.)
- All in all, 9 approaches for scale up (U) and scale down (D)
- Reactively (R) and/or Proactively (P)
  - UR combined with: DR, DP, DRP
  - UP combined with: DR, DP, DRP
  - URP combined with: DR, DP, DRP

Notation in the following: URP-DP
- Scale up: reactive + proactive
- Scale down: proactive
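The nine combinations and the label notation can be enumerated mechanically. A short sketch; the helper name `describe` is an illustrative assumption:

```python
from itertools import product

MODES = ["R", "P", "RP"]  # reactive, proactive, reactive + proactive
NAMES = {"R": "reactive", "P": "proactive", "RP": "reactive + proactive"}

# Every scale-up mode paired with every scale-down mode.
combos = [f"U{up}-D{down}" for up, down in product(MODES, MODES)]


def describe(label: str) -> str:
    """Expand a label such as 'URP-DP' into words."""
    up, down = label.split("-")
    return f"scale up: {NAMES[up[1:]]}; scale down: {NAMES[down[1:]]}"


print(len(combos))         # 9
print(describe("URP-DP"))  # scale up: reactive + proactive; scale down: proactive
```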

Performance Evaluation
- Simulation-based evaluations
- 3 aspects studied
  - Best combination of reactive and proactive controllers
  - Controller stability w.r.t. workload size
  - Comparison with state-of-the-art controller
    - Regression control [Iqbal et al., FGCS 2011]

Performance metrics
- Over-provisioning: VMs allocated but not needed
- Under-provisioning: VMs needed but not allocated (SLA violation)
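Both metrics can be computed from two per-time-unit series: VMs required and VMs allocated. A minimal sketch; normalizing by the total number of required VMs to obtain percentages is an assumption, since the slides do not state how the percentages are formed, and the sample series are made up:

```python
def provisioning_errors(required: list[int], allocated: list[int]) -> tuple[float, float]:
    """Return (over-provisioning %, under-provisioning %) of total required VMs."""
    over = sum(max(a - r, 0) for r, a in zip(required, allocated))   # allocated but not needed
    under = sum(max(r - a, 0) for r, a in zip(required, allocated))  # needed but not allocated
    total = sum(required)
    return 100.0 * over / total, 100.0 * under / total


# Hypothetical series, one entry per time unit.
required = [4, 6, 10, 8, 4]
allocated = [4, 6, 8, 9, 6]

over_pct, under_pct = provisioning_errors(required, allocated)
print(over_pct, under_pct)  # 9.375 6.25
```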

Studied workload
- FIFA98 traces
  - ~3 months of Web server traces (bursty)
  - Requests grouped per second of arrival
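Grouping requests per second of arrival amounts to bucketing arrival timestamps by whole second. A sketch with made-up timestamps; it does not assume anything about the actual trace file format:

```python
from collections import Counter


def requests_per_second(timestamps: list[float]) -> Counter:
    """Count requests per whole second of arrival (non-negative timestamps)."""
    return Counter(int(t) for t in timestamps)


# Hypothetical arrival times in seconds since the start of the trace.
arrivals = [0.10, 0.75, 1.20, 3.05, 3.40, 3.99]
load = requests_per_second(arrivals)

print(sorted(load.items()))  # [(0, 2), (1, 1), (3, 3)]
print(load[2])               # 0, no request arrived during second 2
```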

Best controller combination
- Scaled FIFA traces x 50
  - Reasonable Internet growth, 1998 -> today
- Assume that 1 VM handles 500 requests
  - Reasonable for DB-backend Web servers
- Studied (for the sake of completeness) all 9 combinations of reactive + proactive controller
- Some make no sense and indeed performed poorly:
  - Reactive scale down causes oscillations and a lot of under-provisioning (SLA violations)
  - Pure proactive scale up tends to skew and cause under-provisioning
- Other approaches more promising:
  - Reactive scale up
    - Fast reaction to load increases, no skew
  - Proactive scale down
    - Keep VMs for a while (just in case) once they are allocated
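With the trace scaled by 50 and one VM handling 500 requests, the number of VMs a given trace value requires is a ceiling division. A small arithmetic sketch; treating the capacity as requests per second and the sample trace values are assumptions:

```python
import math

SCALE = 50         # trace multiplier from the slide
VM_CAPACITY = 500  # requests one VM handles (assumed to be per second)


def vms_needed(trace_requests: int) -> int:
    """VMs required to serve one second of the scaled trace."""
    return math.ceil(trace_requests * SCALE / VM_CAPACITY)


print(vms_needed(37))   # 37 * 50 = 1850 requests -> 4 VMs
print(vms_needed(100))  # 100 * 50 = 5000 requests -> 10 VMs
```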

Best combination (cont.)
Baseline: UR-DR
- 1.63% under-provisioning
- 1.4% over-provisioning

Best combination (cont.)
UR-DP_1
- 0.41% under-provisioning (1.63% for UR-DR)
- 9.44% over-provisioning (1.4% for UR-DR)

Best combination (cont.)
UR-DP_2
- 0.18% under-provisioning (1.63% for UR-DR)
- 14.33% over-provisioning (1.4% for UR-DR)

Stability w.r.t. workload size
- Multiplied FIFA traces by X = 10, 20, ..., 60
- Assume that 1 VM handles 10*X requests/s
- Studied UR-DR, UR-DP_1, UR-DP_2

[Figures: under-provisioning and over-provisioning for the studied controllers]

Conclusions:
- Reactive stable (no surprise)
- Proactive controller prediction quality varies with workload
- Error in over-provisioning grows slower than workload size

Comparison with regression
- Regression-based control:
  - Scale up: reactively; scale down: regression
  - 2nd order regression based on full workload history
- Evaluation on selected (nasty) part of FIFA trace
  - UR-DR: 2.99% under-provisioning, 19.57% over-prov.
  - UR-D_Regression: 2.24% under-provisioning, 47% over-prov.
  - UR-DP_1: 1.51% under-provisioning, 32.24% over-prov.
  - UR-DP_2: 1.07% under-provisioning, 39.75% over-prov.
- Controller performance (execution time)
  - Regression: 0.98 s on average, up to 6.5 s observed
  - Our approach: 0.6 ms on average

Conclusions
- P-control promising approach to cloud elasticity
  - Accurate predictions
  - Rapid
    - Controller execution time in ms
  - Robust
    - Copes with changes in workload dynamics

- No one-size-fits-all controller
  - Tradeoff between over- and under-provisioning
  - Costs for SLA violation (under-provisioning) and resource wastage (over-provisioning) decide which strategy to use
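The tradeoff can be made concrete by weighting the percentages reported for UR-DR, UR-DP_1 and UR-DP_2 on the scaled FIFA trace. A sketch under the assumption of a linear cost model; the cost weights are hypothetical and only the percentages come from the slides:

```python
# (under-provisioning %, over-provisioning %) as reported for the scaled FIFA trace.
RESULTS = {
    "UR-DR": (1.63, 1.40),
    "UR-DP_1": (0.41, 9.44),
    "UR-DP_2": (0.18, 14.33),
}


def cheapest(cost_under: float, cost_over: float) -> str:
    """Pick the controller with the lowest weighted cost (assumed linear model)."""
    def total(name: str) -> float:
        under, over = RESULTS[name]
        return cost_under * under + cost_over * over
    return min(RESULTS, key=total)


print(cheapest(1, 1))   # UR-DR: equal costs favour the reactive baseline
print(cheapest(10, 1))  # UR-DP_1
print(cheapest(50, 1))  # UR-DP_2: very costly SLA violations favour more headroom
```

Under this model, the more an SLA violation costs relative to an idle VM, the more the choice shifts toward the controllers that over-provision.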
