Probabilistic Adaptive Load Balancing for Parallel Queries Daniel M. Yellin* Jorge Buenabad-Chávez** Norman W. Paton*** *IBM Israel Software Lab ** Centro de Investigacion y de Estudios Avanzados del IPN, Mexico *** University of Manchester
Autonomic Computing
• Autonomic computing provides a general framework for adaptive systems: the MAPE loop (monitor, analyze, plan, execute)
• … but getting the details right is tough
• When a trend is sensed, when to adapt?
– Not too early, not too late
• What to adapt to?
– Adapting in the wrong way can make things worse!
Our problem
• Given:
– A computational system S that can operate in one of several (possibly infinitely many) modes, m1, m2, ...
– Each mode is optimized for a particular workload
• Goal:
– Monitor the existing workload and decide when to adapt S to a different mode, optimized for the current (or predicted) workload
• Considerations:
– Risk: There is no guarantee that the future workload will be similar to the current workload
– Cost: Each time we change the mode of the system S from mi to mj, we incur a cost. Switching modes can be expensive!
[Figure: modes m1 to m5; moving from one mode to another entails a cost]
Example: Pub-sub systems
Given: S = pub-sub system, including a server and a set of clients
Modes = {cache a particular data item on a client, store a particular data item on the server}
Goal: Monitor the access patterns of clients and decide when to move a data item from the server to a client, or from a client back to the server
Daniel M. Yellin: Competitive algorithms for the dynamic selection of component implementations. IBM Systems Journal 42(1): 85-97 (2003).
[Figure: a server and two clients (Client 1, Client 2) with data items d1, d2, d3, …; access streams "read d3, read d3, ..." and "write d3, write d3, ..."]
Example: Data type implementation
Given:
S = abstract data type with multiple implementations, each optimized for specific sorts of operations
Goal:
Monitor the operations on S and decide when to switch from one implementation to another
[Figure: a component whose data can be served by impl1, where requests of type X are faster using key K1, or by impl2, where requests of type Y are faster using key K2]
Our approach
1. Monitor existing workload and response times
2. Determine (a finite number of) modes to consider for adaptation
3. Determine the likelihood of (a finite number of) workloads in the immediate future
4. For each relevant mode, compute the expected cost (EC) of switching to that mode, based upon the probability of different workloads and the cost of processing each workload in that mode
Note: the cost of adaptation (SwitchCost) is included in EC
Adaptive Query Processing
• A query optimiser, given a query and information on the data involved and the environment in which the query is to be run, proposes an execution plan for that query that is predicted to yield the best response time.
• If the information used by the optimiser is misleading (e.g. partial, incorrect, out-of-date or subject to change during query evaluation), the execution plan chosen by the optimiser may be inappropriate.
• In Adaptive Query Processing, the execution plan is modified at query runtime, on the basis of feedback received from the environment.
Adaptation for Load Balancing
• In partitioned parallelism, a task is divided into subtasks that are run in parallel on different nodes.
• For a join, A⋈B is represented as the union of the results of plan fragments Fi = Ai ⋈Bi , for i = 1..P, where P is the level of parallelism.
• The time taken to evaluate the join is max(evaluation_time(Fi )), for i = 1..P.
• As a result, any delay in completing a fragment Fi delays the completion of the operator, so it is crucial to match fragment size to node capabilities.
• Most join algorithms have state; as a result, changing the size of a fragment allocated to a machine involves replicating or relocating operator state.
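The effect of the max() rule above can be illustrated with a small sketch. The function name and the per-fragment times are hypothetical values chosen for illustration; they are not taken from the experiments.

```python
def join_completion_time(fragment_times):
    """Completion time of a partitioned join: the slowest fragment decides."""
    if not fragment_times:
        raise ValueError("need at least one fragment")
    return max(fragment_times)


# Hypothetical evaluation times (seconds) for P = 3 fragments.
balanced = [4.0, 4.0, 4.0]   # work matched to node capabilities
skewed = [2.0, 3.0, 7.0]     # same total work, but one overloaded node

print(join_completion_time(balanced))
print(join_completion_time(skewed))
```

Both allocations do 12 seconds of work in total, but the skewed one finishes later because the operator waits for its slowest fragment.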
Flux*
• When load imbalance is detected:
– Halt query execution.
– Compute new distribution policy (dp).
– Update hash tables by transferring data between nodes.
– Update dp in parent exchange nodes.
– Resume query execution.
• Many variations of this technique exist and have been compared.**
[Figure: Scan(A) with distribution policy dp, feeding Join(A1,B1) with hash table A1 and Join(A2,B2) with hash table A2]
* M. Shah, J.M. Hellerstein, S. Chandrasekaran, M.J. Franklin, Flux: An Adaptive Partitioning Operator for Continuous Query Systems. ICDE 2003.
** Paton, N.W., Raman, V., Swart, G. and Narang, I., Autonomic Query Parallelization using Non-dedicated Computers: An Evaluation of Adaptivity Options, Proc. 3rd IEEE Intl. Conf. on Autonomic Computing, 2006.
Heuristics used by Flux
• Units of adaptation: the table is divided into partitions, and each node can gain or lose at most one partition during an adaptation
• Scale of adaptation: at most half the partitions can be moved at any adaptation
• Frequency of adaptation: once an adaptation takes place and takes time s, no further adaptation until after time s
• Timing of adaptation: applies heuristics to determine when to transfer partition from over-utilized processor to under-utilized processor
A brief review …
Our algorithm is based upon the concept of mathematical expectation.
If the probabilities of obtaining the amounts a1, a2, ..., ak are p1, p2, ..., pk, where p1 + p2 + ... + pk = 1, then the mathematical expectation is:
E = a1 * p1 + a2 * p2 + ... + ak * pk
For example, if we win $10 when a die comes up 1 or 6, and lose $5 when it comes up 2, 3, 4 or 5, our mathematical expectation is:
E = 10 * (2/6) + (-5) * (4/6) = 0
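The die example can be checked with a few lines of Python. This is a minimal sketch; the function name `expectation` is an illustrative choice, and exact fractions are used so that the result is not affected by floating-point rounding.

```python
import math
from fractions import Fraction


def expectation(amounts, probabilities):
    """E = a1*p1 + a2*p2 + ... + ak*pk, for probabilities that sum to 1."""
    if len(amounts) != len(probabilities):
        raise ValueError("one probability is needed per amount")
    if not math.isclose(sum(probabilities), 1):
        raise ValueError("probabilities must sum to 1")
    return sum(a * p for a, p in zip(amounts, probabilities))


# Win $10 on 1 or 6 (2 faces out of 6), lose $5 on 2, 3, 4 or 5 (4 faces out of 6).
die_game = expectation([10, -5], [Fraction(2, 6), Fraction(4, 6)])
print(die_game)
```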
Moving from heuristics to evidence-based decision making
Define the notion of expected cost (EC) of using a particular distribution policy dp
– EC of dp is the cost of processing the parallel query using dp, given that in the future we will have actual workloads w1, w2, … with probabilities p1, p2, …
EC(dp) = cost(dp,w1)*p1 + cost(dp,w2)*p2 + … + SwitchCost(current,dp)
– In practice, we only consider two workloads and two distribution policies: the currently used dp and the “optimal” one obtained from monitored workloads
cost(dp,w1) measures how much longer dp would take to finish processing w1 than the optimal distribution policy would. See paper for details.
SwitchCost is not present if dp is the currently used distribution policy
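A minimal sketch of the EC formula for the two-policy, two-workload case follows. The policy names, workload names, cost table, probabilities and switch cost are all hypothetical numbers invented for illustration; the real cost model is described in the paper.

```python
def expected_cost(dp, current_dp, workloads, cost, switch_cost):
    """EC(dp) = sum of cost(dp, w_i) * p_i, plus SwitchCost(current, dp).

    workloads is a list of (workload, probability) pairs. SwitchCost is
    added only when dp differs from the policy currently in use.
    """
    ec = sum(cost(dp, w) * p for w, p in workloads)
    if dp != current_dp:
        ec += switch_cost(current_dp, dp)
    return ec


# Hypothetical extra processing time (seconds) relative to the optimal policy.
COST_TABLE = {
    ("uniform", "w_even"): 0.0, ("uniform", "w_skew"): 6.0,
    ("skewed", "w_even"): 4.0, ("skewed", "w_skew"): 0.0,
}
workloads = [("w_even", 0.25), ("w_skew", 0.75)]  # assumed probabilities


def cost(dp, w):
    return COST_TABLE[(dp, w)]


def switch_cost(old_dp, new_dp):
    return 1.0  # assumed fixed cost of moving operator state


ec_stay = expected_cost("uniform", "uniform", workloads, cost, switch_cost)
ec_move = expected_cost("skewed", "uniform", workloads, cost, switch_cost)
print(ec_stay, ec_move)
```

With these numbers, staying on the uniform policy has the higher expected cost even after the switch cost is charged to the alternative, so adapting would be the preferred choice.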
Probabilistic Delta Algorithm
Initialize current_dp   // initially distribute uniformly
TimeToSwitch = False
while (not TimeToSwitch)
    Process next portion of query
    Compute preferred_dp   // “ideal” distribution
    ecNoChange = EC_NoChange(current_dp, preferred_dp, count)   // does not include SwitchCost
    ecChange = EC_Change(preferred_dp, current_dp, count)   // includes SwitchCost
    if ecNoChange >= ecChange
        TimeToSwitch = True
endwhile
current_dp = preferred_dp
Adapt to preferred_dp
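The control flow of the algorithm can be sketched in Python as below. This is not the authors' implementation: the two expected-cost functions are passed in as callables, `count` is assumed to be the number of query portions processed so far, and the stub cost functions in the usage example are hypothetical.

```python
def probabilistic_delta(portions, current_dp, compute_preferred_dp,
                        ec_no_change, ec_change):
    """Process query portions until switching is expected to cost no more
    than staying. Returns (dp in use, portions processed, switched?)."""
    count = 0
    for portion in portions:                  # "process next portion of query"
        count += 1
        preferred_dp = compute_preferred_dp(portion)
        stay = ec_no_change(current_dp, preferred_dp, count)   # no SwitchCost
        move = ec_change(preferred_dp, current_dp, count)      # with SwitchCost
        if stay >= move:
            return preferred_dp, count, True  # adapt to preferred_dp
    return current_dp, count, False           # query ended without adapting


# Hypothetical stubs: the cost of staying grows while the imbalance persists,
# and the cost of changing is dominated by a fixed switch cost.
def observed_ideal(portion):
    return portion


def stay_cost(current_dp, preferred_dp, count):
    return 1.0 * count


def change_cost(preferred_dp, current_dp, count):
    return 3.0


result = probabilistic_delta(["skewed"] * 5, "uniform",
                             observed_ideal, stay_cost, change_cost)
print(result)
```

In this toy run the first two portions are processed without adapting, because the expected cost of staying is still below the expected cost of changing; the switch happens on the third portion.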
Computing probabilities of future workloads
1. Let n_c be the number of workloads in the window that are most similar to current_dp
2. Let n_p be the number of workloads in the window that are most similar to preferred_dp
3. Let n_w be the total number of time units in the window.
4. prob(preferred_dp) = n_p / n_w and prob(current_dp)= n_c / n_w
Note: can use more sophisticated techniques; e.g., weight the workloads based on
“proximity” to current time
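The counting scheme above can be sketched as follows. Several details are assumptions made for illustration: one workload observation is recorded per time unit (so n_w is the window length), a workload is represented by the fraction of work each node can take, similarity is measured by L1 distance, and ties are assigned to current_dp.

```python
def l1_distance(a, b):
    """Sum of absolute differences between two equal-length tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def window_probabilities(window, current_dp, preferred_dp, distance=l1_distance):
    """Return (prob(current_dp), prob(preferred_dp)) from a window of workloads."""
    n_w = len(window)
    if n_w == 0:
        raise ValueError("window must contain at least one observation")
    # n_p counts the workloads that are closer to preferred_dp than to current_dp.
    n_p = sum(1 for w in window
              if distance(w, preferred_dp) < distance(w, current_dp))
    n_c = n_w - n_p
    return n_c / n_w, n_p / n_w


current_dp = (0.34, 0.33, 0.33)     # roughly uniform over three nodes
preferred_dp = (0.20, 0.40, 0.40)   # node 1 is loaded, so it gets less work
window = [
    (0.33, 0.33, 0.34),
    (0.20, 0.40, 0.40),
    (0.22, 0.39, 0.39),
    (0.21, 0.40, 0.39),
]
probs = window_probabilities(window, current_dp, preferred_dp)
print(probs)
```

Weighting recent observations more heavily, as the note suggests, would replace the plain counts with weighted sums.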
Experiment Setup (Simulator)
• Cost model parameters: drawn from micro benchmarks
• Database from TPC-H benchmark.
• As the number of nodes grows, the data is assumed to be striped over the available machines.
• All machines are assumed to have the same capabilities, and to be sharing the same network.
• Experiments use Q1: P⋈PS (P has 200,000 tuples, PS has 800,000 tuples).
Same as in: Autonomic Query Parallelization using Non-dedicated Computers: An Evaluation of Adaptivity Options, The Very Large Data Bases Journal, N. W. Paton, J. Buenabad-Chavez, M. Chen, V. Raman, G. Swart, I. Narang, D. M. Yellin, and A.A.A. Fernandez. To VLDB
Experiments
• Periodic imbalance: The load on one or more of the machines comes and goes during the experiment. The level, duration, and repeat duration of the external load are varied.
• Poisson imbalance: The arrival rate of jobs follows a Poisson distribution in which the average number of jobs starting per second varies.
• “Cyclic Poisson” imbalance: Like Poisson imbalance, except that the average workload is not constant but changes over time in a cyclic fashion (like a sine wave). This tries to mimic more realistic workloads that change over time.
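One way to generate a “cyclic Poisson” external load is sketched below. The sinusoidal form of the rate, the parameter names (`base_rate`, `amplitude`, `period`) and the per-second sampling are assumptions made for illustration; the simulator's actual load generator may differ.

```python
import math
import random


def cyclic_rate(t, base_rate, amplitude, period):
    """Mean number of external jobs starting per second at time t (seconds)."""
    if not 0.0 <= amplitude <= 1.0:
        raise ValueError("amplitude must be in [0, 1] so the rate stays non-negative")
    return base_rate * (1.0 + amplitude * math.sin(2.0 * math.pi * t / period))


def poisson_sample(rng, mean):
    """Draw one Poisson-distributed count using Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def job_arrivals(seconds, base_rate, amplitude, period, seed=0):
    """Number of external jobs starting in each one-second interval."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    return [poisson_sample(rng, cyclic_rate(t, base_rate, amplitude, period))
            for t in range(seconds)]


arrivals = job_arrivals(seconds=10, base_rate=2.0, amplitude=0.5, period=5.0)
print(arrivals)
```

Setting `amplitude` to 0 gives a constant mean, which corresponds to the plain Poisson imbalance case.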
Periodic load imbalance
Parallelism level = 3
Single node affected
Duration and repeat duration of load spike = 1s
Level of imbalance = average number of external jobs introduced
PD (Probabilistic Delta) is more conservative in deciding to adapt
Current dp = 0 means an adaptation is taking place
PD adjusts only once to the periodic increased load on node 1
Expected cost of adaptation is greater than expected cost of sticking with the current distribution
For the previous experiment with level of imbalance = 6
Each node starts with 1/3 of the workload, but nodes 2 and 3 gain workload over time
Poisson load imbalance
Parallelism level = 3
Single node affected
Duration of load spike = 1s
“Poisson cyclic” load imbalance
Parallelism level = 3
Duration of Cycle = 5s
load spike = 1s
Future work
• Our approach is sensitive to window size. What is the best window size to use?
• Investigate better techniques for computing the probability of future workloads
• Use more than just two alternative distribution policies; e.g., can we infer a trend and use a “predicted distribution policy”?
• Test the algorithm on a real system, not only with a simulator
Conclusions
• We investigated replacing heuristics with a more fundamental approach for determining when to adapt the system
• Initial experiments showed that using the Probabilistic Delta (expected cost) algorithm for determining when to adapt usually improved on existing approaches, sometimes significantly
• The gain of this approach is due to inhibiting specious adaptations while still encouraging necessary adaptations
Backup slides
Adaptive Parallel Queries
A distribution policy describes how we partition work between processors