Dynamic Load Balancing in Parallel and Distributed Networks
by Random Matchings
(Extended Abstract)
Bhaskar Ghosh*
Abstract
The fundamental problems in dynamic load balancing
and job scheduling in parallel and distributed comput-
ers involve moving load between processors. In this
paper, we consider a new model for load movement
in synchronous parallel and distributed machines. In
each step of our model, each processor can transfer
load to at most one neighbor; also, any amount of
load can be moved along a communication link be-
tween two processors in one step. This is a reason-
able model for load movement in significant classes of
dynamic load balancing problems.
We derive efficient algorithms for a number of task
reallocation problems under our model of load move-
ment. These include dynamic load balancing on processor networks, adaptive mesh re-partitioning such as
those in finite element methods, and progressive job
migration under dynamic generation and consump-
tion of load.
To obtain the above-mentioned results, we intro-
duce and solve the abstract problem of Incremental
Weight Migration (IWM) on arbitrary graphs. Our
main result is a simple, randomized, algorithm for this
problem which provably results in asymptotically op-
timal convergence towards the state where weights on
the nodes of the graph are all equal. This algorithm
*Department of Computer Science, Yale University, P. O. Box 208285, New Haven, CT 06520. Internet: [email protected].
Research supported by ONR under grant number 491-J-1576
and a Yale/IBM joint study.
†Courant Institute of Mathematical Sciences, New York
University, 251 Mercer Street, New York, NY 10012-1185, USA;
[email protected], (212) 998-3061. The research of this author was supported in part by NSF/DARPA under grant number
CCR-89-06949 and by NSF under grant number CCR-91-03953.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
SPAA 94 - 6/94 Cape May, N.J., USA © 1994 ACM 0-89791-671-9/94/0006..$3.50
S. Muthukrishnan†
utilizes an appropriate random set of edges forming
a matching. Our algorithm for the IWM problem is
used in deriving efficient algorithms for all the prob-
lems mentioned above.
Our results are very general. The algorithms we derive are local, and hence, scalable. They work for
arbitrary load distributions and for networks of arbitrary topology which can possibly undergo link failures. Of independent interest is our proof technique which we use to lower bound the convergence of our algorithms in terms of the eigenstructure of the underlying graph.

Finally, we present preliminary experimental results analyzing issues in load balancing related to our algorithms.
1 Introduction
Consider the following scenario of dynamic load balancing in a distributed setting. An application pro-
gram is running on a distributed network of arbitrary
topology comprising a large number of processors.
Each processor has a load of independent tasks to be executed. The distribution of tasks is dynamically determined; that is, the specific application program running on the machine cannot be developed with a priori estimates of the load distribution. The task of dynamic load balancing is to reallocate the tasks so that each processor has nearly the same amount of load. Of course, in natural settings, the scenario is substantially more demanding in that the tasks might be dynamically generated or consumed in each step and
additionally, the underlying topology might change owing to failures in communication links. Besides load balancing, scenarios such as the one above occur in several other guises, for example, in job scheduling, adaptive mesh partitioning, and resource allocation problems. (These guises will be explained more clearly later with examples.) In each of these guises, equitable load redistribution is critical for efficient implementation of algorithms on both distributed and parallel computers.
Standard models for dynamic load balancing make
the following assumptions. (See for example [AA+93,
LM93, R91].) In one time step, each processor can
migrate load to any (possibly all) of the other processors (possibly including non-neighbors). Also, at most one unit of load can be transferred across any link in a step. Under this model, dynamic load balancing has been extensively studied. The main focus of this paper concerns the amount of parallelism in this standard model. Specifically,
1. The standard model overestimates available parallelism in communication with neighbors. In practice, in each time step, each processor can communicate with only one other processor. That is, the communication with a set of neighboring processors is inherently sequential.
2. The standard model underestimates available parallelism in edge capacity. With increasingly high-bandwidth networks becoming available, a large amount of data can be transferred across a link in one time step. Therefore, it is reasonable to assume that several units of tasks can be migrated in the same message across a link in one step, provided moving each task incurs movement of only a reasonable amount of data. Indeed, there are large classes of important dynamic load balancing problems in which the tasks have small associated data space (examples are provided in Section 1.3).
Motivated by these observations, we study dynamic load balancing (and other guises in which it comes up) in distributed networks with unbounded edge capacity under the restriction that each processor can migrate load to at most one of its neighbors in one step. Our approach is to identify an abstract problem which we call the Incremental Weight Migration Problem on arbitrary graphs. This can be thought of as a single step of load migration across the entire network in parallel on our model. Our main result is an asymptotically optimal algorithm for this problem on our model. We utilize this in deriving efficient algorithms for many other problems, including dynamic load balancing, adaptive mesh partitioning, and dynamic job scheduling.

Our algorithms employ only local control and data communication; hence, they are scalable. Also, our results are very general, in that they hold for networks of arbitrary topology which can possibly undergo link failures during the execution of the algorithm.
1.1 Problem and Model
Incremental Weight Migration (IWM) Prob-
lem. Consider an undirected connected agent graph G = (V, E) with n vertices in which each vertex v_i, representing an agent A_i, has weight w_i. The potential Φ of the graph is defined to be (Σ_i w_i²) − n·w̄², where n is the number of nodes in G and w̄ = Σ_i w_i / n is the average weight.¹ The IWM problem is to determine a set of matching edges M (that is, no pair of edges in M shares an endpoint) and to specify, for each edge in M, a relocation of weights on its endpoints across the edge, such that the drop in the potential function is the maximum.
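As a quick illustration (ours, not the paper's), the potential can be computed directly from its definition; a minimal Python sketch, where the name `potential` and the list-based weight representation are our own choices:

```python
def potential(w):
    """Phi = (sum_i w_i^2) - n * wbar^2: the squared Euclidean
    distance between w and the perfectly balanced weight vector."""
    n = len(w)
    wbar = sum(w) / n
    return sum(x * x for x in w) - n * wbar * wbar
```

Note that this agrees with the alternative form `sum((x - wbar)**2 for x in w)`, and is zero exactly when all weights equal the average.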
Model. We note three characteristics of the model in the definition of the problem above. First, weight movements are local, i.e., any portion of the weight on a node which is moved ends up at a neighbor of that node. Second, any portion of the weight on an endpoint of an edge can be moved across the edge in one step. Third, each agent is involved in weight transfer with at most one of its neighbors.
Desirable Algorithmic Features. Naturally, we would like an algorithm of reasonable computational expense. This is particularly relevant since, traditionally, load balancing and scheduling algorithms trade off performance for running time. Importantly, we would like our algorithm to be fully local and distributed. This is critical because algorithms which need and rely on global information are expensive; additionally, they may not work when links in the underlying network graph fail, as may happen in practice for distributed networks.
1.2 Our Main Results
Our performance measure is the ratio of the drop in potential to the original potential; we denote this as the convergence factor. We assume that the agents (processors) work in lock-step.
A. Real Weight Case. First, consider the case when the weights are real. This implies that the weight on each agent can be subdivided to arbitrary precision. In this case, we design a simple, completely local, randomized algorithm for the IWM problem which has an expected convergence factor of at least c·λ₂/d. Here, λ₂ is the second smallest eigenvalue of the Laplacian of the agent graph G, d is the degree of G, and c is a constant (0 < c ≤ 1), independent of G.

It is easy to show that there exists an agent graph and an assignment of weights to its vertices such that no algorithm (even one which is randomized and which has global information) can have a convergence factor greater than c₂·λ₂/d, where c₂ is a constant. Therefore, our algorithm is asymptotically optimal.
¹Our potential function Φ is the square of the Euclidean distance between the weight vector ω = (w₁, …, w_n)ᵀ and the load-balanced vector (w̄, w̄, …, w̄)ᵀ. Note that Φ ≥ 0. The potential becomes zero only when the weights on the agents are all equal to w̄.
B. Discrete Weight Case. In real applications, the weights are discrete, i.e., each w_i is a collection of w_i unit tasks and a unit task cannot be divided further. We show that a simple modification of the algorithm in Case A also works when the weights are discrete. Its expected convergence factor is at least c·λ₂/d, where c is a constant (0 < c ≤ 1), provided the initial potential is Ω(n⁴).² This algorithm has an optimal convergence factor as well.

The discrete case is intrinsically harder than the real weight case when the potential is small (o(n⁴)), since we can show the following: there exists an agent graph and an assignment of weights to the agent nodes such that no algorithm can have a convergence factor of more than ε·λ₂/d, for any constant ε. In contrast, the algorithm in Case A has a convergence factor of Ω(λ₂/2d).³ On the other hand, when Φ = o(n⁴), we can show that our algorithm reduces Φ by an additive term rather than by a multiplicative factor.
Remark 1. It is worth noting that although weights are moved only along a subset (matching) of edges, the convergence bounds are in terms of global properties of the graph, namely, λ₂ and d. Note that for any connected graph, 0 < λ₂/(2d) ≤ 1. Thus, our algorithms guarantee a positive fractional (possibly non-constant) decrease in the potential for any connected graph.
Remark 2. The parameter λ₂ reflects the connectivity of the underlying graph. For a line graph, a d-dimensional mesh, a hypercube, a d-regular expander, and a clique on n nodes, the fraction λ₂/d roughly equals Θ(1/n²), Θ(1/n^{2/d}), Θ(1/log n), Θ(1), and Θ(1), respectively.
Remark 3. Our algorithms are extremely efficient, since each agent takes only O(d) time for control. Also, each agent performs data transfer across at most one edge in each time step.
Remark 4. The performance of our algorithm is guaranteed even if some edges disappear between successive time steps; in this case, d represents the degree of the graph at the beginning of the algorithm and λ₂ represents the second smallest eigenvalue of the Laplacian of the graph that remains at the end of the algorithm. The proof of this claim is omitted in this paper.
²In fact, we prove this convergence factor when the initial potential is Ω(dn/λ₂). Since λ₂ = Ω(1/n²) and d < n for any connected graph, it follows that the claimed convergence factor holds when the initial potential is Ω(n⁴).

³Throughout this paper, we use Ω(f), for a given function f, to denote cf for some constant c.
1.3 Applications and Our Other Re-
sults
We utilize our algorithm for the IWM problem to derive efficient algorithms for a variety of task reallocation problems. In what follows, we briefly introduce three major applications, and defer the other applications to our detailed paper.
1. We provide an efficient and first-known completely analyzable algorithm in our model for dynamic load balancing on arbitrary networks under possible link failure.

2. We provide the first-known analytical convergence result for abstract dynamic re-partitioning problems (e.g., re-partitioning an adaptively changing mesh such as those used in finite element methods).

3. We initiate a new paradigm of progressive task scheduling, in which, even under dynamic generation and consumption of load, at each task scheduling step a fractional progress is made towards the load-balanced state.
Dynamic Load Balancing. Given a processor network with arbitrary discrete weights on the vertices, the dynamic load balancing problem is to move the loads so as to have nearly the same amount of load on each processor. Dynamic load balancing has been studied in a number of settings. Almost all research has focused on algorithms for specific topologies and/or relied on global routing phases. A class of such research has involved performance analysis of load balancing algorithms by simulations [LMR91]. Among analytical results, load balancing for specific topologies under statistical assumptions on input load distributions has been studied [HCT89]. For arbitrary initial load distributions, load balancing has been studied in special topologies such as Counting Networks [AHS91, HLS92], Hypercubes [P89], Meshes [HT93], and Expanders [PU89]. These algorithms do not extend to arbitrary or dynamically changing topologies. For dynamically changing topologies, load balancing has been studied under assumptions on the pattern of failures for specific architectures [R89, AB92].
For arbitrary topologies, under the assumption that one load unit can be migrated across each edge in parallel and that each processor can communicate with all its neighbors in one step, [AA+93] presents an algorithm for dynamic load balancing which takes O(Δ·log(nΔ)/p) steps to approximately balance the loads. The approximation is within an additive term of d × diameter(G). Here p is the vertex expansion of the graph and Δ = max_i (w_i − w̄), where w̄ = Σ_i w_i / n. Their algorithm is optimal up to a log(nΔ) factor.
In our model of dynamic load balancing, load can be moved along only one edge from a processor in a time step, and there is no restriction on the amount of load that can be moved. This is an appropriate model when each task has a small associated data space; therefore, several tasks can be communicated in one time step. This is true for a large class of problems like fine-grain programs which spawn processes dynamically [GH89, K88], real-time data fusion problems [CA87, FG91], and game tree searches [F93].
Clearly, the dynamic load balancing problem for discrete weights can be solved by applying our algorithm for the IWM problem repeatedly. This algorithm balances load approximately in O((d/λ₂)(log Φ₀ + dn)) invocations of our algorithm for the IWM problem, where Φ₀ is the initial potential. The load balancing is approximate in the sense that our algorithm stops when, for each edge (i, j), |w_i − w_j| ≤ 1. Our algorithm works for arbitrary topologies under possible failure of links connecting the processors.
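One way the repeated-invocation scheme for discrete weights might be realized is sketched below. This is our own illustrative code, not the paper's: the names `discrete_lr_step` and `balance`, the edge-list graph format, and the cap on iterations are all assumptions. Each step picks a random matching as in the paper's Algorithm LR and moves ⌊(w_i − w_j)/2⌋ whole tasks across each matched edge, stopping once every edge satisfies |w_i − w_j| ≤ 1:

```python
import random

def discrete_lr_step(edges, degree, w, rng):
    """One discrete IWM step: random matching, then integral equalization."""
    p = 4 * degree
    # Step 1a: each edge is a candidate independently with probability 1/p.
    candidates = [e for e in edges if rng.random() < 1.0 / p]
    # Step 1b: keep only candidates sharing no endpoint with another candidate.
    count = {}
    for u, v in candidates:
        count[u] = count.get(u, 0) + 1
        count[v] = count.get(v, 0) + 1
    matching = [(u, v) for u, v in candidates if count[u] == 1 and count[v] == 1]
    # Step 2: move whole tasks only; a unit task cannot be divided further.
    for u, v in matching:
        i, j = (u, v) if w[u] >= w[v] else (v, u)
        delta = (w[i] - w[j]) // 2
        w[i] -= delta
        w[j] += delta

def balance(edges, degree, w, rng, max_steps=100000):
    """Repeat until every edge (i, j) satisfies |w_i - w_j| <= 1."""
    for step in range(max_steps):
        if all(abs(w[u] - w[v]) <= 1 for u, v in edges):
            return step
        discrete_lr_step(edges, degree, w, rng)
    return max_steps
```

Since an unbalanced matched edge always strictly lowers the potential and balanced edges are left untouched, the loop terminates with probability 1 on any connected graph.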
We remark that Cybenko [C89] considered a stronger version of our model by additionally allowing each processor to transfer load to all its neighbors in one time step. This work is of mathematical interest since it considers only the case when the weights are real.
Problem Re-Partitioning. Abstractly, assume each node in the agent graph corresponds to a partition or sub-domain of a global data domain. Each node in the agent graph is mapped to a processor. Due to dynamic computations at each processor, the sub-domains get refined, leading to a load imbalance in the size of each sub-domain. Re-partitioning of the domain becomes necessary to achieve load balance.

Such applications come up in various forms in the use of adaptive finite-element and finite-difference methods using either locally adaptive meshes or order of approximation, for example in h-p finite element methods, which are common in mechanical engineering and visualization software. In adaptive-mesh terminology, the agent graph representing sub-domain connectivity information is called the quotient graph. Achieving balanced sub-domains usually involves shifting the boundaries of adjoining sub-domains (i.e., across edges in the quotient graph) so as to equalize the data points in each sub-domain. Further references on these areas can be found in [BB87, HT93, W91].
Clearly, our algorithm for the IWM problem can be used repeatedly on the quotient graph to solve the problem of mesh partitioning. Note, however, that the actual migration of data points as determined by the application of our algorithm can be performed on the underlying architecture by either local communication (if adjoining sub-domains have been mapped to adjoining processors) or by non-local routing (if adjoining sub-domains have been mapped to non-adjoining processors).
Progressive Dynamic Task Scheduling. Consider a segment of a distributed execution, where tasks are generated and consumed in each step in an unpredictable manner at various nodes as the computation proceeds. We are required to schedule the tasks in each step by moving them to underloaded or idle processors so as to increase the throughput. This scenario arises in general-purpose distributed computing [LK87, NX+85] as well as in specific applications such as parallel branch-and-bound search on game trees [KZ88] and dynamic tree embedding on distributed or parallel architectures [LN+89, R91].
We initiate a new paradigm for these problems. For motivation, note that there are broadly two paradigms for task scheduling in this scenario. In one paradigm, the scheduling guarantees that each processor has at least one task to execute at the end of the step. In the other paradigm, the scheduling guarantees that all processors have roughly the same number of tasks at the end of the step. It is easy to see that in both these paradigms, there exists a sequence of load generation and consumption that forces any algorithm to resort to load movement between two non-neighboring processors to satisfy the guarantee on load distribution. Load movement between processors which are not neighbors is an expensive operation. We advocate the approach of restricting algorithms to only perform load movements between neighbors, but requiring a guarantee of reasonable progress towards the load-balanced state. In our case, the reasonable progress is a decrease in the distance to the load-balanced state (formalized as a potential function) by a multiplicative factor. We refer to this as progressive dynamic task scheduling.
Using our algorithm for the IWM problem once each step in this case of dynamic load migration, with loads generated and consumed each step, we can guarantee that the potential drops by at least a λ₂/(16d) factor in the expected case in each step, as long as the potential is large. Note that our algorithm is the first known algorithm to make such a guarantee. In a related work [AA+93] (under the weaker assumption that tasks are not dynamically generated or consumed), a distance function (different from ours) is used to measure the progress towards the load-balanced state over several steps. However, they cannot guarantee a fractional decrease in the distance in every step, since their argument involves amortization of the decrease in the distance over several steps.
1.4 Our Techniques
For intuition, consider solving the IWM problem with real weights. Given a graph which is not balanced,
we can always pick an edge (i, j) and equalize the weights across its endpoints. This provably decreases Φ, since the reduction in Φ is (w_i² + w_j²) − 2·((w_i + w_j)/2)², which is (w_i − w_j)²/2 ≥ 0. We speed up this process by balancing along a matching set of edges in parallel. Note that a set of matching edges can be obtained in several ways. For example, edge-coloring the input graph gives us a set of matchings, where each color defines a matching. Alternatively, given graph G, we can explicitly compute the matching which gives the maximum potential drop. All these schemes require expensive computation of global information; also, they may not work when some edges disappear.
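The per-edge potential drop from equalizing one edge is easy to check numerically; a tiny sketch under our own naming (`equalize_drop` is not from the paper):

```python
def equalize_drop(wi, wj):
    """Drop in Phi when one edge's endpoint weights are averaged."""
    before = wi * wi + wj * wj
    avg = (wi + wj) / 2.0
    after = 2.0 * avg * avg
    return before - after  # algebraically equals (wi - wj)**2 / 2
```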
In our algorithm, we choose a random set of matching edges locally. The manner in which the random matching is chosen ensures that there is a global lower bound on the probability of each edge appearing in the matching. This property ensures global convergence bounds. For choosing such a random matching, we draw upon intuition from the very sparse phase in the evolution of random graphs [B87].
There appears to be some connection between our techniques for analyzing our algorithm and those used in analyzing the rapid-mixing properties of Markov chains [M89, AA+93]. However, no formal connection is known to us.
1.5 Organization of the Paper
The IWM problem for real weights is solved in Section 3. We extend this solution in Section 4 to the case when the weights are discrete. In Section 5 we demonstrate one of the applications, namely, dynamic load balancing. The rest of the applications are omitted in this paper. In Section 6 we present some preliminary experimental results from implementations of our algorithms.
2 Preliminaries
Consider an undirected connected agent graph G = (V, E) with n vertices and maximum degree d. Each vertex v_i represents an agent A_i and has weight w_i. We denote the distribution of weights w_i on the nodes of G by the weight vector ω. The potential Φ of the graph is defined to be (Σ_i w_i²) − n·w̄², where n is the number of nodes in G and w̄ = Σ_i w_i / n is the average weight. Given G and ω and any algorithm for IWM, let the potentials before and after the invocation of the algorithm be Φ and Φ′. Then the convergence factor for this algorithm is defined to be (Φ − Φ′)/Φ.
Let A denote the adjacency matrix of G. Define a matrix D = (d_{i,j}), where d_{i,j} = 0 if i ≠ j, and d_{i,i} is the degree of agent i. The matrix L = D − A is the Laplacian matrix of G. The eigenvalues of L are 0 = λ₁ ≤ λ₂ ≤ … ≤ λ_n.
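For concreteness, the Laplacian can be assembled directly from the edge list; a minimal sketch of our own (the helper name `laplacian` and plain nested lists, rather than any matrix library, are our choices):

```python
def laplacian(n, edges):
    """L = D - A for an undirected graph on vertices 0..n-1."""
    L = [[0] * n for _ in range(n)]
    for u, v in edges:
        L[u][u] += 1          # D: each endpoint's degree grows by one
        L[v][v] += 1
        L[u][v] -= 1          # -A: off-diagonal entries for the edge
        L[v][u] -= 1
    return L
```

Every row of L sums to zero, which is exactly the statement that v₁ = (1, 1, …, 1)ᵀ is an eigenvector of L with eigenvalue λ₁ = 0.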
Fact 1. G is a connected graph if and only if λ₂ > 0. It can be shown that for any connected graph with n vertices, λ₂ = Ω(1/n²).

Fact 2. From the Courant-Fischer Minimax Theorem, it follows that [MP92]

    λ₂ = min_{z ⊥ v₁} (zᵀ L z) / (zᵀ z),

where v₁ = (1, 1, …, 1)ᵀ is the eigenvector corresponding to λ₁ = 0 and z ⊥ v₁ means that the vector z is orthogonal to v₁.
3 Algorithm for IWM (Real
Weights)
In this section we present a local randomized algorithm for the IWM problem with real weights. Given a graph G and weight vector ω, Algorithm LocalRandom (LR) works as follows:

1. Pick a random matching M in G as follows:

   a. Each edge e is independently put in M with probability 1/p (p will be fixed later).

   b. Each edge (u, v) removes itself from M if (w, u) or (w, v) is in M for some w ∈ V.

2. For each edge (i, j) ∈ M (assuming without loss of generality w_i ≥ w_j), move (w_i − w_j)/2 load units from agent i to agent j.
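A direct transcription of Algorithm LR for real weights might look as follows. This is our own illustrative Python, not the paper's code: the function name `lr_step` and the edge-list graph representation are assumptions, and we fix p = 4d as in the analysis of Lemma 1 below:

```python
import random

def lr_step(edges, degree, w, rng):
    """One step of Algorithm LocalRandom (LR) with real weights."""
    p = 4 * degree
    # Step 1a: each edge enters M independently with probability 1/p.
    candidates = [e for e in edges if rng.random() < 1.0 / p]
    # Step 1b: an edge survives only if neither endpoint is shared
    # with another candidate edge, leaving a matching.
    count = {}
    for u, v in candidates:
        count[u] = count.get(u, 0) + 1
        count[v] = count.get(v, 0) + 1
    matching = [(u, v) for u, v in candidates if count[u] == 1 and count[v] == 1]
    # Step 2: equalize the weights across each matched edge.
    for u, v in matching:
        avg = (w[u] + w[v]) / 2.0
        w[u] = w[v] = avg
    return matching
```

Each call mutates `w` in place, conserves the total load, and never increases the potential, since each equalization drops Φ by (w_u − w_v)²/2.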
Lemma 1 For each edge e = (u, v), Pr( Algorithm LR picks e in M ) ≥ 1/(8d).
Proof. Fix an edge e = (u, v). Since e has at most 2(d − 1) adjacent edges, each a candidate with probability 1/p,

    Pr( e is in M in Step 1a and removed in Step 1b ) ≤ (1/p) · (2(d − 1)/p).

Therefore,

    Pr( e in M after Step 1b )
      = Pr( e in M after Step 1a and it is not removed in Step 1b )
      = Pr( e in M after Step 1a ) − Pr( e is in M after Step 1a and removed in Step 1b )
      ≥ 1/p − 2(d − 1)/p² = (p − 2d + 2)/p².

Now set p = 4d. Then, Pr( e in M after Step 1b ) ≥ (2d + 2)/(16d²) ≥ 1/(8d). ∎
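As a sanity check (ours, not the paper's), the bound can be observed empirically by sampling Step 1 many times on a small graph. For a star on four vertices, d = 3 and p = 12, and the lemma promises each edge a probability of at least 1/24; in fact, a star edge survives exactly when neither of the other two edges is a candidate, i.e., with probability (1/12)(11/12)² = 121/1728:

```python
import random

def sample_matching(edges, p, rng):
    """Steps 1a-1b of Algorithm LR: candidate edges, then conflict removal."""
    candidates = [e for e in edges if rng.random() < 1.0 / p]
    count = {}
    for u, v in candidates:
        count[u] = count.get(u, 0) + 1
        count[v] = count.get(v, 0) + 1
    return [(u, v) for u, v in candidates if count[u] == 1 and count[v] == 1]

rng = random.Random(0)
star = [(0, 1), (0, 2), (0, 3)]   # K_{1,3}: centre 0, so d = 3 and p = 4d = 12
d, p, trials = 3, 12, 100000
hits = sum(1 for _ in range(trials) if (0, 1) in sample_matching(star, p, rng))
estimate = hits / trials          # empirical Pr( edge (0,1) ends up in M )
```

The empirical frequency comfortably clears the 1/(8d) floor, as the lemma guarantees.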
Theorem 1 For any connected graph G and weight vector ω, the expected value of the convergence factor c_LR of Algorithm LR is at least λ₂/(16d).

Proof: Let ΔΦ be the drop in the total potential of G due to Algorithm LR. For each edge (i, j), let Δ_{i,j} be the drop in potential due to weight equalization between agents i and j if (i, j) is in the matching picked by LR.