Task Clustering and Scheduling to Multiprocessors with
Duplication
Li Guodong, Chen Daoxu, Wang Daming, Zhang Defu
Dept. of Computer Science, Nanjing University, Nanjing 210093, China
([email protected])
Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS'03), 0-7695-1926-1/03/$17.00 (C) 2003 IEEE
Abstract: Optimal task-duplication-based scheduling of tasks represented by a directed acyclic graph (DAG) onto a set of homogeneous distributed-memory processors is a strong NP-hard problem. In this paper we present a clustering and scheduling algorithm, TCSD, with time complexity O(v³logv), where v is the number of nodes, which is able to generate optimal schedules for some specific DAGs. For arbitrary DAGs, the schedule generated is at most twice as long as the optimal one. Simulation results show that the performance of TCSD is superior to those of four renowned algorithms: PY, TDS, TCS and CPFD.
1. Introduction
Task scheduling problems are NP-complete in the general case [1]. The non-duplication task scheduling problem has been extensively studied and various heuristics have been proposed in the literature [2-4]. A comparative study and evaluation of some of these algorithms has been presented [5]. These heuristics fall into a variety of categories such as list-scheduling algorithms [2], clustering algorithms [3], and guided random search methods [4]. A few research groups have studied the task scheduling problem for heterogeneous systems [6]. Duplication-based scheduling is a relatively new approach to the scheduling problem. There are several task-duplication-based scheduling schemes [7-16] which duplicate certain tasks in an attempt to minimize communication costs. The idea behind duplication-based scheduling algorithms is to schedule a task graph by mapping some of its tasks redundantly to reduce the inter-task communication overhead. These algorithms usually assume an unbounded number of identical processors and have much higher complexity than their alternatives.
Duplication-based scheduling problems have been shown to be NP-complete [10]. Thus, many proposed algorithms are based on heuristics. These algorithms can be classified into two categories in terms of the task duplication approach used [11]: Scheduling with Partial Duplication (SPD) [9][12][13][15] and Scheduling with Full Duplication (SFD) [7][8][10][11][14][16]. SPD algorithms duplicate only a limited number of parents of a node to achieve low complexity, while SFD algorithms attempt to duplicate all the parents of a join node. When the communication cost is high, the performance of SPD algorithms is low. SFD algorithms show better performance than SPD algorithms but have a higher complexity. Table 1 summarizes the characteristics of some well-known duplication-based scheduling algorithms.
Among these algorithms, the CPFD algorithm achieves the shortest makespan and uses relatively few processors, but at each step it spends a prohibitively long time testing all candidate processors and scanning through the whole time span of each processor.
Table 1. Task duplication scheduling algorithms
This paper proposes a novel SFD algorithm called TCSD. Our simulation study shows that the proposed algorithm achieves considerable performance improvement over existing algorithms while having lower time complexity. Theoretical analysis shows that TCSD matches the lowest schedule-length bound achieved by other algorithms so far. In addition, the number of processors consumed by the destination clusters is substantially decreased, and the performance degradation of TCSD as the number of available processors shrinks is graceful.
2. The Proposed TCSD Algorithm
2.1 Model and Notations
A parallel program is usually represented by a Directed Acyclic Graph (DAG), defined by the tuple (V, E, τ, c), where V, E, τ, c are the set of tasks, the set of edges, the set of computation costs associated with the tasks, and the set of communication costs associated with the edges, respectively. An edge e_{i,j} ∈ E represents the precedence constraint between tasks v_i and v_j. τ_i is the computation cost of task v_i and c_{i,j} is the communication cost of edge e_{i,j}. When two tasks v_i and v_j are assigned to the same processor, c_{i,j} is assumed to be zero, since intra-processor communication cost is negligible. Multiple entry nodes or exit nodes are allowed in the DAG. Figure 1 depicts an example DAG. The underlying target architecture is assumed to be homogeneous and the number of processors unbounded.
The term iparent is used to denote an immediate parent. The earliest start time, est_i, and the earliest completion time, ect_i, are the earliest times at which a task v_i starts and finishes its execution, respectively. The message arriving time from v_j to v_i, mat_{j,i}, is the time at which the message from v_j arrives at v_i. If v_j and v_i are scheduled on the same processor, mat_{j,i} equals ect_j; otherwise mat_{j,i} = ect_j + c_{j,i}. For a join node v_i, its arriving time mat_i = max{mat_{j,i} | v_j is v_i's iparent}. In addition, its critical iparent, denoted CIP(v_i), provides the largest mat to the join node. That is, v_j = CIP(v_i) if and only if mat_{j,i} ≥ mat_{k,i} for every iparent v_k of v_i with k ≠ j (if multiple nodes satisfy this constraint, one is selected arbitrarily). The critical iparent of an entry node is defined to be NULL. Among all of v_i's iparents residing on other processors, RIP(v_i) is the iparent of v_i whose mat is maximal (again, ties are broken arbitrarily). Clearly, when CIP(v_i) and v_i are not assigned to the same processor, CIP(v_i) = RIP(v_i).

Fig. 1. Example DAG.

After a task v_i is scheduled on a processor PE(v_i), the est of v_i on PE(v_i) equals its actual start time, ast(i, PE(v_i)), written ast_i when no confusion arises. After all tasks in a graph are scheduled, the schedule length, also called the makespan, is the largest finish time among the exit tasks. The objective of the scheduling problem
is to determine an assignment of tasks such that the minimal schedule length is obtained. A clustering of a DAG is a mapping of the nodes in V onto non-overlapping clusters, each of which contains a subset of V. A schedule S_i for a cluster C(v_i) is optimal if for every other schedule S_i' for C(v_i), makespan(S_i) ≤ makespan(S_i'). The makespan of a cluster C(v_i) is defined as the makespan of its optimal schedule. In addition, a schedule S is optimal for a clustering Ψ if for every other schedule S' for Ψ, makespan(S) ≤ makespan(S').
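To make the notation above concrete, the following Python sketch (ours, not part of the paper; the example DAG and all names are assumptions) computes est, ect and CIP values for a small graph under the model of this section, with every task on its own processor so that no communication cost is zeroed.

```python
# Illustrative sketch of the Section 2.1 model (our names, hypothetical DAG).
# tau[v] = computation cost; c[(j, i)] = communication cost of edge e_{j,i}.
tau = {'a': 2, 'b': 3, 'c': 1}
c = {('a', 'c'): 4, ('b', 'c'): 1}
parents = {'a': [], 'b': [], 'c': ['a', 'b']}

est, ect = {}, {}
for v in ('a', 'b', 'c'):                      # a topological order
    # mat_{j,v}: message from iparent v_j arrives at ect_j + c_{j,v}
    # (every task sits on its own processor here, so c is never zeroed)
    mats = {j: ect[j] + c[(j, v)] for j in parents[v]}
    est[v] = max(mats.values(), default=0)     # entry nodes start at time 0
    ect[v] = est[v] + tau[v]

# CIP(v_c): the iparent providing the largest message arriving time
cip = max(parents['c'], key=lambda j: ect[j] + c[(j, 'c')])
```

For this hypothetical graph, v_a's message (arriving at 2 + 4 = 6) dominates v_b's (3 + 1 = 4), so v_a is the critical iparent of v_c.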
2.2 TCSD
The basic idea behind TCSD is similar to the PY algorithm and the TCS algorithm in that the clustering is constructed dynamically, but TCSD differs from them in how it selects parent nodes to be duplicated and computes the start times of the nodes in the cluster under investigation. For each v_i ∈ V, we first compute its est_i by finding a cluster C(v_i) which allows v_i to start execution as early as possible when all nodes in C(v_i) are executed on the same processor.
The est values are computed in topological order of the nodes in V; a node is processed after all of its ancestors have been assigned est values. Consider the cluster C(v_i) and an arc e_{m,n} crossing C(v_i) such that v_n belongs to C(v_i) while v_m does not; obviously est_n ≥ est_m + τ_m + c_{m,n}. If v_m and v_n are on the critical path (in this case e_{m,n} is called the critical edge), then we attempt to cut the cost of e_{m,n} down to zero by assigning v_m and v_n to the same processor.
At each iteration, TCSD first attempts to identify the critical edge of the current cluster C(v_i), and then tentatively absorbs into C(v_i) the node from which the critical edge emanates, i.e. v_m, to decrease the length of the current critical path. However, this operation may increase the cluster's makespan even if v_m starts execution at est_m; in that case, the absorption is canceled.
Before a critical edge e_{m,n} is absorbed into C(v_i), mat_n = est_m + τ_m + c_{m,n}. After the insertion of v_m, mat_m = max({est_p + τ_p | e_{p,m} ∈ E and v_p ∈ C(v_i)} ∪ {est_p + τ_p + c_{p,m} | e_{p,m} ∈ E and v_p ∉ C(v_i)}). Denote the new mat of v_n after v_m's insertion as mat_n'; obviously the constraint mat_n' ≤ mat_n must hold after v_m's insertion. Furthermore, suppose v_p = RIP(v_m); then mat_n' ≥ mat_m + τ_m ≥ est_p + τ_p + c_{p,m} + τ_m, and it follows that est_m + c_{m,n} ≥ est_p + τ_p + c_{p,m}. In general, assume that v_1 = RIP(v_0), v_2 = RIP(v_1), ..., v_n = RIP(v_{n-1}), with v_0 ∈ C(v_i) and v_1, v_2, ..., v_n ∉ C(v_i); after v_1 is absorbed into cluster C(v_i), the nodes v_2, v_3, ..., v_k must also be absorbed into C(v_i) if the following inequality is true:

    est_1 + c_{1,0} < est_k + Σ_{l=2..k} τ_l + c_{k,k-1}    (1)
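As an illustration of inequality (1), the following sketch (ours; the chain layout and names are assumptions) walks a RIP chain v_0, v_1, ..., and reports how deep the forced absorption reaches, stopping at the first k for which the inequality fails.

```python
def forced_absorption_depth(est, tau, c):
    """Sketch of the inequality-(1) test along a RIP chain (our code).

    Chain node v_l is indexed by l: v_0 is already in the cluster, v_1 is the
    node being absorbed, v_2, ... are candidate forced absorptions.
    est[l], tau[l] are v_l's release time and computation cost;
    c[(k, k-1)] is the cost of edge e_{k,k-1}.
    Returns the largest k such that v_2..v_k must also be absorbed.
    """
    k_max = 1                               # v_1 itself is absorbed
    threshold = est[1] + c[(1, 0)]          # left-hand side of (1)
    comp_sum = 0                            # running sum of tau_2..tau_k
    for k in range(2, len(est)):
        comp_sum += tau[k]
        if threshold < est[k] + comp_sum + c[(k, k - 1)]:
            k_max = k                       # inequality (1) holds: absorb v_k
        else:
            break                           # first failure ends the chain walk
    return k_max
```

For instance, with est_1 = 5, c_{1,0} = 4, τ_2 = 2, c_{2,1} = 3, a late v_2 (est_2 = 8) forces its absorption while an early one (est_2 = 1) does not.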
TCSD inserts these nodes into C(v_i) in one step, rather than inserting them one by one, to save running time. We introduce a term called snapshot to model the temporary status of the cluster obtained so far. Formally, a snapshot of C(v_i), denoted SP(v_i), is defined by a tuple (V_in, E_cs, V_out, T_in), where V_in, E_cs, V_out, T_in are the set of nodes already in C(v_i), the set of edges crossing C(v_i), the set of nodes from which the edges in E_cs emanate, and the ests associated with the nodes in V_in, respectively. An edge e_{j,i} in E_cs represents the precedence constraint between the task v_j in V_out and the task v_i in V_in.
Procedure Calculate-EST(v_i) constructs C(v_i) and calculates the value of est_i by altering SP(v_i) gradually. For all entry nodes, the ests are zero. Initially, the critical path of C(v_i) includes the edge connecting v_i and its CIP v_j; thus the initial value of est_i is ect_j + c_{j,i} and the initial critical edge is e_{j,i}. The procedure iteratively absorbs the node on the current critical edge (also called the critical node) into C(v_i) and then finds a new critical edge, until the schedule length of C(v_i) cannot be improved any more. Note that at each iteration we first assume that the current critical node, i.e. v_m, is scheduled on PE(v_i) and starts execution at est_m. If such an absorption cannot yield an earlier start time for v_i, the procedure terminates and returns the set of nodes in V_in excluding v_m; otherwise it inserts v_m into V_in and schedules v_m to start execution at its ast. Then v_m's ancestors satisfying inequality (1) are inserted into V_in as well. After all these ancestors have been identified and inserted into V_in, we re-compute their mats. A new critical edge is then identified by calling compute_critical_path(SP(v_i)). If the updated critical path's length is less than v_i's original est, then est_i is updated to the new value.
Algorithm Calculate-EST(v_i)
begin
  if v_i is an entry node then return est_i = 0 and C(v_i) = {v_i};
  Let v_j = CIP(v_i), est_i = ect_j + c_{j,i} and critical_edge = e_{j,i};
  For the current SP(v_i), let V_in = {v_i}, V_out = ∅ and E_cs = ∅;
  repeat {
    Denote the current critical_edge as e_{m,n};
    V_in = V_in ∪ {v_m}, ast_m = est_m;  /* ast_m is v_m's release time */
    est_i' = compute_critical_path(SP(v_i));
    if est_i' > est_i then return C(v_i) and est_i;
    Let v_p = CIP(v_m) and span_sum = est_m + c_{m,n};
    Initialize comp_sum to zero, i.e. comp_sum = 0;
    repeat {
      comp_sum = comp_sum + τ_p;
      if span_sum < est_p + c_{p,m} + comp_sum then {
        E_cs = E_cs − {e_{p,x} | e_{p,x} ∈ E and v_x ∈ V_in};
        V_in = V_in ∪ {v_p};
        V_out = (V_out − {v_p}) ∪ {v_q | e_{q,p} ∈ E and v_q ∉ V_in};
        E_cs = E_cs ∪ {e_{q,p} | e_{q,p} ∈ E and v_q ∉ V_in};
        Let v_m = v_p and v_p = CIP(v_m);
      } else exit the inner loop;
    } until v_p is NULL;
    est_i' = compute_critical_path(SP(v_i));
    est_i = min{est_i, est_i'};
    if est_i' ≤ est_i then C(v_i) = V_in;
  } until critical_edge = NULL;
end
Given the mat of each node in C(v_i), procedure compute_critical_path(SP(v_i)) computes the length of the critical path of C(v_i) and identifies the new critical edge. Procedure compute_node_mat(v_x, SP(v_i)) calculates the new mats of the nodes in C(v_i). Note that, when the mat of each node in C(v_i) is available, the optimal schedule of these nodes on one processor, i.e. the one achieving the minimal finish time of v_i, is to execute the nodes in nondecreasing order of their mats.
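The nondecreasing-mat ordering rule can be sketched as follows (our illustration, not the paper's code; mat and τ values are assumed inputs). A node whose start is pinned to its own mat, rather than to the previous node's finish, marks where the critical path enters the cluster from outside.

```python
def sequential_schedule(mat, tau):
    """Schedule nodes on one processor in nondecreasing order of mat (sketch).

    Returns a dict {node: ast} of actual start times, plus the last node
    whose start was pinned to its mat (the in-cluster endpoint of the
    current critical edge, in the spirit of compute_critical_path).
    """
    order = sorted(mat, key=mat.get)        # nondecreasing mat order
    ast, prev_ect, pinned = {}, 0, order[0]
    for v in order:
        if prev_ect < mat[v]:               # processor idles waiting for v's message
            ast[v] = mat[v]
            pinned = v                      # start pinned by an incoming message
        else:                               # v starts right after its predecessor
            ast[v] = prev_ect
        prev_ect = ast[v] + tau[v]
    return ast, pinned
```

For example, with mats {a: 0, b: 1, c: 10} and τ {a: 3, b: 2, c: 1}, node b starts at a's finish (time 3), while c must wait for its message at time 10 and becomes the pinned node.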
Procedure compute_critical_path(SP(v_i))
begin
  Unmark all nodes in the V_in of SP(v_i);
  while there is an unmarked node in V_in do {
    Select an arbitrary unmarked node v_x;
    Call compute_node_mat(v_x, SP(v_i)) to compute mat_x;
  }
  Suppose V_in contains v_1, v_2, ..., v_n (in nondecreasing order of their mats);
  Let ast_1 = mat_1 and critical_edge = NULL;
  for j = 2 to n do {
    if ast_{j-1} + τ_{j-1} < mat_j then let ast_j = mat_j and critical_edge = e_{RIP(v_j),j};
    else ast_j = ast_{j-1} + τ_{j-1};
  }
  return the critical_edge and the schedule length, which is equal to ast_n;
end
Procedure compute_node_mat(v_x, SP(v_i))
begin
  if v_x is an entry node then mark v_x and return 0;
  if not all of v_x's iparents are marked then
    for each unmarked iparent of v_x, i.e. v_p, compute mat_p by calling compute_node_mat(v_p, SP(v_i));
  Compute v_x's message arriving time mat_x; note that if v_p is in V_in, mat_{p,x} is equal to ect_p, otherwise mat_{p,x} = ect_p + c_{p,x};
  mark v_x;
  return mat_x;
end
The running trace in Figure 3 illustrates the working of TCSD on the DAG in Figure 1.
The following algorithm constructs Ψ(G) by visiting the nodes in G in reverse topological order. Initially Ψ(G) = ∅ and all exit nodes are marked. We then add clusters into Ψ(G) in descending order of their makespans. After a cluster C(v_i) has been scheduled onto a processor, if the actual start time of a node in C(v_i), e.g. v_j, is less than the value stored in cur_ast_j, the new value replaces the old one. For any edge e_{m,n} crossing C(v_i), if ast_m − est_m > makespan(Ψ) − ect_i, then mark v_m and assign C(v_m) to a new processor; otherwise v_n receives its data from the copy of v_m starting execution at ast_m, which eliminates the need to consume a new processor.
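The processor-saving test just described amounts to a one-line slack comparison; the following sketch (ours, with assumed names) states it explicitly.

```python
def needs_new_processor(ast_m, est_m, makespan, ect_i):
    """Sketch of the crossing-edge test (our code, names are ours).

    For a crossing edge e_{m,n}: reuse the already-scheduled copy of v_m
    whenever delaying v_m's data from est_m to ast_m still fits into the
    slack makespan - ect_i; otherwise v_m's cluster needs a new processor.
    """
    return ast_m - est_m > makespan - ect_i
```

For example, with a makespan of 20 and ect_i = 19 the slack is 1, so a copy of v_m delayed by 2 time units (ast_m = 7, est_m = 5) forces a new processor, while a delay of 1 does not.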
Algorithm Duplication-Based-Schedule(DAG G)
begin
  Compute the est of each node by calling procedure Calculate-EST;
  Mark exit nodes in V and unmark all other nodes in V;
  Initialize each node's cur_ast to makespan(Ψ(G));
  while there exists a marked node do {
    Select the marked node with the largest ect, i.e. v_i;
    Add C(v_i) into Ψ(G) and unmark v_i;
    for each node in C(v_i), i.e. v_j, do {
      Suppose the actual start time of v_j in C(v_i) is ast(j, PE(v_i));
      Let cur_ast_j = min{ast(j, PE(v_i)), cur_ast_j};
    }
    for each edge crossing C(v_i), i.e. e_{m,n}, do {
      if ast_m − est_m > makespan(Ψ) − ect_i then mark v_m;
    }
  }
end
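The cur_ast bookkeeping in the loop above keeps, for each node, the earliest actual start time observed over all clusters the node has been duplicated into; a minimal sketch (ours):

```python
def update_cur_ast(cur_ast, cluster_ast):
    """Fold one cluster's actual start times into the global cur_ast map.

    cur_ast: {node: earliest ast seen so far}; cluster_ast: {node: ast of
    that node's copy in the cluster just scheduled}. Our sketch, not the
    paper's code; unseen nodes default to infinity.
    """
    for v, ast in cluster_ast.items():
        cur_ast[v] = min(cur_ast.get(v, float('inf')), ast)
    return cur_ast
```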
The schedule generated by algorithm Duplication-Based-Schedule is shown in Figure 4. The first cluster added into Ψ(G) is C(v_10); then follows C(v_9). Note that v_5 and v_8 in C(v_9) can receive data from their iparents v_1 and v_4 in C(v_10), respectively. Finally, the remaining clusters are inserted into Ψ(G).
Assume that the number of nodes in G is v. Algorithm Duplication-Based-Schedule has complexity O(v²), where the O(v) factor comes from updating the cur_ast values and identifying the next cluster. The complexity of procedure compute_node_mat is O(vlogv), where the O(logv) factor comes from computing the new mat of a node in C(v_i). Procedure compute_critical_path runs in O(vlogv) time. In the worst case, v nodes are absorbed into C(v_i), making Calculate-EST run in O(v²logv) time. Hence, the overall time complexity of constructing v clusters is O(v³logv).
Lemma 1. If multiple nodes in cluster C(v_i) have their start times retarded from their original mats to new mats, the incurred delay of makespan(C(v_i)), denoted D_i, satisfies

    D_i ≤ max{mat_x' − mat_x | v_x ∈ C(v_i)},

where mat_x and mat_x' are the original value and the new value of mat_x, respectively.

Proof. The proofs of this lemma and of the following theorems are omitted due to limited space.
Theorem 1. Provided that every edge e_{m,n} crossing C(v_i) satisfies ast_m − est_m ≤ makespan(Ψ) − ect_i, v_n can start execution at time ast_n while Ψ's makespan will not increase.

Theorem 1 justifies the processor-saving operation in Duplication-Based-Schedule.
2.3 Properties of the Proposed Algorithm
Theorem 2. For out-tree DAGs (i.e. ordinary trees),
the schedules generated by TCSD are optimal.
Lemma 2. Consider a one-level in-tree (i.e. an inverted ordinary tree) consisting of v_1, v_2, ..., v_{i-1}, v_i such that v_1, v_2, ..., v_{i-1} are the iparents of v_i and have individual release times est_1, est_2, ..., est_{i-1}. Provided that est_1 + τ_1 + c_{1,i} ≥ est_2 + τ_2 + c_{2,i} ≥ ... ≥ est_{i-1} + τ_{i-1} + c_{i-1,i}, TCSD generates an optimal schedule whose length is equal to max{makespan({v_1, v_2, ..., v_j}), est_{j+1} + τ_{j+1} + c_{j+1,i}} + τ_i, where makespan({v_1, v_2, ..., v_j}) ≤ τ_j + c_{j,i} and makespan({v_1, v_2, ..., v_j, v_{j+1}}) > τ_{j+1} + c_{j+1,i}.

When est_1, est_2, ..., est_{i-1} are all equal to zero, this problem reduces to the case of a general single-level in-tree, which has been investigated in previous research [11].

Fig. 3. Computing the est of each node.
Theorem 3. For fork-join DAGs (diamond DAGs), the schedules generated by TCSD are optimal.

We adopt the definition of DAG granularity given in [3][14].

Theorem 4. For coarse-grain DAGs, the schedules generated by TCSD are optimal.
Theorem 5. For arbitrary DAGs, the schedules generated by TCSD are at most twice as long as the optimal ones. Moreover, if the granularity of the DAG is larger than (1−ε)/ε for 0 < ε ≤ 1, the schedule length generated is at most (1+ε) times the optimal one.
Fig. 4. Schedule generated by TCSD for the DAG depicted in Fig. 1.
3. Performance Evaluation
In this section we compare the performance of
TCSD with four existing scheduling algorithms, i.e.
CPFD, PY, TCS and TDS. Comparing results of some
of these algorithms with other algorithms including
LC, BTDH, LCTD, DSH, LWB, etc., can be found in
literature [11][16].
These four algorithms, along with TCSD, are applied to diverse sets of applications with varying characteristics. In particular, the input DAGs are generated randomly, with the number of tasks ranging from 200 to 800 nodes and the number of edges varying from 2,000 to 4,000. Additionally, the number of predecessors and successors varies from one to 100, and the computation costs vary from one to 1,500 time units. The CCR of a DAG is defined as its average communication cost divided by its average computation cost; in the tests the CCR values used are 0.1, 1.0, 5.0 and 10.0.
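The CCR definition above translates directly into code; a minimal sketch (ours, with assumed cost maps):

```python
def ccr(tau, c):
    """Communication-to-computation ratio of a DAG (our sketch).

    tau: {node: computation cost}; c: {edge: communication cost}.
    CCR = average edge cost / average node cost, per the definition above.
    """
    avg_comm = sum(c.values()) / len(c)
    avg_comp = sum(tau.values()) / len(tau)
    return avg_comm / avg_comp
```

For instance, two tasks costing 2 and 4 joined by an edge costing 6 give a CCR of 6 / 3 = 2.0, i.e. a communication-dominated graph.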
We generate 2000 random DAGs for testing. The first comparison examines the schedule lengths of the four algorithms against TCSD in terms of normalized schedule lengths (NSLs), where an NSL is obtained by dividing the output schedule length by the sum of computation costs on the critical path [11][13]. Table 2 shows the average NSLs produced by the five algorithms. TDS is an SPD algorithm with relatively low time complexity, but its schedule lengths are much longer than those of its counterparts. TCS and PY behave rather similarly and achieve comparable schedule lengths. CPFD is the most time-consuming algorithm but produces better schedules than TDS, TCS and PY. However, TCSD outperforms all of these algorithms on this metric, and its running time is a factor of logv lower than that of CPFD.
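The NSL metric just described can be sketched as follows (our code; graph representation and names are assumptions). The denominator is the largest sum of computation costs along any path, ignoring communication costs.

```python
def nsl(makespan, tau, parents, exits):
    """Normalized schedule length (our sketch of the metric above).

    tau: {node: computation cost}; parents: {node: list of iparents};
    exits: list of exit nodes. NSL = makespan / (computation-cost sum
    along the heaviest path, i.e. the critical path without edge costs).
    """
    memo = {}

    def comp_path(v):
        # heaviest computation-cost path ending at v (memoized recursion)
        if v not in memo:
            memo[v] = tau[v] + max((comp_path(p) for p in parents[v]), default=0)
        return memo[v]

    return makespan / max(comp_path(v) for v in exits)
```

For the hypothetical three-node join (τ = 2, 3, 1), the heaviest path costs 3 + 1 = 4, so a schedule of length 8 has an NSL of 2.0.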
Algorithms are generally more sensitive to the value of CCR than to the number of nodes. Table 3 shows the ratio of the makespans generated by TCS, TDS, PY and CPFD over that of TCSD. It may be noted that the differences between the performances of the various algorithms become more significant as CCR grows.
Table 2. Average NSLs of five algorithms for
random DAGs with various numbers of nodes
Table 3. Ratio of makespans generated by TCS, TDS, PY and CPFD over that of TCSD
Table 4 shows the result of the comparison between each pair of algorithms. Each entry of the table consists of three elements in the format ">x, =y, <z", which means that the algorithm in a row provides a longer parallel time than the algorithm in that column x times, the same parallel time y times, and a shorter parallel time z times. For instance, among 1000 cases, TCSD outperforms CPFD in 503 cases and achieves the same makespan as CPFD in 352 cases; there are 145 cases in which TCSD is inferior to CPFD in terms of performance.
Table 4. Algorithm comparison in terms of better, worse, and equal performance

Number of Nodes | TCS | TDS | PY | CPFD
... | ... | ... | ... | >488, =352, <160
4. Conclusion

This paper presents a novel duplication-based algorithm for clustering and scheduling DAGs on fully connected homogeneous multiprocessors. The algorithm uses task duplication to reduce the length of the schedule. Our performance study showed that it has a relatively low complexity compared to SFD algorithms, while outperforming both SPD and SFD algorithms by generating schedules with much shorter schedule lengths.

References

[1] V. Sarkar, Partitioning and Scheduling Parallel Programs for Execution on Multiprocessors, Cambridge, Mass.: MIT Press, 1989.
[2] A. Radulescu and A. van Gemund, "Low-Cost Task Scheduling for Distributed-Memory Machines", IEEE Trans. Parallel and Distributed Systems, Vol. 13, No. 6, June 2002, pp. 648-658.
[3] A. Gerasoulis and T. Yang, "On the Granularity and Clustering of Directed Acyclic Task Graphs", IEEE Trans. Parallel and Distributed Systems, Vol. 4, No. 6, 1993, pp. 686-701.
[4] J. Gu, W. Shu and M.-Y. Wu, "Efficient Local Search for DAG Scheduling", IEEE Trans. Parallel and Distributed Systems, Vol. 12, No. 6, June 2001, pp. 617-627.
[5] Y.-K. Kwok and I. Ahmad, "Benchmarking and Comparison of the Task Graph Scheduling Algorithms", J. Parallel and Distributed Computing, Vol. 59, 1999, pp. 381-422.
[6] H. Topcuoglu, S. Hariri and M.-Y. Wu, "Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing", IEEE Trans. Parallel and Distributed Systems, Vol. 13, No. 3, Mar. 2002, pp. 260-274.
[7] Y.-C. Chung and S. Ranka, "Application and Performance Analysis of a Compile-Time Optimization Approach for List Scheduling Algorithms on Distributed-Memory Multiprocessors", Proc. Supercomputing '92, Nov. 1992, pp. 512-521.
[8] B. Kruatrachue and T. G. Lewis, "Grain Size Determination for Parallel Processing", IEEE Software, Jan. 1988, pp. 23-32.
[9] J.-Y. Colin and P. Chretienne, "C.P.M. Scheduling with Small Communication Delays and Task Duplication", Operations Research, 1991, pp. 680-684.
[10] C. Papadimitriou and M. Yannakakis, "Towards an Architecture-Independent Analysis of Parallel Algorithms", SIAM J. Computing, Vol. 19, 1990, pp. 322-328.
[11] I. Ahmad and Y.-K. Kwok, "On Exploiting Task Duplication in Parallel Program Scheduling", IEEE Trans. Parallel and Distributed Systems, Vol. 9, No. 9, Sep. 1998, pp. 872-892.
[12] S. Darbha and D. P. Agrawal, "Optimal Scheduling Algorithm for Distributed-Memory Machines", IEEE Trans. Parallel and Distributed Systems, Vol. 9, No. 1, Jan. 1998, pp. 87-95.
[13] G.-L. Park, B. Shirazi and J. Marquis, "Mapping of Parallel Tasks to Multiprocessors with Duplication", Proc. of the 12th International Parallel Processing Symposium (IPPS'98), 1998.
[14] M. A. Palis, J.-C. Liou and D. S. L. Wei, "Task Clustering and Scheduling for Distributed Memory Parallel Architectures", IEEE Trans. Parallel and Distributed Systems, Vol. 7, No. 1, Jan. 1996, pp. 46-55.
[15] S. Ranaweera and D. P. Agrawal, "A Scalable Task Duplication Based Scheduling Algorithm for Heterogeneous Systems", Proc. International Conference on Parallel Processing (ICPP'00), 2000.
[16] B. Shirazi, H. B. Chen and J. Marquis, "Comparative Study of Task Duplication Static Scheduling versus Clustering and Non-Clustering Techniques", Concurrency: Practice and Experience, Vol. 7, Aug. 1995, pp. 371-389.