Page 1: An Efficient Algorithm for Real-Time Divisible Load Scheduling

An Efficient Algorithm for Real-Time Divisible Load Scheduling

Anwar Mamat, Ying Lu, Jitender Deogun, Steve Goddard
Department of Computer Science and Engineering
University of Nebraska - Lincoln
Lincoln, NE 68588

{anwar, ylu, deogun, goddard}@cse.unl.edu

Abstract

Providing QoS and performance guarantees to arbitrarily divisible loads has become a significant problem for many cluster-based research computing facilities. While progress is being made in scheduling arbitrarily divisible loads, current approaches are not efficient and do not scale well. In this paper, we propose a linear algorithm for real-time divisible load scheduling. Unlike existing approaches, the new algorithm relaxes the tight coupling between the task admission controller and the task dispatcher. By eliminating the need to generate exact schedules in the admission controller, the algorithm avoids high overhead. We experimentally evaluate the new algorithm. Simulation results demonstrate that the algorithm scales well, can schedule large numbers of tasks efficiently, and performs similarly to existing approaches in terms of providing real-time guarantees.

1 Introduction

An arbitrarily divisible or embarrassingly parallel workload can be partitioned into an arbitrarily large number of load fractions. This workload model is a good approximation of many real-world applications [12], e.g., distributed search for a pattern in text, audio, graphical, and database files; distributed processing of big measurement data files; and many simulation problems. All elements in such an application often demand an identical type of processing, and relative to the huge total computation, the processing on each individual element is infinitesimally small. Quite a few scientific applications conform to this divisible load task model. For example, the CMS (Compact Muon Solenoid) [10] and ATLAS (A Toroidal LHC Apparatus) [6] projects, associated with the LHC (Large Hadron Collider) at CERN (European Laboratory for Particle Physics), execute cluster-based applications with arbitrarily divisible loads. As such applications become a major type of cluster workload [27], providing QoS to arbitrarily divisible loads becomes a significant problem for cluster-based research computing facilities like the U.S. CMS Tier-2 sites [28]. By monitoring the CMS mailing-list, we have learned that CMS users always want to know task response times when they submit tasks to clusters. However, without a good QoS mechanism, current cluster sites cannot provide these users good response time estimations.

Table 1: Sizes of OSG Clusters.

Host Name                        No. of CPUs
fermigrid1.fnal.gov                    41863
osgserv01.slac.stanford.edu             9103
lepton.rcac.purdue.edu                  7136
cmsosgce.fnal.gov                       6942
osggate.clemson.edu                     5727
grid1.oscer.ou.edu                      4169
osg-gw-2.t2.ucsd.edu                    3804
u2-grid.ccr.buffalo.edu                 2104
red.unl.edu                             1140

Real-time divisible load scheduling is a well-researched area [8, 9, 17, 18, 19, 20]. Focusing on satisfying QoS, providing real-time guarantees, and better utilizing cluster resources, existing approaches give little emphasis to scheduling efficiency. They assume that scheduling takes much less time than the execution of a task, and thus ignore the scheduling overhead.

However, clusters are becoming increasingly bigger and busier. In Table 1, we list the sizes of some OSG (Open Science Grid) clusters. As we can see, all of these clusters have more than one thousand CPUs, with the largest providing over 40 thousand CPUs. Figure 1 shows the number of tasks waiting in the OSG cluster at the University of California, San Diego for two 20-hour periods, demonstrating that at times there could be as many as 37 thousand tasks in the waiting queue of a cluster. As the cluster size and workload increase, so does the scheduling overhead. For a cluster with thousands of nodes or thousands of waiting tasks, as will be demonstrated in Section 5, the scheduling overhead could be substantial and existing divisible load scheduling algorithms are no longer applicable due to lack of scalability. For example, to schedule the bursty workload in Figure 1a, the best-known real-time algorithm [8] takes more than 11 hours to make admission control decisions on the 14,000 tasks that arrived in an hour, while our new algorithm needs only 37 minutes.

In this paper, we address the deficiency of existing approaches and present an efficient algorithm for real-time divisible load scheduling. The time complexity of the proposed algorithm is linear in the maximum of the number of tasks in the waiting queue and the number of nodes in the cluster. In addition, the algorithm performs similarly to previous algorithms in terms of providing


Figure 1: Status of a UCSD Cluster. Panels (a) and (b): Number of Waiting Tasks vs. Time (*10 minutes).

real-time guarantees and utilizing cluster resources.

The remainder of this paper is organized as follows. Related work is presented in Section 2. We describe both task and system models in Section 3. Section 4 discusses the real-time scheduling algorithm and Section 5 evaluates the algorithm performance. We conclude the paper in Section 6.

2 Related Work

Divisible load theory (DLT) has long been studied and applied in distributed systems scheduling [7, 27, 29]. It provides the foundation for optimally partitioning arbitrarily divisible loads to distributed resources. These workloads represent a broad variety of real-world applications in cluster and grid computing, such as BLAST (Basic Local Alignment Search Tool) [2], a bioinformatics application, and high energy and particle physics applications in the ATLAS (A Toroidal LHC Apparatus) [6] and CMS (Compact Muon Solenoid) [10] projects. Currently, large clusters usually use batch schedulers [14] to handle their workloads. A batch scheduler monitors the task execution and sends queued jobs to a cluster node when it becomes available. The goal is to optimize the system utilization. Satisfying QoS and task real-time constraints is, however, not a major objective of such schedulers. Those clusters, therefore, cannot provide users good response time estimations.

The scheduling models investigated for distributed or multiprocessor systems most often (e.g., [1, 5, 15, 16, 23, 25, 26]) assume periodic or aperiodic sequential jobs that must be allocated to a single resource and executed by their deadlines. With the evolution of cluster computing, researchers have begun to investigate real-time scheduling of parallel applications on a cluster [3, 4, 13, 24, 30]. However, most of these studies assume the existence of some form of task graph to describe communication and precedence relations between computational units called subtasks (i.e., nodes in a task graph).

One closely related work is scheduling “scalable real-time tasks” in a multiprocessor system [17]. Similar to a divisible load, a scalable task can be executed on more than one processor and, as more processors are allocated to it, its pure computation time decreases. If we use N to represent the number of processors and n to denote the number of tasks waiting in the system, the time complexity of the most efficient algorithms proposed in [17] (i.e., MWF-FA and EDF-FA) is O(n² + nN).

There has been some existing work on cluster-based real-time divisible load scheduling [8, 9], including our own previous work [18, 19]. In [19], we developed several scheduling algorithms for real-time divisible loads. Following those algorithms, a task must wait until a sufficient number of processors become available. This could cause a waste of processing power, as some processors are idle while the system is waiting for enough processors to become available. This system inefficiency is referred to as Inserted Idle Times (IITs) [18]. To reduce or completely eliminate IITs, several algorithms have been developed [8, 9, 18], which enable a task to utilize processors at different processor available times. These algorithms lead to better cluster utilizations, but have high scheduling overheads. The time complexity of the algorithms proposed in [8, 9] is O(nN log N) and the algorithm in [18] has a time complexity of O(nN³).

In this paper, we propose an efficient algorithm for scheduling real-time divisible loads in clusters. Similar to the algorithms in [8, 9, 18], our new algorithm eliminates IITs. Furthermore, with a time complexity of O(max(N, n)), the algorithm is efficient and scales well to large clusters.

3 Task and System Models

In this paper, we adopt the same task and system models as those in the previous work [8, 9, 18, 19]. For completeness, we briefly present these below.

Task Model. We assume a real-time aperiodic task model in which each aperiodic task τi consists of a single invocation specified by the tuple (A, σ, D), where A is the task arrival time, σ is the total data size of the task, and D is its relative deadline. The task absolute deadline is given by A + D.
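The task tuple above can be sketched as a small data type; this is a minimal Python illustration of ours, not code from the paper:

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Aperiodic divisible task tau = (A, sigma, D) from the task model."""
    A: float      # arrival time
    sigma: float  # total data size of the task
    D: float      # relative deadline

    def absolute_deadline(self) -> float:
        # The task must finish by A + D.
        return self.A + self.D

t = Task(A=10.0, sigma=100.0, D=50.0)
print(t.absolute_deadline())  # 60.0
```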

System Model. A cluster consists of a head node, denoted by P0, connected via a switch to N processing nodes, denoted by P1, P2, . . . , PN. We assume that all processing nodes have the same computational power and all links from the switch to the processing nodes have the same bandwidth. The system model assumes a typical cluster environment in which the head node does not participate in computation. The role of the head node is


to accept or reject incoming tasks, execute the scheduling algorithm, divide the workload and distribute data chunks to processing nodes. Since different nodes process different data chunks, the head node sequentially sends every data chunk to its corresponding processing node via the switch. We assume that data transmission does not occur in parallel.¹ Therefore, only after the head node P0 becomes available can a processing node get its data transmission started for a new task. Since for arbitrarily divisible loads, tasks and subtasks are independent, there is no need for processing nodes to communicate with each other.

According to divisible load theory, linear models are used to represent processing and transmission times [29]. In the simplest scenario, the computation time of a load σ is calculated by a cost function Cp(σ) = σCps, where Cps represents the time to compute a unit of workload on a single processing node. The transmission time of a load σ is calculated by a cost function Cm(σ) = σCms, where Cms is the time to transmit a unit of workload from the head node to a processing node.

4 Algorithm

In this section, we present our new algorithm for scheduling real-time divisible loads in clusters. Due to their special property, when scheduling arbitrarily divisible loads, the algorithm needs to make three important decisions. First, it determines the task execution order, which could be based on policies like EDF (Earliest Deadline First) or MWF (Maximum Workload derivative First) [17]. Second, it decides the number n of processing nodes that should be allocated to each task. Third, a strategy is chosen to partition the task among the allocated n nodes.

As is typical for dynamic real-time scheduling algorithms [11, 22, 25], when a task arrives, the scheduler determines if it is feasible to schedule the new task without compromising the guarantees for previously admitted tasks. Only those tasks that pass this schedulability test are allowed to enter the task waiting queue (TWQ). This decision module is referred to as the admission controller. When processing nodes become available, the dispatcher module partitions each task and dispatches subtasks to execute on processing nodes. Both modules, the admission controller and the dispatcher, run on the head node.

For existing divisible load scheduling algorithms [8, 9, 17, 18, 19], in order to perform the schedulability test, the admission controller generates a new schedule for the newly arrived task and all tasks waiting in TWQ. If the schedule is feasible, the new task is accepted; otherwise, it is rejected. For these algorithms, the dispatcher acts as an execution agent, which simply implements the feasible schedule developed by the admission controller. There are two factors that contribute to large overheads of these algorithms. First, to make an admission control decision, they reschedule tasks in TWQ. Second, they calculate in the admission controller the minimum number nmin of nodes required to meet a task's deadline so that it guarantees enough resources for each task. The later a task starts, the more nodes are needed to complete it before its deadline. Therefore, if a task is rescheduled to start at a different time, the nmin of the task may change and needs to be recomputed. This process of rescheduling and recomputing nmin of waiting tasks introduces a big overhead.

¹ It is straightforward to generalize our model and include the case where some pipelining of communication may occur.

To address the deficiency of existing approaches, we develop a new scheduling algorithm, which relaxes the tight coupling between the admission controller and the dispatcher. As a result, the admission controller no longer generates an exact schedule, avoiding the high overhead. To carry out the schedulability test, instead of computing nmin and deriving the exact schedule, the admission controller assumes that tasks are executed one by one with all processing nodes. This simple and efficient all nodes assignment (ANA) policy speeds up the admission control decision. The ANA is, however, impractical. In a real-life cluster, resources are shared and each task is assigned just enough resources to satisfy its needs. For this reason, when dispatching tasks for execution, our dispatcher needs to adopt a different node assignment strategy. If we assume ANA in the admission controller and let the dispatcher apply the minimum node assignment (MNA) policy, we reduce the real-time scheduling overhead but still allow the cluster to have a schedule that is appealing in the practical sense. Furthermore, our dispatcher dispatches a subtask as soon as a processing node and the head node become available, eliminating IITs.

Due to the superior performance of EDF-based divisible load scheduling [19], our new algorithm schedules tasks in EDF order as well.² In the following, we describe in detail the two modules of the algorithm: the admission controller (Section 4.1) and the dispatcher (Section 4.2). Since the two modules follow different rules, sometimes an adjustment of the admission controller is needed to resolve their discrepancy so that task real-time properties can always be guaranteed (Section 4.3). Section 4.4 proves the correctness of our algorithm.

4.1 Admission Controller

When a new task arrives, the admission controller determines if it is feasible to schedule the new task without compromising the guarantees for previously admitted tasks. In the previous work [8, 9, 17, 18, 19, 20], the admission controller follows a brute-force approach, which inserts the new task into TWQ, reschedules each task and generates a new schedule. Depending on the feasibility of the new schedule, the new task is either accepted or rejected. As we can see, both accepting and rejecting a task involve generating a new schedule.

In this paper, we make two significant changes in order to develop a new admission control algorithm. First,

² Although in this paper we describe the algorithm assuming EDF scheduling, the idea is applicable to other divisible load scheduling such as MWF-based scheduling algorithms [17].


to determine the schedulability of a new task, we only check the information recorded with the two adjacent tasks (i.e., the preceding and succeeding tasks). Unlike the previous work, our new algorithm could reject a task without generating a new schedule. This significantly reduces the scheduling overhead for heavily loaded systems. Second, we separate the admission controller from the dispatcher, and to make admission control decisions, an ANA policy is assumed.

The new admission control algorithm is called AC-FAST. Algorithm 1 presents its pseudo code. The admission controller assumes an ANA policy. We use E and C to respectively denote the task execution time and the task completion time. AC-FAST partitions each task following the divisible load theory (DLT), which states that the optimal execution time is obtained when all nodes allocated to a task complete their computation at the same time [29]. Applying this optimal partitioning, we get the execution time of running a task τ(A, σ, D) on N processing nodes as [19],

E(σ, N) = (1 − β)/(1 − β^N) · σ(Cms + Cps),   (1)

where β = Cps/(Cms + Cps).   (2)
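Eqs. (1) and (2) translate directly into code. The following is a small Python sketch of ours; the function names and the unit costs Cms = 1, Cps = 3 in the example are hypothetical:

```python
def beta(cms: float, cps: float) -> float:
    # Eq. (2): fraction of the per-unit cost spent on computation.
    return cps / (cms + cps)

def exec_time(sigma: float, n: int, cms: float, cps: float) -> float:
    # Eq. (1): optimal execution time of load sigma on n processing nodes,
    # assuming all n nodes finish their computation at the same time.
    b = beta(cms, cps)
    return (1 - b) / (1 - b ** n) * sigma * (cms + cps)

# Sanity check: with a single node, Eq. (1) reduces to sigma * (Cms + Cps).
print(exec_time(10.0, 1, cms=1.0, cps=3.0))  # 40.0
```

Note that E(σ, N) decreases as N grows, which is what makes the all nodes assignment the most optimistic (fastest) schedule the admission controller can assume.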

When a new task τ arrives, the algorithm first checks if the head node P0 will be available early enough to at least finish τ's data transmission before τ's absolute deadline. If not so, task τ is rejected (lines 1-4). As the next step, task τ is tentatively inserted into TWQ following EDF order and τ's two adjacent tasks τs and τp (i.e., the succeeding and the preceding tasks) are identified (lines 5-6). By using the information recorded with τs and τp, the algorithm further tests the schedulability. First, to check whether accepting τ will violate the deadline of any admitted task, the algorithm compares τ's execution time τ.E with its successor τs's slackmin, which represents the minimum slack of all tasks scheduled after τ. Next, we give the formal definition of slackmin. Let S denote the task start time. A task's slack is defined as,

slack = A + D − (S + E),   (3)

which reflects the scheduling flexibility of a task. Starting a task slack time units later does not violate its deadline. Therefore, as long as τ's execution time is no more than the slack of any succeeding task, accepting τ will not violate any admitted task's deadline. We define τi.slackmin as the minimum slack of all tasks scheduled after τi−1. That is,

τi.slackmin = min(τi.slack, τi+1.slack, · · · , τn.slack).   (4)

If τ's execution time is less than its successor τs's slackmin, accepting τ will not violate any task's deadline (lines 7-10).

The algorithm then checks if task τ's deadline can be satisfied or not, i.e., it checks if τ.(A + D − S) ≥ τ.E, where the task start time τ.S is the preceding task's completion time τp.C or τ's arrival time τ.A (lines 11-31). If there is a task in TWQ, then the cluster is busy. For a busy cluster, we do not need to resolve the discrepancy between the admission controller and the dispatcher and the task real-time properties are still guaranteed (see Section 4.4 for a proof). However, if TWQ becomes empty, the available resources could become idle and the admission controller must consider this resource idleness. As a result, in our AC-FAST algorithm, when a new task τ arrives into an empty TWQ, an adjustment is made (lines 15-17). The purpose is to resolve the discrepancy between the admission controller and the dispatcher so that the number of tasks admitted will not exceed the cluster capacity. For a detailed discussion of this adjustment, please refer to Section 4.3. Once a new task τ is admitted, the algorithm inserts τ into TWQ and modifies the slackmin and the estimated completion time of tasks scheduled after τ (lines 22-31).

Time Complexity Analysis. In our AC-FAST algorithm, the schedulability test is done by checking the information recorded with the two adjacent tasks. Since TWQ is sorted, locating τ's insertion point takes O(log(n)) time, and so do the functions getPredecessor(τ) and getSuccessor(τ). Function adjust(τ) runs in O(N) time (see Section 4.3) and it is only invoked when TWQ is empty. The time complexity of function updateSlacks is O(n). Therefore, algorithm AC-FAST has a linear, i.e., O(max(N, n)), time complexity.

4.2 Dispatcher

The dispatching algorithm is rather straightforward. When a processing node and the head node become available, the dispatcher takes the first task τ(A, σ, D) in TWQ, partitions the task and sends a subtask of size σ′ to the node, where

σ′ = min((A + D − CurrentTime)/(Cms + Cps), σ).

The remaining portion of the task, τ(A, σ − σ′, D), is left in TWQ. As we can see, the dispatcher chooses a proper size σ′ to guarantee that the dispatched subtask completes no later than the task's absolute deadline A + D. Following the algorithm, all subtasks of a given task complete at the task absolute deadline, except for the last one, which may not be big enough to occupy the node until the task deadline. By dispatching the task as soon as the resources become available and letting the task occupy the node until the task deadline, the dispatcher allocates the minimum number of nodes to each task.
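The chunk-sizing rule can be sketched as follows; the function name and the example cost values are ours, not the paper's:

```python
def subtask_size(A: float, D: float, sigma: float,
                 now: float, cms: float, cps: float) -> float:
    """Largest chunk that can be transmitted and computed by the absolute
    deadline A + D, capped by the remaining load sigma (the min(., .) rule)."""
    return min((A + D - now) / (cms + cps), sigma)

# A node frees up at time 4; the task (A=0, D=12) has 5 units of load left.
# (0 + 12 - 4) / (1 + 3) = 2 units fit before the deadline, so dispatch 2
# and leave sigma - 2 = 3 units in TWQ.
chunk = subtask_size(A=0.0, D=12.0, sigma=5.0, now=4.0, cms=1.0, cps=3.0)
print(chunk)  # 2.0
```

A chunk sized this way occupies its node exactly until the task's absolute deadline, which is what lets the dispatcher get away with the minimum number of nodes.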

To illustrate with an example, if two tasks τ1 and τ2 are put into TWQ, from the admission controller's point of view, they will execute one by one using all nodes of the cluster (see Figure 2a); in reality, they are dispatched and executed as shown in Figure 2b, occupying the minimum numbers of nodes needed to meet their deadline requirements.

4.3 Admission Controller Adjustment

As discussed in previous sections, the admission controller assumes a different schedule than the one adopted by the dispatcher. If TWQ is not empty, the resources are


Algorithm 1 AC-FAST(τ(A, σ, D), TWQ)

1: // check head node's available time
2: if (τ.(A + D) ≤ P0.AvailableTime + τ.σCms) then
3:   return false
4: end if
5: τp = getPredecessor(τ)
6: τs = getSuccessor(τ)
7: τ.E = E(τ.σ, N)
8: if (τs ≠ null && τ.E > τs.slackmin) then
9:   return false
10: end if
11: if (τp == null) then
12:   τ.S = τ.A
13: else
14:   τ.S = τp.C
15:   if (TWQ == ∅) then
16:     adjust(τ)
17:   end if
18:   τ.S = max(τ.S, τ.A)
19: end if
20: if τ.(A + D − S) < τ.E then
21:   return false
22: else
23:   τ.slack = τ.(A + D − S − E)
24:   τ.C = τ.(S + E)
25:   TWQ.insert(τ)
26:   updateSlacks(τ, TWQ)
27:   for (τi ∈ TWQ && τi.(A + D) > τ.(A + D)) do
28:     τi.C += τ.E
29:   end for
30:   return true
31: end if

Algorithm 2 updateSlacks(τ(A, σ, D), TWQ)

1: for (τi ∈ TWQ) do
2:   if (τi.(A + D) > τ.(A + D)) then
3:     τi.slack = τi.slack − τ.E
4:   end if
5: end for
6: i = TWQ.length
7: τi.slackmin = τi.slack
8: for (i = TWQ.length − 1; i ≥ 1; i−−) do
9:   τi.slackmin = min(τi.slack, τi+1.slackmin)
10: end for
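Algorithm 2 amounts to a backward suffix-minimum scan over the queue. The following Python sketch of ours assumes a simple dict-based representation of TWQ entries:

```python
def update_slacks(twq, new_exec_time, new_abs_deadline):
    """Sketch of Algorithm 2 (updateSlacks). `twq` is the EDF-ordered waiting
    queue; each entry is a dict with 'abs_deadline' and 'slack'. 'slackmin'
    is (re)computed as a suffix minimum of 'slack' over the queue."""
    # Lines 1-5: tasks due after the new task lose new_exec_time of slack,
    # since the new task executes before them under EDF.
    for t in twq:
        if t['abs_deadline'] > new_abs_deadline:
            t['slack'] -= new_exec_time
    # Lines 6-10: slackmin[i] = min(slack[i], slackmin[i+1]), scanned backwards.
    running = float('inf')
    for t in reversed(twq):
        running = min(running, t['slack'])
        t['slackmin'] = running

twq = [{'abs_deadline': 10.0, 'slack': 5.0},
       {'abs_deadline': 20.0, 'slack': 3.0}]
update_slacks(twq, new_exec_time=2.0, new_abs_deadline=12.0)
print([t['slackmin'] for t in twq])  # [1.0, 1.0]
```

Both loops are single passes over TWQ, which is where the O(n) bound on updateSlacks comes from.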

always utilized. In this case, the admission controller can make correct decisions assuming the ANA policy without detailed knowledge of the system. The admitted tasks are dispatched following the MNA policy and are always successfully completed by their deadlines. However, if TWQ is empty, some resources may be idle until the next task arrival. At that point, the admission controller has to know the system status so that it takes resource idleness into account to make correct admission control decisions.

We illustrate this problem in Figure 3. τ1 arrives at time 0. The admission controller accepts it and estimates it to complete at time 7 (Figure 3a). However, because τ1 has a loose deadline, the dispatcher does not allocate

Figure 2: An Example Scenario. (a) Admission Controller's View; (b) Actual Task Execution.

Figure 3: An Illustration of the Problem. (a) Admission Controller's View; (b) An Incorrect Task Execution.

all four nodes but the minimum number, one node, to τ1, and completes it at time 20 (Figure 3b). Task τ2 arrives at an empty TWQ at time 6 with an absolute deadline of 14. The nodes P2, P3, P4 are idle during the time interval [4, 6]. If the admission controller were not to consider this resource idleness, it would assume that all four nodes are busy processing τ1 during the interval [4, 6] and are available during the interval [7, 14]. Thus, it would wrongly conclude that τ2 can be finished with all four nodes before its deadline. However, if τ2 were accepted, the dispatcher could not allocate all four nodes to τ2 at time 6, because node P1 is still busy processing τ1. With just three nodes available during the interval [6, 20], τ2 cannot complete until time 15 and misses its deadline.

To solve this problem, when a new task arrives at an empty TWQ, the admission controller invokes Algorithm 3 to compute the idle time and make a proper adjustment. The algorithm first computes the workload (σidle) that could have been processed using the idled resources (lines 1-6). According to Eq (1), we know that with all N nodes it takes w = (1 − β)/(1 − β^N) · σidle(Cms + Cps) time units to execute the workload σidle (line 7). To consider this idle time effect, the admission controller inserts an


Algorithm 3 adjust(τ)

1: TotalIdle = 0
2: for (i = 1; i ≤ N; i++) do
3:   r = max(Pi.AvailableTime, P0.AvailableTime)
4:   TotalIdle += max(A − r, 0)
5: end for
6: σidle = TotalIdle/(Cms + Cps)
7: w = (1 − β)/(1 − β^N) · σidle(Cms + Cps)
8: τ.S += w

idle task of size σidle before τ and postpones τ's start time by w (line 8).
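Algorithm 3 can be sketched in a few lines; representing the nodes' AvailableTime values as a plain list is our assumption, as are the example numbers:

```python
def adjust(arrival_time, node_avail, head_avail, cms, cps):
    """Sketch of Algorithm 3 (adjust). `node_avail` lists the AvailableTime
    of P1..PN, `head_avail` is P0's. Returns the postponement w to add to
    the new task's start time."""
    total_idle = 0.0
    for avail in node_avail:                   # lines 2-5
        r = max(avail, head_avail)
        total_idle += max(arrival_time - r, 0.0)
    sigma_idle = total_idle / (cms + cps)      # line 6
    b = cps / (cms + cps)                      # beta, Eq. (2)
    n = len(node_avail)
    # line 7: time for all n nodes to "execute" the idled workload, Eq. (1)
    return (1 - b) / (1 - b ** n) * sigma_idle * (cms + cps)

# Degenerate N = 1 case: the postponement equals the node's idle time itself.
print(adjust(6.0, [2.0], 0.0, cms=1.0, cps=3.0))  # 4.0
```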

4.4 Correctness of the Algorithm

In this section, we prove that all tasks that have been admitted by the admission controller can be dispatched successfully by the dispatcher and finished before their deadlines. For simplicity, in this section, we use Ai, σi, and Di to respectively denote the arrival time, the data size, and the relative deadline of task τi. We prove by contradiction that no admitted task misses its deadline. Let us assume τm is the first task in TWQ that misses its deadline at dm = Am + Dm. We also assume that tasks τ0, τ1, · · · , τm−1 have been executed before τm. Among these preceding tasks, let τb be the latest that has arrived at an empty cluster. That is, tasks τb+1, τb+2, · · · , τm have all arrived at times when there is at least one task executing in the cluster. Since only tasks that are assumed to finish by their deadlines are admitted, tasks execute in EDF order, and τb, τb+1, · · · , τm are all admitted tasks, we know that the admission controller has assumed that all these tasks can complete by τm's deadline dm. Let σAN denote the total workload that has been admitted to execute in the time interval [Ab, dm]. We have,

σAN ≥ ∑_{i=b}^{m} σi.   (5)

Since all dispatched subtasks are guaranteed to finish by their deadlines (Section 4.2), task τm missing its deadline means that at time dm a portion of τm is still in TWQ. That is, the total workload σMN dispatched in the time interval [Ab, dm] must be less than ∑_{i=b}^{m} σi. With Eq (5), we have,

σAN > σMN.   (6)

Next, we prove that Eq (6) cannot hold.

As mentioned earlier, tasks τb+1, τb+2, · · · , τm have all arrived at times when there is at least one task executing in the cluster. However, at their arrival times, TWQ could be empty. As described in Section 4.3, when a task arrives at an empty TWQ, an adjustment function is invoked to allow the admission controller to take resource idleness into account. Following the function (Algorithm 3), the admission controller properly postpones the new task τ's start time by w, which is equivalent to the case where the admission controller “admits” and inserts before τ an idle task τidle of size σidle that completely “occupies” the idled resources present in the cluster. Let us assume that τ′1, τ′2, · · · , τ′v are the idle tasks “admitted” by the admission controller adjustment function to “complete” in the interval [Ab, dm].

We define σAN as the total workload, including thoseσi, i = 1, 2, · · · , v of idle tasks, that has been admitted toexecute in the time interval [Ab, dm]. σMN is the totalworkload, including those σi, i = 1, 2, · · · , v of idle tasks,that has been dispatched in the time interval [Ab, dm].Then, we have,

σAN = σAN +v∑

i=1

σi, (7)

σMN = σMN +v∑

i=1

σi. (8)

Next, we prove that σ′MN ≥ σ′AN is true. Due to the space limitation, in this paper we only provide a sketch of the proof. For the detailed derivation and proof, please refer to our technical report [21].

Computation of σ′AN: σ′AN is the sum of workloads, including those ∑_{i=1}^{v} σ′i of the idle tasks, that are admitted to execute in the time interval [Ab, dm]. To compute σ′AN, we leverage the following lemma from [21].

Lemma 4.1 [21] For an admission controller that assumes the ANA policy, if h admitted tasks are merged into one task T, task T's execution time is equal to the sum of all h tasks' execution times. That is,

E(∑_{i=1}^{h} σi, N) = ∑_{i=1}^{h} E(σi, N).  (9)

Figure 4: Merging Multiple Tasks into One Task.
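As a sanity check, the additivity stated in Lemma 4.1 follows from the execution-time function being linear in the load size σ. The sketch below assumes the cost-model form implied by Eq (1) and Eq (10), E(σ, N) = σ · (1−β)/(1−β^N) · (Cps + Cms); the values of β, Cms, Cps, and the task sizes are illustrative only.

```python
# Numeric check of Lemma 4.1: merging h admitted tasks into one task yields
# the same total execution time, because E is linear in sigma.
# Assumed cost model (per Eq (1)/Eq (10)); parameter values are hypothetical.

def execution_time(sigma, n_nodes, beta=0.5, cms=1.0, cps=1000.0):
    """E(sigma, N) = sigma * (1 - beta) / (1 - beta**N) * (Cps + Cms)."""
    return sigma * (1 - beta) / (1 - beta ** n_nodes) * (cps + cms)

sizes = [120.0, 75.5, 4.5]  # h = 3 hypothetical admitted task sizes
N = 256

merged = execution_time(sum(sizes), N)                  # E(sum sigma_i, N)
separate = sum(execution_time(s, N) for s in sizes)     # sum E(sigma_i, N)
assert abs(merged - separate) < 1e-9
```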

Since σ′AN = σAN + ∑_{i=1}^{v} σ′i, according to the lemma, we have E(σ′AN, N) = E(σAN, N) + ∑_{i=1}^{v} E(σ′i, N), which implies that the sum of workloads σ′AN admitted to execute in the interval [Ab, dm] equals the size of the single workload that can be processed by the N nodes in [Ab, dm]. According to Eq (1), we have

σ′AN = (dm − Ab) / ( (1−β)/(1−β^N) · (Cps + Cms) ).  (10)

In addition, σ′AN is the sum of workloads assumed to be assigned to each of the N nodes in the interval [Ab, dm]. We use σpk to denote the workload fraction assumed to be processed by node Pk in the interval [Ab, dm]. Thus, as shown in Figure 5, we have,

σ′AN = ∑_{k=1}^{N} σpk.  (11)

Figure 5: All Node Assignment Scenario.

Computation of σ′MN: σ′MN denotes the total workload processed in the time interval [Ab, dm]. With idle tasks τ′1, τ′2, ..., τ′v completely "occupying" the idled resources during the interval [Ab, dm], there are no gaps between "task executions" and the cluster is always "busy" processing σ′MN = σMN + ∑_{i=1}^{v} σ′i. Similar to computing σ′AN, we calculate how much workload is processed by each of the N nodes in the given interval. We use σ′pk to denote the sum of workloads that are processed by node Pk in the interval [Ab, dm]. We have,

σ′MN = ∑_{k=1}^{N} σ′pk.  (12)

As shown in [21], ∑_{k=1}^{N} σ′pk ≥ ∑_{k=1}^{N} σpk. Thus, we have,

σ′MN ≥ σ′AN.  (13)

With Equations (13), (7), and (8), we conclude that σMN ≥ σAN is true, which contradicts Eq (6). Therefore, the original assumption does not hold and no task misses its deadline.
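Spelling out this final step (writing σ′ for the totals that include the idle tasks), Eqs (7) and (8) add the same idle workload ∑_{i=1}^{v} σ′i to both totals, so the inequality of Eq (13) carries over to the unprimed totals:

```latex
\sigma'_{MN} \ge \sigma'_{AN}
\;\Longrightarrow\;
\sigma_{MN} + \sum_{i=1}^{v}\sigma'_i \;\ge\; \sigma_{AN} + \sum_{i=1}^{v}\sigma'_i
\;\Longrightarrow\;
\sigma_{MN} \ge \sigma_{AN},
```

which contradicts σAN > σMN (Eq (6)).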

5 Evaluation

In the previous section, we presented an efficient divisible load scheduling algorithm. Since the algorithm is based on EDF scheduling and eliminates IITs, we use FAST-EDF-IIT to denote it. The EDF-based algorithm proposed in [18] is represented by EDF-IIT-1 and that in [8] by EDF-IIT-2. This section evaluates their performance.

We have developed a discrete simulator, called DLSim, to simulate real-time divisible load scheduling in clusters. This simulator, implemented in Java, is a component-based tool, whose main components include a workload generator, a cluster configuration component, a real-time scheduler, and a logging component. For every simulation, three parameters, N, Cms, and Cps, are specified for a cluster.

5.1 Real-Time Performance

We first evaluate the algorithm's real-time performance. The workload is generated following the same approach as described in [18, 19]; due to the space limitation, we choose not to repeat the details here. Similar to the work by Lee et al. [17], we adopt a metric SystemLoad = E(Avgσ, 1) · λ/N to represent how loaded a cluster is for a simulation, where Avgσ is the average task data size, E(Avgσ, 1) is the execution time of running an average-size task on a single node (see Eq (1) for E's calculation), and λ/N is the average task arrival rate per node. To evaluate the real-time performance, we use two metrics: Task Reject Ratio and System Utilization. Task reject ratio is the ratio of the number of task rejections to the number of task arrivals. The smaller the ratio, the better the performance. In contrast, the greater the system utilization, the better the performance.

For the simulations in this subsection, we assume that the cluster is lightly loaded and thus we can ignore the scheduling overheads. In these simulations, we observe that all admitted tasks complete successfully by their deadlines. Figure 6 illustrates the algorithms' Task Reject Ratio and System Utilization. As we can see, among the three algorithms, EDF-IIT-2 provides the best real-time performance, achieving the lowest Task Reject Ratio and the highest System Utilization, while FAST-EDF-IIT performs better than EDF-IIT-1. The reason that FAST-EDF-IIT does not have the best real-time performance is its admission controller's slightly pessimistic estimates of the data transmission blocking time (Section 4). Focusing on reducing the scheduling overhead, FAST-EDF-IIT trades real-time performance for algorithm efficiency. In the next subsection, we use experimental data to demonstrate that in busy clusters with long task waiting queues, scheduling overheads become significant and inefficient algorithms like EDF-IIT-1 and EDF-IIT-2 can no longer be applied, while FAST-EDF-IIT wins for its huge advantage in scheduling efficiency.

5.2 Scheduling Overhead

A second group of simulations is carried out to evaluate the overhead of the scheduling algorithms. Before discussing the simulations, we first present some typical cluster workloads, which lay out the rationale for our simulation design.

In Figure 1, we have shown the TWQ status of a cluster at University of California, San Diego. From the curves, we observe that 1) waiting tasks could increase from 3,000 to 17,000 in one hour (Figure 1a) and from 15,000 to 25,000 in about three hours (Figure 1b), and 2) during busy hours, there could be on average more than 5,000 and a maximum of 37,000 tasks waiting in a cluster. Similarly busy and bursty workloads have also been observed in other clusters (Figure 7) and are quite common phenomena.³ Based on these typical workload patterns, we design our simulations and evaluate the algorithms' scheduling overhead.

³ To illustrate the intensity and commonness of the phenomena, Figures 1 and 7 show the TWQ statistics on an hourly and a daily basis, respectively.


[Two panels plotting Task Reject Ratio and System Utilization versus System Load for FAST-EDF-IIT, EDF-IIT-1, and EDF-IIT-2; N=256, Cms=1, Cps=1000, Avg σ=200, DCRatio=2.]

Figure 6: Algorithm's Real-Time Performance.

[Two panels plotting Number of Waiting Tasks versus Time (days) over a month: (a) Red Cluster at Univ. of Nebraska - Lincoln; (b) GLOW Cluster at Univ. of Wisconsin.]

Figure 7: Typical Cluster Status.

In this group of simulations, the following parameters are set for the cluster: N=512 or 1024, Cms=1, and Cps=1000. We choose to simulate modest-size clusters (i.e., those with 512 or 1024 nodes). According to our analysis, the time complexities of algorithms FAST-EDF-IIT, EDF-IIT-1, and EDF-IIT-2 are respectively O(max(N, n)), O(nN³), and O(nN log(N)). Therefore, if we show by simulation data that in modest-size clusters of N=512 or 1024 nodes FAST-EDF-IIT leads to much lower overheads, then we know for sure that it will be even more advantageous in larger clusters like those listed in Table 1.
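To see why the complexity gap matters at these sizes, the sketch below compares the three asymptotic operation counts for a 512-node cluster with 3,000 waiting tasks. Constant factors are ignored; this is an illustration of the stated complexities, not a measurement.

```python
import math

# Asymptotic operation counts (constant factors ignored) for the three
# algorithms' stated complexities: O(max(N, n)), O(n*N^3), O(n*N*log N).
def ops_fast_edf_iit(n, N):
    return max(N, n)

def ops_edf_iit_1(n, N):
    return n * N ** 3

def ops_edf_iit_2(n, N):
    return n * N * math.log2(N)

N, n = 512, 3000
counts = {
    "FAST-EDF-IIT": ops_fast_edf_iit(n, N),
    "EDF-IIT-1": ops_edf_iit_1(n, N),
    "EDF-IIT-2": ops_edf_iit_2(n, N),
}
# EDF-IIT-1's count dwarfs the others; FAST-EDF-IIT's is smallest by far.
```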

To create cases where we have a large number of tasks in TWQ, we first submit a huge task to the cluster. Since it takes the cluster a long time to finish processing this one task, we can submit thousands of other tasks and get them queued up in TWQ. As new tasks arrive, the TWQ length builds up. In order to control the number of waiting tasks and create the same TWQ lengths for the three scheduling algorithms, tasks are assigned long deadlines so that they will all be admitted and put into TWQ. That is, in this group of simulations, we force task reject ratios to be 0 for all three algorithms so that their measured scheduling overheads are comparable.

We first measure the average scheduling time of the first n tasks, where n is in the range [100, 3000]. The simulation results for the 512-node cluster are shown in Table 2 and Figure 8. From the data, we can see that for the first 3,000 tasks, FAST-EDF-IIT spends an average of 48.87 ms to admit a task, while EDF-IIT-1 and EDF-IIT-2 average respectively 6206.91 ms and 1494.91 ms, 127 and 30 times longer than FAST-EDF-IIT.

Table 2: 512-Node Cluster: First n Tasks' Average Scheduling Time (ms).

n      FAST-EDF-IIT   EDF-IIT-1   EDF-IIT-2
300    0.96           410.44      151.32
1000   4.84           1321.08     494.07
2000   20.46          3119.76     988.95
3000   48.87          6206.91     1494.91


Figure 8: 512-Node Cluster: Algorithm's Real-Time Scheduling Overhead: First n Tasks' Average Scheduling Time.

Because the scheduling overhead increases with the number of tasks in TWQ, we then measure the task scheduling time after n tasks are queued up in TWQ. Table 3 shows the average scheduling time of 100 new tasks after there are already n tasks in the TWQ of the 512-node cluster. The corresponding curves are in Figure 9. As shown, when there are 3,000 waiting tasks, FAST-EDF-IIT takes 157 ms to admit a task, while EDF-IIT-1 and EDF-IIT-2 respectively spend about 31 and 3 seconds to make an admission control decision.

Table 3: 512-Node Cluster: Average Task Scheduling Time (ms) after n Tasks in TWQ.

n      FAST-EDF-IIT   EDF-IIT-1   EDF-IIT-2
300    1.71           850.01      349.22
1000   16.25          3006.01     1034.21
2000   67.24          7536.32     2030.48
3000   157            31173.86    3050.86


Figure 9: 512-Node Cluster: Algorithm's Real-Time Scheduling Overhead: Average Scheduling Time after n Tasks in TWQ.

Now, let us examine the simulation results and analyze their implications for real-world clusters. Figure 1a shows that the length of TWQ in a cluster could increase from 3,000 to 17,000 in an hour. From Table 3, we know that for EDF-IIT-1 and EDF-IIT-2, it takes respectively more than 31 and 3 seconds to admit a task when the TWQ length is over 3,000. Therefore, to schedule the 14,000 new tasks that arrived in that hour, it takes more than 7,000 and 700 minutes respectively. Even if we assume that the last of the 14,000 tasks arrived in the last minute of the hour, its user has to wait for at least 700−60 = 640 minutes to know if the task is admitted or not. On the other hand, if FAST-EDF-IIT is applied, it takes a total of 37 minutes to make admission control decisions on the 14,000 tasks. This example demonstrates that our new algorithm is much more efficient than existing approaches and is the only one of the three that can be applied in busy clusters. If we analyze the algorithms using the data in Figure 1b, where waiting tasks increase from 15,000 to 25,000, the difference in scheduling time is even more striking.
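The back-of-envelope numbers above can be reproduced from Table 3's per-task scheduling times, using 157 ms, 31 s, and 3 s as the per-task admission costs at a TWQ length of about 3,000:

```python
# Reproduce the Section 5.2 estimates: time to admit the 14,000 tasks that
# arrive while TWQ grows from 3,000 to 17,000 (Figure 1a), using the
# per-task admission times from Table 3 at TWQ length ~3,000.
per_task_seconds = {"FAST-EDF-IIT": 0.157, "EDF-IIT-1": 31.0, "EDF-IIT-2": 3.0}
new_tasks = 14_000

total_minutes = {name: t * new_tasks / 60 for name, t in per_task_seconds.items()}
# EDF-IIT-1: > 7,000 minutes; EDF-IIT-2: 700 minutes; FAST-EDF-IIT: ~37 minutes
```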

The simulation results for the 1024-node cluster are reported in Table 4 and Figure 10. Due to EDF-IIT-1's huge overhead and cubic complexity with respect to the number of nodes in the cluster, a simulation for a busy cluster with a thousand nodes would take weeks, with no new knowledge to be learned from the experiment. Therefore, on the 1024-node cluster, we only simulate EDF-IIT-2 and FAST-EDF-IIT. For easy comparison,

Table 4: First n Tasks' Average Scheduling Time (ms).

       FAST-EDF-IIT        EDF-IIT-2
n      N=1024    N=512     N=1024    N=512
300    1.01      0.96      363.29    151.32
1000   4.90      4.84      1545.51   494.07
2000   21.1      20.46     3089.6    988.95
3000   50        48.87     4923.91   1494.91


Figure 10: Algorithm's Real-Time Scheduling Overhead: First n Tasks' Average Scheduling Time.

Table 4 and Figure 10 include not only data for the 1024-node cluster but also those for the 512-node cluster. As shown by the simulation results, when the cluster size increases from 512 to 1024 nodes, the scheduling overhead of FAST-EDF-IIT increases only slightly. FAST-EDF-IIT has a time complexity of O(max(N, n)). Therefore, for busy clusters with thousands of tasks in TWQ (i.e., n in the range [3000, 17000]), the increase in cluster size does not lead to a big increase in FAST-EDF-IIT's overhead. In contrast, EDF-IIT-2, with a time complexity of O(nN log(N)), has a much larger scheduling overhead on the 1024-node cluster than on the 512-node cluster.

6 Conclusion

This paper presents a novel algorithm for scheduling real-time divisible loads in clusters. The algorithm assumes a different scheduling rule in the admission controller than that adopted by the dispatcher. Since the admission controller no longer generates an exact schedule, the scheduling overhead is reduced significantly. Unlike the previous approaches, whose time complexities are O(nN³) [18] and O(nN log(N)) [8], our new algorithm has a time complexity of O(max(N, n)). We prove that the proposed algorithm is correct, provides admitted tasks real-time guarantees, and utilizes cluster resources well. We experimentally compare our algorithm with existing approaches. Simulation results demonstrate that it scales well and can schedule large numbers of tasks efficiently. With growing cluster sizes, we expect our algorithm to be even more advantageous.


References

[1] T. F. Abdelzaher and V. Sharma. A synthetic utilization bound for aperiodic tasks with resource requirements. In Proc. of 15th Euromicro Conference on Real-Time Systems, pages 141–150, Porto, Portugal, July 2003.

[2] S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman. Basic local alignment search tool. Journal of Molecular Biology, pages 403–410, 1990.

[3] A. Amin, R. Ammar, and A. E. Dessouly. Scheduling real time parallel structure on cluster computing with possible processor failures. In Proc. of 9th IEEE International Symposium on Computers and Communications, pages 62–67, July 2004.

[4] R. A. Ammar and A. Alhamdan. Scheduling real time parallel structure on cluster computing. In Proc. of 7th IEEE International Symposium on Computers and Communications, pages 69–74, Taormina, Italy, July 2002.

[5] J. Anderson and A. Srinivasan. Pfair scheduling: Beyond periodic task systems. In Proc. of the 7th International Conference on Real-Time Computing Systems and Applications, pages 297–306, Cheju Island, South Korea, December 2000.

[6] ATLAS (A Toroidal LHC Apparatus) Experiment, CERN (European Lab for Particle Physics). ATLAS web page. http://atlas.ch/.

[7] V. Bharadwaj, T. G. Robertazzi, and D. Ghose. Scheduling Divisible Loads in Parallel and Distributed Systems. IEEE Computer Society Press, Los Alamitos, CA, 1996.

[8] S. Chuprat and S. Baruah. Scheduling divisible real-time loads on clusters with varying processor start times. In 14th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA '08), pages 15–24, Aug 2008.

[9] S. Chuprat, S. Salleh, and S. Baruah. Evaluation of a linear programming approach towards scheduling divisible real-time loads. In International Symposium on Information Technology, pages 1–8, Aug 2008.

[10] Compact Muon Solenoid (CMS) Experiment for the Large Hadron Collider at CERN (European Lab for Particle Physics). CMS web page. http://cmsinfo.cern.ch/Welcome.html/.

[11] M. L. Dertouzos and A. K. Mok. Multiprocessor online scheduling of hard-real-time tasks. IEEE Trans. Softw. Eng., 15(12):1497–1506, 1989.

[12] M. Drozdowski and P. Wolniewicz. Experiments with scheduling divisible tasks in clusters of workstations. In 6th International Euro-Par Conference on Parallel Processing, pages 311–319, August 2000.

[13] M. Eltayeb, A. Dogan, and F. Ozguner. A data scheduling algorithm for autonomous distributed real-time applications in grid computing. In Proc. of 33rd International Conference on Parallel Processing, pages 388–395, Montreal, Canada, August 2004.

[14] Y. Etsion and D. Tsafrir. A short survey of commercial cluster batch schedulers. Technical Report 2005-13, School of Computer Science and Engineering, the Hebrew University, Jerusalem, Israel, May 2005.

[15] S. Funk and S. Baruah. Task assignment on uniform heterogeneous multiprocessors. In Proc. of 17th Euromicro Conference on Real-Time Systems, pages 219–226, July 2005.

[16] D. Isovic and G. Fohler. Efficient scheduling of sporadic, aperiodic, and periodic tasks with complex constraints. In Proc. of 21st IEEE Real-Time Systems Symposium, pages 207–216, Orlando, FL, November 2000.

[17] W. Y. Lee, S. J. Hong, and J. Kim. On-line scheduling of scalable real-time tasks on multiprocessor systems. Journal of Parallel and Distributed Computing, 63(12):1315–1324, 2003.

[18] X. Lin, Y. Lu, J. Deogun, and S. Goddard. Real-time divisible load scheduling with different processor available times. In Proceedings of the 2007 International Conference on Parallel Processing (ICPP 2007).

[19] X. Lin, Y. Lu, J. Deogun, and S. Goddard. Real-time divisible load scheduling for cluster computing. In Proceedings of the 13th IEEE Real-Time and Embedded Technology and Applications Symposium, pages 303–314, Bellevue, WA, April 2007.

[20] A. Mamat, Y. Lu, J. Deogun, and S. Goddard. Real-time divisible load scheduling with advance reservations. In 20th Euromicro Conference on Real-Time Systems, pages 37–46, July 2008.

[21] A. Mamat, Y. Lu, J. Deogun, and S. Goddard. An efficient algorithm for real-time divisible load scheduling. Technical Report TR-UNL-CSE-2009-0008, University of Nebraska-Lincoln, 2009.

[22] G. Manimaran and C. S. R. Murthy. An efficient dynamic scheduling algorithm for multiprocessor real-time systems. IEEE Trans. on Parallel and Distributed Systems, 9(3):312–319, 1998.

[23] P. Pop, P. Eles, Z. Peng, and V. Izosimov. Schedulability-driven partitioning and mapping for multi-cluster real-time systems. In Proc. of 16th Euromicro Conference on Real-Time Systems, pages 91–100, July 2004.

[24] X. Qin and H. Jiang. Dynamic, reliability-driven scheduling of parallel real-time jobs in heterogeneous systems. In Proc. of 30th International Conference on Parallel Processing, pages 113–122, Valencia, Spain, September 2001.

[25] K. Ramamritham, J. A. Stankovic, and P.-F. Shiah. Efficient scheduling algorithms for real-time multiprocessor systems. IEEE Trans. on Parallel and Distributed Systems, 1(2):184–194, April 1990.

[26] K. Ramamritham, J. A. Stankovic, and W. Zhao. Distributed scheduling of tasks with deadlines and resource requirements. IEEE Transactions on Computers, 38(8):1110–1123, 1989.

[27] T. G. Robertazzi. Ten reasons to use divisible load theory. Computer, 36(5):63–68, 2003.

[28] D. Swanson. Personal communication. Director, UNL Research Computing Facility (RCF) and UNL CMS Tier-2 Site, August 2005.

[29] B. Veeravalli, D. Ghose, and T. G. Robertazzi. Divisible load theory: A new paradigm for load scheduling in distributed systems. Cluster Computing, 6(1):7–17, 2003.

[30] L. Zhang. Scheduling algorithm for real-time applications in grid environment. In Proc. of IEEE International Conference on Systems, Man and Cybernetics, October 2002.