An Adaptive Scheduling Algorithm for Dynamic Heterogeneous Hadoop Systems

Aysan Rasooli, Douglas G. Down

Department of Computing and Software, McMaster University

{rasooa, downd}@mcmaster.ca

Abstract

The MapReduce and Hadoop frameworks were designed to support efficient large scale computations. There has been growing interest in employing Hadoop clusters for diverse applications. A large number of (heterogeneous) clients using the same Hadoop cluster can result in tensions between the various performance metrics by which such systems are measured. On the one hand, from the service provider side, the utilization of the Hadoop cluster will increase. On the other hand, from the client perspective, the parallelism in the system may decrease (with a corresponding degradation in metrics such as mean completion time). An efficient scheduling algorithm should strike a balance between utilization and parallelism in the cluster to address performance metrics such as fairness and mean completion time. In this paper, we propose a new Hadoop cluster scheduling algorithm, which uses system information such as estimated job arrival rates and mean job execution times to make scheduling decisions. The objective of our algorithm is to improve the mean completion time of submitted jobs. In addition to addressing this concern, our algorithm provides competitive performance under fairness and locality metrics (with respect to other well-known Hadoop scheduling algorithms, Fair Sharing and FIFO). This approach can be efficiently applied in heterogeneous clusters, in contrast to most work on Hadoop cluster scheduling algorithms, which assumes homogeneous clusters. Using simulation, we demonstrate that our algorithm is a very promising candidate for deployment in real systems.

Copyright © 2011 Aysan Rasooli and Dr. Douglas G. Down. Permission to copy is hereby granted provided the original copyright notice is reproduced in copies made.

1 Introduction

Cloud computing provides massive clusters for efficient large scale computation and data analysis. MapReduce [5] is a well-known programming model which was first designed for improving the performance of large batch jobs on cloud computing systems. However, there is growing interest in employing MapReduce and its open-source implementation, Hadoop, for various types of jobs. This leads to sharing a single Hadoop cluster between multiple users, who run a mix of long batch jobs and short interactive queries on a shared data set.

Sharing a Hadoop cluster between multiple users has several advantages, such as statistical multiplexing (lowering costs), data locality (running computation where the data is), and increasing the utilization of the resources. In past years, companies have used Hadoop clusters for executing specific applications; for instance, Facebook uses a Hadoop cluster for analyzing usage patterns to improve site design, and for spam detection, data mining, and ad optimization [10].

Assigning a new type of job to the current workload mix on a Hadoop cluster may severely degrade the performance of the system, which may not be tolerable for certain applications. Moreover, heterogeneity is a neglected issue in most current Hadoop systems, which can also lead to poor performance. Here, heterogeneity can be with respect to both jobs and resources.

The Hadoop scheduler is the centrepiece of a Hadoop system. Desired performance levels can be achieved by proper submission of jobs to resources. The primary Hadoop scheduling algorithms, such as the FIFO algorithm and the Fair-sharing algorithm, are simple algorithms which use small amounts of system information to make quick scheduling decisions; their main concern is to quickly multiplex the incoming jobs on the available resources. However, a scheduling decision based on a small amount of system information has disadvantages, such as reduced locality and neglect of the heterogeneity of the system. As a result, the primary Hadoop scheduling algorithms may not be good choices for heterogeneous systems. In a heterogeneous Hadoop system, increasing parallelism without considering the differences between the various resources and jobs in the system can result in poor performance. In light of such considerations, it is useful to explore the possible performance gains from more sophisticated algorithms.

Gathering more system information can have a significant impact on making better scheduling decisions. It is possible to gather some Hadoop system information [7] which can be used in scheduling decisions. Research at UC Berkeley [2] provides a means to estimate the mean job execution time based on the structure of the job and the number of map and reduce tasks in each job. Moreover, in most Hadoop systems, multiple types of jobs repeat according to various patterns; for example, spam detector applications run on the Facebook Hadoop cluster every night. Therefore, it may also be possible to estimate the arrival rates of job types in some Hadoop systems.

In this paper, we introduce a Hadoop scheduling algorithm which uses this system information to make appropriate scheduling decisions. The proposed algorithm takes into account the heterogeneity of both resources and jobs in assigning jobs to resources. Using the system information, it classifies the jobs into classes and finds a matching of the job classes to the resources, based on the requirements of the job classes and the features of the resources. At the time of a scheduling decision, the algorithm uses the matching of resources and classes, and considers the priority, required minimum share, and fair share of the users. Our proposed algorithm is dynamic: it updates its decisions based on changes in the parameters of the system.

We extend a Hadoop simulator, MRSIM [6], to evaluate our proposed algorithm. We implement the four most common performance metrics for Hadoop systems: locality, fairness, satisfaction of the users' minimum shares, and mean completion time of jobs. We compare the performance of our algorithm with two commonly used Hadoop scheduling algorithms, the FIFO algorithm and the Fair-sharing algorithm [9]. The results show that our proposed algorithm has significantly better performance in reducing the mean completion time and satisfying the required minimum shares. Moreover, its performance on the Locality and Fairness metrics is very competitive with the other two algorithms. To the best of our knowledge, no other Hadoop scheduling algorithm simultaneously considers job and resource heterogeneity. The two main advantages of our proposed algorithm are increasing the utilization of the Hadoop cluster and reducing mean completion times by considering the heterogeneity in the Hadoop system.

The remainder of this paper is organized as follows. In Section 2 we give a brief overview of a Hadoop system. Current Hadoop scheduling algorithms are reviewed in Section 3. Our Hadoop system model is described and formally introduced in Section 4. Then, in Section 5, we formally present the performance metrics of interest. Our proposed Hadoop scheduling algorithm is introduced in Section 6. In Section 7, details of the environment in which we study our algorithm are provided, and we study the performance of our algorithm in various Hadoop systems. Finally, we provide some concluding remarks and discuss possible future work in the last section.

2 Hadoop Systems

Today's large scale data processing applications require thousands of resources. Cloud computing is a paradigm that provides such levels of computing resources. However, to harness the available resources for improving the performance of large applications, the applications must be broken down into smaller pieces of computation that are executed in parallel. MapReduce [5] is a programming model which provides an efficient framework for automatic parallelization and distribution, I/O scheduling, and monitoring the status of large scale computations and data analysis.

Hadoop is an open-source implementation of MapReduce for reliable, scalable, distributed computing. A distributed file system underlying the Hadoop system provides efficient and reliable distributed data storage for applications involving large data sets. Users in Hadoop submit their jobs to the system, where each job consists of map functions and reduce functions. The Hadoop system breaks the submitted jobs into multiple map and reduce tasks. First, Hadoop runs the map tasks on each block of the input, computing key/value pairs from each part of the input. Then, it groups the intermediate values by their keys. Finally, it runs the reduce tasks on each group, which produces the job's final result.
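To make this flow concrete, the following minimal Python sketch mimics the map, group-by-key, and reduce phases on a toy word-count job. It is illustrative only; the function names are our own and are not Hadoop APIs.

```python
from collections import defaultdict

def map_task(block):
    # Emit a (key, value) pair for each word in one input block.
    return [(word, 1) for word in block.split()]

def reduce_task(key, values):
    # Combine all intermediate values that share one key.
    return key, sum(values)

blocks = ["the quick brown fox", "the lazy dog the end"]

# Map phase: one map task per input block.
intermediate = [pair for b in blocks for pair in map_task(b)]

# Shuffle phase: group intermediate values by key.
groups = defaultdict(list)
for key, value in intermediate:
    groups[key].append(value)

# Reduce phase: one reduce call per key group.
result = dict(reduce_task(k, vs) for k, vs in groups.items())
print(result)  # {'the': 3, 'quick': 1, 'brown': 1, ...}
```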

The scheduler is a fundamental component of the Hadoop system. Scheduling in Hadoop is pull based: when a resource has free capacity, it sends a heartbeat to the scheduler. Upon receiving a heartbeat, the scheduler searches through all the queued jobs in the system, chooses a job based on some performance metric(s), and sends one task of the selected job to each free slot on the resource. The heartbeat message contains information such as the number of currently free slots on the resource. Different Hadoop scheduling algorithms consider different performance metrics in making scheduling decisions.
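The heartbeat protocol can be sketched as a small event handler. The class and field names below (Scheduler, Heartbeat, free_slots) are hypothetical, and the FIFO job choice is a placeholder for whatever policy a given scheduler implements.

```python
from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    pending_tasks: list = field(default_factory=list)

@dataclass
class Heartbeat:
    resource: str
    free_slots: int  # reported number of currently free slots

class Scheduler:
    def __init__(self, queued_jobs):
        self.queued_jobs = queued_jobs

    def choose_job(self, resource):
        # Placeholder policy: FIFO over queued jobs with pending tasks.
        # A real scheduler would rank jobs by its performance metric(s).
        for job in self.queued_jobs:
            if job.pending_tasks:
                return job
        return None

    def on_heartbeat(self, hb):
        # Assign one task of a selected job to each free slot.
        assignments = []
        for _ in range(hb.free_slots):
            job = self.choose_job(hb.resource)
            if job is None:
                break
            assignments.append((hb.resource, job.pending_tasks.pop(0)))
        return assignments

sched = Scheduler([Job("sort", ["m1", "m2"]), Job("filter", ["m3"])])
print(sched.on_heartbeat(Heartbeat("R1", free_slots=3)))
# [('R1', 'm1'), ('R1', 'm2'), ('R1', 'm3')]
```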

3 Related Work

MapReduce was initially designed for small teams, for which a simple scheduling algorithm like FIFO can achieve an acceptable performance level. However, experience from deploying Hadoop in large systems shows that basic scheduling algorithms like FIFO can cause severe performance degradation, particularly in systems that share data among multiple users. As a result, the next generation scheduler in Hadoop, Hadoop on Demand (HOD) [3], addresses this issue by setting up private Hadoop clusters on demand. HOD allows users to share a common file system while owning private Hadoop clusters on their allocated nodes. This approach failed in practice because it violated the data locality design of the original MapReduce scheduler, and it resulted in poor system utilization. To address some of these shortcomings, Hadoop recently added a scheduling plug-in framework with two additional schedulers that extend, rather than replace, the original FIFO scheduler.

The additional schedulers are introduced in [9], where they are collectively known as Fair-sharing. In this work, a pool is defined for each user, with each pool consisting of a number of map slots and reduce slots on a resource. Each user can use her pool to execute her jobs. If the pool of a user becomes idle, the slots of the pool are divided among the other users to speed up the other jobs in the system. The Fair-sharing algorithm does not achieve good performance with respect to locality. Therefore, in order to improve data locality, a complementary algorithm for Fair-sharing, called delay scheduling, is introduced in [10]. Under delay scheduling, when Fair-sharing chooses a job for the current free resource and the resource does not contain the required data for that job, scheduling of the chosen job is postponed and the algorithm finds another job. To limit the waiting time of jobs, a threshold is defined; if scheduling of a job is postponed until the threshold is met, the job is submitted to the next free resource. These algorithms can perform much better than Hadoop's default scheduling algorithm (FIFO); however, they do not consider heterogeneous systems in which resources have different capacities and users submit various types of jobs.

In [1], the authors introduce a scheduling algorithm for MapReduce systems to minimize the total completion time while improving the CPU and I/O utilization of the cluster. The algorithm defines Virtual Machines (VMs) and decides how to allocate the VMs to each Hadoop job and to the physical Hadoop resources. The algorithm formulates and solves a constrained optimization problem; this formulation requires a mathematical performance model for the different jobs in the system. The algorithm first runs all job types in the Hadoop system to build the corresponding performance models. Then, assuming these jobs will be submitted multiple times to the Hadoop system, scheduling decisions for each job are made based on the solution of the defined optimization problem. The algorithm assumes that job characteristics do not vary between runs, and also that when a job is to be executed on a resource, all of its required data is placed on that node. The problem with this algorithm is that it cannot make decisions when a new job with new characteristics joins the system. Moreover, the assumption that all of the data required by a job is available on the executing resource, without considering the overhead of transmitting the data, is unrealistic. Furthermore, Hadoop is very I/O intensive, both for file system access and for Map/Reduce scheduling, so virtualization incurs a high overhead.

In [8], a Dynamic Priority (DP) parallel task scheduler is designed for Hadoop, which allows users to control their allocated capacity by dynamically adjusting their budgets. This algorithm prioritizes users based on their spending, and allows capacity distribution across concurrent users to change dynamically based on user preferences. The core of this algorithm is a proportional share resource allocation mechanism that allows users to purchase or be granted a queue priority budget. This budget may be used to set spending rates denoting the willingness to pay a certain amount per Hadoop map or reduce task slot per time unit.

4 Hadoop System Model

The Hadoop system consists of a cluster, which is a group of linked resources. The data in the Hadoop system is organized into files. Users submit jobs to the system, where each job consists of a number of tasks; each task is either a map task or a reduce task. The Hadoop components related to our research are described as follows:

1. The Hadoop system has a cluster. The cluster consists of a set of resources, where each resource has a computation unit and a data storage unit. The computation unit consists of a set of slots (in most Hadoop systems, each CPU core is considered as one slot), and the data storage unit has a specific capacity. We assume a cluster with M resources:

$Cluster = \{R_1, \ldots, R_M\}$

$R_j = \langle Slots_j, Mem_j \rangle$

• $Slots_j$ is the set of slots in resource $R_j$, where each slot ($slot_j^k$) has a specific execution rate ($exec\_rate_j^k$). Generally, slots belonging to one resource have the same execution rate.

$Slots_j = \{slot_j^1, \ldots, slot_j^m\}$

• $Mem_j$ is the storage unit of resource $R_j$, which has a specific capacity ($capacity_j$) and data retrieval rate ($retrieval\_rate_j$). The data retrieval rate of resource $R_j$ depends on the bandwidth within the storage unit of this resource.

2. Data in the Hadoop system is organized into files, which are usually large. Each file is split into small pieces, which are called slices (usually, all slices in a system have the same size). We assume that there are L files in the system, defined as follows:

$Files = \{F_1, \ldots, F_L\}$

$F_i = \{slice_i^1, \ldots, slice_i^k\}$

3. We assume that there are N users in the Hadoop system, where each user ($U_i$) submits a set of jobs to the system ($Jobs_i$):

$Users = \{U_1, \ldots, U_N\}$

$U_i = \langle Jobs_i \rangle$

$Jobs_i = \{J_i^1, \ldots, J_i^n\}$

The Hadoop system assigns a priority and a minimum share to each user based on a particular policy (e.g., the pricing policy of [8]). The number of slots assigned to user $U_i$ depends on her priority ($priority_i$). The minimum share of user $U_i$ ($min\_share_i$) is the minimum number of slots that the system must provide for user $U_i$ at each point in time.

In a Hadoop system, the set of jobs of a user is dynamic, meaning that the set of jobs for user $U_i$ at time $t_1$ may be completely different from the set of jobs of this user at time $t_2$. Each job in the system consists of a number of map tasks and reduce tasks. The sets of map tasks and reduce tasks for job $J_i$ are represented by $Maps_i$ and $Reds_i$, respectively:

$J_i = Maps_i \cup Reds_i$

Each map task $k$ of job $i$ ($MT_i^k$) performs some processing on the slice ($slice_j^l \in F_j$) where the required data for this task is located:

$Maps_i = \{MT_i^1, \ldots, MT_i^k\}$

Each reduce task $k$ of job $i$ ($RT_i^k$) receives and processes the results of some of the map tasks of job $i$:

$Reds_i = \{RT_i^1, \ldots, RT_i^k\}$

$mean\_execTime(J_i, R_j)$ denotes the mean execution time of job $J_i$ on resource $R_j$, and the corresponding execution rate is defined as

$mean\_execRate(J_i, R_j) = 1 / mean\_execTime(J_i, R_j)$
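As a reference point, the model above might be transcribed into code roughly as follows. The Python dataclasses and field names are our own illustrative mapping of the paper's notation, not part of any Hadoop implementation.

```python
from dataclasses import dataclass

@dataclass
class Slot:
    exec_rate: float       # exec_rate_j^k: task executions per time unit

@dataclass
class Resource:
    slots: list            # Slots_j; usually all slots share one rate
    capacity: float        # storage capacity of Mem_j
    retrieval_rate: float  # data retrieval rate of Mem_j

@dataclass
class Job:
    map_tasks: list        # Maps_i
    reduce_tasks: list     # Reds_i

@dataclass
class User:
    jobs: list             # Jobs_i; dynamic over time
    priority: int
    min_share: int         # slots the system must provide at all times

def mean_exec_rate(mean_exec_time: float) -> float:
    # mean_execRate(J_i, R_j) = 1 / mean_execTime(J_i, R_j)
    return 1.0 / mean_exec_time
```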

5 Performance Metrics

In this section, we first define functions which return the status of the Hadoop system and which will be used to define the performance metrics. Then, we introduce the performance metrics related to our scheduling problem.

• Tasks(U, t) and Jobs(U, t) return the sets of tasks and jobs, respectively, of user $U$ at time $t$.

• ArriveTime(J), StartTime(J), and EndTime(J) return the arrival time, start of execution time, and completion time of job $J$, respectively.

• TotalJobs(t) returns the set of all jobs which have arrived to the system up to time $t$.

• RunningTask(slot, t) returns the running task (if there is one) on the slot $slot$ at time $t$; if there is no running task on the slot, the function returns NULL.

• Location(slice, t) returns the set of resources ($R$) which store the slice $slice$ at time $t$.

• AssignedSlots(U, t) returns the set of slots which are executing the tasks of user $U$ at time $t$.

• CompletedJobs(t) returns the set of all jobs that have been completed up to time $t$; the function CompletedTasks(t) is defined analogously for completed tasks.

• Demand(U, t) returns the set of tasks of user $U$ at time $t$ which have not yet been assigned to a slot.

Using the above functions, we define four performance metrics that are useful for a Hadoop system:

1. MinShareDissatisfaction(t) measures how successful the scheduling algorithm is in satisfying the minimum share requirements of the users in the system. If there is a user in the system whose current demand is not zero and whose current share is lower than her minimum share, we compute her dissatisfaction as follows:

IF $|Demand(U, t)| > 0 \;\wedge\; U.min\_share > 0 \;\wedge\; |AssignedSlots(U, t)| < U.min\_share$ THEN

$Dissatisfaction(U, t) = \frac{U.min\_share - |AssignedSlots(U, t)|}{U.min\_share} \times U.priority$

ELSE

$Dissatisfaction(U, t) = 0$

$U.priority$ and $U.min\_share$ denote the priority and the minimum share of user $U$. MinShareDissatisfaction(t) takes into account the distance of all the users from their minimum shares:

$MinShareDissatisfaction(t) = \sum_{U \in Users} Dissatisfaction(U, t)$

When comparing two algorithms, the algorithm with smaller MinShareDissatisfaction(t) has better performance.
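A direct transcription of this metric into Python might look as follows; the snapshot tuple format is an assumption made for the sketch.

```python
# Each snapshot row describes one user at time t:
# (min_share, priority, demand_size, assigned_slots).
def dissatisfaction(min_share, priority, demand_size, assigned_slots):
    if demand_size > 0 and min_share > 0 and assigned_slots < min_share:
        # Weighted relative distance from the user's minimum share.
        return (min_share - assigned_slots) / min_share * priority
    return 0.0

def min_share_dissatisfaction(snapshot):
    # Sum over all users; smaller is better.
    return sum(dissatisfaction(*row) for row in snapshot)

# A user with min_share 50, priority 3, pending demand, but only 20 slots:
print(min_share_dissatisfaction([(50, 3, 4, 20), (0, 1, 2, 5)]))  # 1.8
```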

2. Fairness(t) measures how fair a scheduling algorithm is in dividing the resources among the users in the system. A fair algorithm gives the same share of resources to users with equal priority. However, when the priorities are not equal, the users' shares should be proportional to their priorities. In order to compute the fairness of an algorithm, we take into account the slots which are assigned to each user beyond her minimum share, represented by $\Delta(U, t)$:

$\Delta(U, t) = |AssignedSlots(U, t)| - U.min\_share$

Then, the average additional share of all the users with the same priority ($Users_p$) is defined as

$avg(p, t) = \frac{\sum_{U \in Users_p} \Delta(U, t)}{|Users_p|}, \quad Users_p = \{U \mid U \in Users \wedge U.priority = p\},$

and Fairness(t) is computed as the sum of distances of all the users in each priority level from the average for that priority level:

$Fairness(t) = \sum_{p \in priorities} \; \sum_{U \in Users_p} |\Delta(U, t) - avg(p, t)|$

Comparing two algorithms, the algorithm with lower Fairness(t) achieves better performance.
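Fairness(t) admits a similarly direct transcription; again the snapshot tuple format is assumed for illustration.

```python
from collections import defaultdict

def fairness(snapshot):
    # snapshot rows: (priority, min_share, assigned_slots) per user.
    # Group each user's extra share Delta(U, t) by priority level.
    extra_by_priority = defaultdict(list)
    for priority, min_share, assigned_slots in snapshot:
        extra_by_priority[priority].append(assigned_slots - min_share)
    # Sum distances from each priority level's average extra share.
    total = 0.0
    for deltas in extra_by_priority.values():
        avg = sum(deltas) / len(deltas)
        total += sum(abs(d - avg) for d in deltas)
    return total

# Priority 3 users hold extras 2 and 4; priority 2 users hold 1 and 1:
print(fairness([(3, 50, 52), (3, 40, 44), (2, 20, 21), (2, 30, 31)]))  # 2.0
```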

3. Locality(t) is defined as the number of tasks which are running on the same resource where their stored data is located. Since the input data size in a Hadoop system is large, and the map tasks of a job are required to send their results to the reduce tasks of that job, the communication cost can be quite significant. A map task is defined to be local on a resource $R$ if it is running on resource $R$ and its required slice is also stored on resource $R$. Comparing two scheduling algorithms, the algorithm with larger Locality(t) has better performance.

4. MeanCompletionTime(t) is the average completion time of all the completed jobs in the system.
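The last two metrics are straightforward to compute from the status functions; the record formats below are assumptions, and completion time is measured here from arrival to completion.

```python
def locality(map_tasks):
    # map_tasks: list of (running_resource, resources_storing_slice);
    # count tasks running where their input slice is stored (higher is better).
    return sum(1 for resource, stored_on in map_tasks if resource in stored_on)

def mean_completion_time(completed_jobs):
    # completed_jobs: list of (ArriveTime, EndTime) pairs.
    return sum(end - arrive for arrive, end in completed_jobs) / len(completed_jobs)

print(locality([("R1", {"R1", "R3"}), ("R2", {"R1"})]))  # 1
print(mean_completion_time([(0, 10), (5, 25)]))          # 15.0
```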

6 Proposed Scheduler Model

In this section we first discuss the characteristics of our proposed algorithm, comparing them with current Hadoop scheduling algorithms. Then, we present our proposed scheduling algorithm for the Hadoop system.

6.1 Motivating Our Algorithm

In this part we discuss the important characteristics of our proposed algorithm, based on the challenges of the Hadoop system.

1. Scheduling based on fairness, minimum share requirements, and the heterogeneity of jobs and resources. In a Hadoop system, satisfying the minimum shares of the users is the first critical issue; the next important issue is fairness. We design a scheduling algorithm with two stages. In the first stage, the algorithm considers the satisfaction of the minimum share requirements of all the users. Then, in the second stage, the algorithm considers fairness among all the users in the system. Most current Hadoop scheduling algorithms consider fairness and minimum share objectives without considering the heterogeneity of the jobs and the resources. One of the advantages of our proposed algorithm is that while it satisfies the fairness and minimum share requirements, it further matches jobs with resources based on job features (e.g., estimated execution time) and resource features (e.g., execution rate). Consequently, the algorithm reduces the completion time of jobs in the system.

2. Reducing communication costs. In a Hadoop cluster, the network links among the resources have varying bandwidth capabilities. Moreover, in a large cluster, the resources are often located far from each other. The Hadoop system distributes tasks among the resources to reduce a job's completion time. However, Hadoop does not consider the communication costs among the resources. In a large cluster with heterogeneous resources, maximizing a task's distribution may result in significant communication costs, increasing the corresponding job's completion time. In our proposed algorithm, we consider the heterogeneity and distribution of resources in the task assignment.

3. Reducing the search overhead for matching jobs and resources. To find the best matching of jobs and resources, an exhaustive search is required. In our algorithm, we use clustering techniques to restrict the search space. Jobs are classified based on their requirements. Every time a resource becomes available, the scheduler searches through the classes of jobs (rather than the individual jobs) to find the best matching, using optimization techniques. The solution of the optimization problem yields a set of suggested classes for each resource, which is used for making the routing decision. We limit the number of times that this optimization is performed, in order to avoid adding significant overhead.

4. Increasing locality. In order to increase locality in a Hadoop system, we should increase the probability that tasks are assigned to resources which also store their input data. Our algorithm makes scheduling decisions based on the suggested set of job classes for each resource. Therefore, we can replicate the required data of a resource's suggested classes on that resource. Consequently, locality in the system will be increased.

6.2 Proposed Algorithm

In this section, we first present a high level view of our proposed algorithm (Figure 1), and then discuss the different parts of our algorithm in more detail.

A typical Hadoop scheduler receives two main messages: a new job arrival message from a user, and a heartbeat message from a free resource. Therefore, our proposed scheduler consists of two main processes, each triggered by receiving one of these messages. Upon receiving a new job from a user, the scheduler performs a queueing process to store the incoming job in an appropriate queue.

When receiving a heartbeat message from a resource, the scheduler triggers the routing process to assign a job to the free resource. Our algorithm uses the job classification, so when a new job arrives to the system, the queueing process determines the class of the job and stores the job in the queue of its class. The queueing process sends the updated information of all the classes to the routing process, which uses this information to choose a job for the current free resource. In what follows, we provide greater detail for our proposed algorithm.
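A minimal sketch of the two event handlers might look as follows, assuming a classification function and a precomputed map of suggested classes per resource (computed by the optimization described below). All names here are hypothetical, and the real routing process additionally weighs users' priorities and remaining minimum shares when choosing among the suggested classes.

```python
class ProposedScheduler:
    def __init__(self, classify, suggested_classes):
        self.classify = classify                    # job -> class label
        self.suggested_classes = suggested_classes  # resource -> class labels (SC_j)
        self.queues = {}                            # one FIFO queue per class

    def on_job_arrival(self, job):
        # Queueing process: place the incoming job in its class's queue.
        self.queues.setdefault(self.classify(job), []).append(job)

    def on_heartbeat(self, resource, free_slots):
        # Routing process: fill each free slot using only the classes
        # suggested for this resource by the optimization.
        chosen = []
        for _ in range(free_slots):
            for job_class in self.suggested_classes.get(resource, []):
                queue = self.queues.get(job_class)
                if queue:
                    chosen.append(queue.pop(0))
                    break
        return chosen
```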


Figure 1: High level view of our proposed algorithm

1. Job Execution Time Estimation: When a new job arrives to the system, its mean execution time on each resource must be estimated. The Task Scheduler component uses the Program Analyzer to estimate the mean execution time of the new incoming job on all resources ($mean\_execTime(J_i, R_j)$). The Task Scheduler component was introduced by the AMP Lab at UC Berkeley [2].

2. Two Classifications: The Hadoop system requires that upon a user's request at any time, it provides her minimum share immediately. Therefore, it is critical for the system to first consider satisfying the minimum shares of all users. After satisfying the minimum shares, the system should consider dividing the resources among the users in a fair way (to prevent starvation of any user). Based on these two facts, our algorithm has two classifications: a minimum share classification and a fairness classification. In the minimum share classification, only the jobs whose users have $min\_share > 0$ are considered, while in the fairness classification all the jobs in the system are considered.

When a user asks for more than her minimum share, the scheduler assigns her minimum share immediately, but the extra share is assigned fairly after considering all users. As a result, users with $min\_share > 0$ should first be considered in the minimum share classification, and once they receive their minimum shares, they should be considered in the fairness classification. However, the current share of a user, and consequently her minimum share satisfaction, can vary considerably over time, and it is not feasible to generate a new classification each time the minimum share satisfaction changes. Therefore, we consider a job whose user has $min\_share > 0$ in both classifications, and make the scheduling decision for the job based on its user's status at the time of scheduling.

In our algorithm, both the minimum share and the fairness classifications group the jobs such that the jobs in the same class have the same features (i.e., priority, execution rate on the resources ($mean\_execRate(J_i, R_j)$), and arrival rate). We define the set of classes generated in the minimum share classification as $JobClasses1$, where each class is denoted by $C_i$. Each class $C_i$ has a specific priority, which is equal to the priority of the jobs in this class. The estimated arrival rate of the jobs in class $C_i$ is denoted by $\alpha_i$, and the estimated execution rate of jobs in class $C_i$ on resource $R_j$ is denoted by $\mu_{i,j}$. Hence, the heterogeneity of resources is completely captured by $\mu_{i,j}$. We assume that the total number of classes generated by this classification is $F$:

$JobClasses1 = \{C_1, \ldots, C_F\}$

The fairness classification is the same as the minimum share classification, except that it is performed on all the jobs, regardless of their users' min_share amounts. We define the set of classes generated by this classification as $JobClasses2$. Each class, denoted by $C'_i$, has a specific priority, which is equal to the priority of the jobs in this class. The arrival rate of the jobs in class $C'_i$ is denoted by $\alpha'_i$, and the execution rate of the jobs in class $C'_i$ on resource $R_j$ by $\mu'_{i,j}$. We assume that the total number of classes generated by this classification is $F'$:

$JobClasses2 = \{C'_1, \ldots, C'_{F'}\}$

For example, Yahoo uses the Hadoop system in production for a variety of products (job types) [4]: Data Analytics, Content Optimization, Yahoo! Mail Anti-Spam, Ad Products, and several other applications. Typically, the Hadoop system defines a user for each job type, and the system assigns a minimum share and a priority to each user. For example, assume a Hadoop system (called Exp1) with the parameters in Table 1.

User    Job Type                     min_share   priority
User1   Advertisement Products       50          3
User2   Data Analytics               20          2
User3   Advertisement Targeting      40          3
User4   Search Ranking               30          2
User5   Yahoo! Mail Anti-Spam        0           1
User6   User Interest Prediction     0           2

Table 1: The Hadoop system example (Exp1)

The jobs in the Exp1 system at a given time t are presented in Table 2, where the submitted jobs of a user are of that user's job type (e.g., J4, submitted by User1, is an advertisement product job, while J5, submitted by User2, is a data analytics job).

User    Job Queue
User1   {J4, J10, J13, J17}
User2   {J1, J5, J9, J12, J18}
User3   {J2, J8, J20}
User4   {J6, J14, J16, J21}
User5   {J7, J15}
User6   {J3, J11, J19}

Table 2: The job queues in Exp1 at time t

The minimum share classification of the jobs in the Exp1 system at time t is presented in Figure 2. Note that here we assume there is just one resource in the system; in a system with more than one resource, the mean execution time for each class is represented by an array, showing the execution time of the class on each resource.

Figure 2: The minimum share classification of the jobs in Exp1 system at time t

The fairness classification of the Exp1 system at time t is presented in Figure 3. As with the minimum share classification, we assume that there is just one resource in the system.

Figure 3: The fairness classification of the jobsin Exp1 system at time t
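A compact way to realize both classifications is to group jobs by their shared feature tuple. The JobInfo fields below are assumed for illustration; the minimum share classification (JobClasses1) simply restricts attention to users with positive minimum share.

```python
from collections import defaultdict, namedtuple

# Illustrative job record; exec_rates holds the per-resource execution rates.
JobInfo = namedtuple("JobInfo", "name priority exec_rates arrival_rate user_min_share")

def classify(jobs, min_share_only):
    classes = defaultdict(list)
    for job in jobs:
        if min_share_only and job.user_min_share == 0:
            continue  # JobClasses1 covers only users with min_share > 0
        # Jobs sharing priority, execution rates, and arrival rate form one class.
        key = (job.priority, job.exec_rates, job.arrival_rate)
        classes[key].append(job)
    return classes

jobs = [
    JobInfo("J4", 3, (0.5,), 2.0, 50),
    JobInfo("J10", 3, (0.5,), 2.0, 50),  # same features as J4: same class
    JobInfo("J7", 1, (0.8,), 1.0, 0),    # min_share 0: fairness classes only
]
print(len(classify(jobs, min_share_only=True)))   # 1
print(len(classify(jobs, min_share_only=False)))  # 2
```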

3. Optimization approach: In order to find an appropriate matching of jobs and resources, we define an optimization problem based on the properties of the job classes and the features of the resources. The scheduler solves the following linear program (LP) for the classes in the set $JobClasses1$. Here $\delta_{i,j}$ is defined as the proportion of resource $R_j$ which is allocated to class $C_i$, and $\lambda$ is the factor by which we simultaneously scale the arrival rates of all classes. We aim to maximize $\lambda$ while keeping the system stable.

$\max \lambda$

subject to

$\sum_{j=1}^{M} C_i.\mu_{i,j} \times \delta_{i,j} \geq \lambda \times C_i.\alpha_i, \quad \text{for all } i = 1, \ldots, F, \qquad (1)$

$\sum_{i=1}^{F} \delta_{i,j} \leq 1, \quad \text{for all } j = 1, \ldots, M, \qquad (2)$

$\delta_{i,j} \geq 0, \quad \text{for all } i = 1, \ldots, F, \text{ and } j = 1, \ldots, M. \qquad (3)$

In the above LP, M is the total number of resources in the system, and F is the total number of minimum share classes ($|JobClasses1|$). Maximizing the sustainable arrival-rate scaling amounts to minimizing the maximum load over all resources. After solving this LP, we have the allocation matrix $\delta_{i,j}$ for each class $C_i$ and each resource $R_j$. Based on the results of this LP, we define the set $SC_j$ for each resource $R_j$ as follows:

$SC_j = \{C_i : \delta_{i,j} \neq 0\}$

Note that this is the only place we use $\delta_{i,j}$; its value is used only to find the nonzero entries defining the set $SC_j$. For example, consider a system with two classes of jobs and two resources ($M = 2$, $F = 2$), in which the arrival and execution rates are as follows:

$\alpha = \begin{bmatrix} 2.45 & 2.45 \end{bmatrix}$ and $\mu = \begin{bmatrix} 9 & 5 \\ 2 & 1 \end{bmatrix}$

Solving the above LP gives $\lambda = 1.0204$ and

$\delta = \begin{bmatrix} 0 & 0.5 \\ 1 & 0.5 \end{bmatrix}$

Therefore, the sets for resources $R_1$ and $R_2$ will be $SC_1 = \{C_2\}$ and $SC_2 = \{C_1, C_2\}$, respectively. These two sets define the suggested classes for each resource; i.e., upon receiving a heartbeat from resource $R_1$, the scheduler should select a job from class $C_2$, while upon receiving a heartbeat from resource $R_2$, it may choose a job from either class $C_1$ or $C_2$. Even though resource $R_1$ has the fastest rate for class $C_1$, the algorithm does not assign any jobs of class $C_1$ to it. This occurs because the system is highly loaded, and since $\mu_{1,1}/\mu_{1,2} < \mu_{2,1}/\mu_{2,2}$ (resource $R_1$'s relative speed advantage over $R_2$ is larger for class $C_2$ than for class $C_1$) and $\alpha_1 = \alpha_2$, the mean completion time of the jobs is decreased if resource $R_1$ only executes class $C_2$ jobs.
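The worked example can be checked with an off-the-shelf LP solver. The sketch below uses scipy.optimize.linprog (our choice of tooling, not the paper's implementation); since an LP may have multiple optimal vertices, a solver is only guaranteed to reproduce $\lambda$, though for this instance the optimal $\delta$ is unique.

```python
import numpy as np
from scipy.optimize import linprog

alpha = np.array([2.45, 2.45])      # arrival rate of each class
mu = np.array([[9.0, 5.0],          # mu[i, j]: rate of class i on resource j
               [2.0, 1.0]])

# Decision vector x = [d11, d12, d21, d22, lam]; minimize -lam = maximize lam.
c = [0, 0, 0, 0, -1]
A_ub = [
    [-mu[0, 0], -mu[0, 1], 0, 0, alpha[0]],  # constraint (1), class 1
    [0, 0, -mu[1, 0], -mu[1, 1], alpha[1]],  # constraint (1), class 2
    [1, 0, 1, 0, 0],                         # constraint (2), resource 1
    [0, 1, 0, 1, 0],                         # constraint (2), resource 2
]
b_ub = [0, 0, 1, 1]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 5)
print(round(-res.fun, 4))                # 1.0204
print(res.x[:4].reshape(2, 2).round(2))  # [[0.  0.5], [1.  0.5]]
```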

A similar optimization problem is used for the classes defined in the fairness classification. The scheduler solves the following LP for the classes in the set $JobClasses2$. Here $\delta'_{i,j}$ is defined as the proportion of resource $R_j$ which is allocated to class $C'_i$, and $\lambda'$ is the factor by which we simultaneously scale the arrival rates of all classes. We aim to maximize $\lambda'$ while keeping the system stable.

$\max \lambda'$

subject to

$\sum_{j=1}^{M} C'_i.\mu'_{i,j} \times \delta'_{i,j} \geq \lambda' \times C'_i.\alpha'_i, \quad \text{for all } i = 1, \ldots, F', \qquad (4)$

$\sum_{i=1}^{F'} \delta'_{i,j} \leq 1, \quad \text{for all } j = 1, \ldots, M, \qquad (5)$

$\delta'_{i,j} \geq 0, \quad \text{for all } i = 1, \ldots, F', \text{ and } j = 1, \ldots, M. \qquad (6)$

As with the LP for the minimum share classification, this linear program finds the best classes for resource allocation based on the requirements of the jobs, the arrival rates, and the features of the resources. We define the set $SC'_j$ for each resource $R_j$ as the set of classes which are allocated to this resource based on the result of this LP. Note that this is the only place we use $\delta'_{i,j}$; we do not use the actual values.

$SC'_j = \{C'_i : \delta'_{i,j} \neq 0\}$

4. Job selection: When the scheduler receives a heartbeat from a resource, say $R_j$, it triggers the routing process. The first stage in the routing process is the Job Selector component. This component selects a job for each free slot in resource $R_j$ and sends the selected job for each slot to the Task Scheduler component. The Task Scheduler, introduced in [2], chooses a task of the selected job to assign to the free slot.

7 Evaluation

In this section we first describe our evaluation environment, and then we provide our experimental results.

7.1 Experimental Environment

To evaluate our proposed algorithm, we use a Hadoop simulator, MRSIM [6]. MRSIM is a discrete event based MapReduce simulator which accurately models the Hadoop environment. The simulator allows us to measure the scalability of MapReduce based applications easily and quickly, while capturing the effects of different Hadoop configurations on performance.

We extend this simulator to measure the four Hadoop performance metrics introduced in Section 5. We also add a job submission component to the simulator. Using this component, we can define various users with different minimum shares and priorities; each user can submit various types of jobs to the system with different arrival rates. Moreover, we add a scheduler component to the simulator, which receives the incoming jobs and stores them in the queues chosen by the system's scheduling algorithm. Upon receiving a heartbeat message, it sends a task to the free slot of the resource.

Our experimental environment consists of a cluster of 6 heterogeneous resources, whose features are presented in Table 3. The bandwidth between the resources is 100 Mbps.

Resource   slot#   Slot execRate   Mem Capacity   Mem RetrievalRate
R1         1       500 MHz         4 GB           40 Mbps
R2         1       500 MHz         4 TB           100 Gbps
R3         1       500 MHz         4 TB           100 Gbps
R4         8       500 MHz         4 GB           40 Mbps
R5         8       500 MHz         4 GB           40 Mbps
R6         8       4.2 GHz         4 TB           100 Gbps

Table 3: Experiment resources

We define our workload using a Loadgen example job in Hadoop that is used in Hadoop's included Gridmix benchmark. Loadgen is a configurable job; by choosing various percentages for keepMap and keepReduce, we can make the job equivalent to various workloads used in Gridmix, such as sort and filter.

We generate four types of jobs in the system: small jobs, with small I/O and CPU requirements (1 map and 1 reduce task); I/O-heavy jobs, with large I/O and small CPU requirements (10 map tasks and 1 reduce task); CPU-heavy jobs, with small I/O and large CPU requirements (1 map task and 10 reduce tasks); and large jobs, with large I/O and large CPU requirements (10 map and 10 reduce tasks). Using these jobs, we define three workloads: an I/O-Intensive workload, in which all jobs are I/O-bound; a CPU-Intensive workload; and a Mixed workload, which includes all job types. The workloads are given in Table 4.

Workload   Workload Type    Jobs Included
W1         I/O-Intensive    small, I/O-heavy
W2         CPU-Intensive    small, CPU-heavy
W3         Mixed            all job types

Table 4: Experimental workloads

Considering various arrival rates for the jobs in each workload, we define three benchmarks for each workload in Table 5. Here BM_{i,j} denotes benchmark j of workload i; for instance, BM_{1,1} is a benchmark of I/O-Intensive jobs in which the arrival rate of smaller jobs is higher than the arrival rate of larger ones. In total, we define nine benchmarks to run in our simulated Hadoop environment.

Benchmark   Arrival Rate Ordering
BM_{i,1}    smaller jobs have higher arrival rates
BM_{i,2}    arrival rates are equal for all jobs
BM_{i,3}    larger jobs have higher arrival rates

Table 5: Experiment benchmarks

We submit 100 jobs to the system, which is sufficient to capture a variety of the behaviours of a Hadoop system, and is the same number of jobs used in evaluating most Hadoop scheduling systems [9]. The Hadoop block size is set to 64 MB, which is the default size in Hadoop 0.21. We generate job input data sizes similar to the workload used in [9] (which is derived from a real Hadoop workload): the input data of a job is defined by the number of map tasks of the job, creating a data set of the correct size (there is one map task per 64 MB input block).

7.2 Results

This section provides the results of our experiments. In each experiment we compare our proposed algorithm with the FIFO algorithm and the version of the Fair-sharing algorithm presented in [9]. The comparison is based on the four performance metrics introduced in Section 5. For each experiment, we run 30 replications in order to construct 95 percent confidence intervals.

Figure 4: Dissatisfaction performance metric of the algorithms in the I/O-Intensive workload

Figure 5: Dissatisfaction performance metric of the algorithms in the CPU-Intensive workload

Figures 4, 5, and 6 present the Dissatisfaction metric of the algorithms running the benchmarks of the I/O-Intensive, CPU-Intensive, and Mixed workloads, respectively. The lower and upper bounds of the confidence intervals are represented by lines on each bar.

Figure 6: Dissatisfaction performance metric of the algorithms in the Mixed workload

Based on these results, our proposed algorithm leads to considerable improvement in the Dissatisfaction performance metric, for two main reasons. First, our proposed algorithm considers the minimum share satisfaction of the users as its initial goal: when receiving a heartbeat from a resource, it first satisfies the minimum shares of the users. Second, our algorithm considers the priority of the users in satisfying their minimum shares. Therefore, the highest priority user who has not yet received her minimum share is considered first. However, since the algorithm considers the product of the remaining minimum share and the priority of the user, it does not let a high priority user with a high minimum share starve lower priority users with smaller minimum shares. This is an important issue which is not considered in the Fair-sharing algorithm. Like our algorithm, the Fair-sharing algorithm has the initial goal of satisfying the minimum shares of the users. However, since the Fair-sharing algorithm does not change the ordering of the users who have not received their minimum shares, it causes higher Dissatisfaction in the system. The Fair-sharing algorithm defines pools of jobs, where each pool is dedicated to a user. Since the order of the pools (which represent users) is fixed, the algorithm always checks the users' minimum share satisfaction in that order. Therefore, if there is a user at the head of this ordering who has a large minimum share requirement and low priority, she may create a long delay for other users with higher priority. Moreover, the Fair-sharing algorithm does not consider the users' priorities in the order in which it satisfies their minimum shares.

Figure 7: Mean Completion Time performance metric of the algorithms in the I/O-Intensive workload

Figure 8: Mean Completion Time performance metric of the algorithms in the CPU-Intensive workload

Figures 7, 8, and 9 present the Mean Completion Time metric of the algorithms running the benchmarks of the I/O-Intensive, CPU-Intensive, and Mixed workloads, respectively. The results show that, compared to the other algorithms, our proposed algorithm achieves the best mean completion time on all the benchmarks. Compared to the FIFO algorithm, our algorithm leads to a significant reduction in the mean completion time of the jobs. This improvement can be explained by the fact that, unlike the other two algorithms, our proposed algorithm considers heterogeneity, making a proper scheduling decision based on the job requirements and the resource features.

Figure 9: Mean Completion Time performance metric of the algorithms in the Mixed workload

Table 6 presents the Fairness metric of the algorithms on the various defined benchmarks. For each benchmark, the table shows the 95% confidence interval for Fairness under the corresponding scheduling algorithm. Comparing the algorithms, the Fair-sharing algorithm has the best Fairness. This is as expected, because the main goal of this algorithm is improving the Fairness metric.

Benchmark   FIFO             FAIR             MyALG
BM_{1,1}    (14.88, 15.05)   (11.59, 11.65)   (14.68, 16.08)
BM_{1,2}    (14.93, 15.00)   (11.57, 11.72)   (12.68, 14.60)
BM_{1,3}    (14.63, 15.26)   (11.59, 11.76)   (17.23, 17.65)
BM_{2,1}    (14.77, 15.22)   (11.63, 11.98)   (11.99, 12.34)
BM_{2,2}    (14.83, 15.09)   (11.81, 12.12)   (13.99, 14.36)
BM_{2,3}    (14.42, 15.73)   (11.81, 11.94)   (17.37, 17.72)
BM_{3,1}    (14.94, 15.37)   (11.47, 12.71)   (14.11, 15.05)
BM_{3,2}    (14.73, 15.62)   (11.72, 12.46)   (14.41, 14.98)
BM_{3,3}    (15.00, 15.44)   (11.89, 12.07)   (12.11, 13.31)

Table 6: Fairness performance metric of the algorithms for all workloads

Our proposed algorithm considers heterogeneity and assigns jobs based on the features of the resources; it does not blindly assign each job to each free resource. Moreover, our algorithm first satisfies the minimum shares of the users. Then, after receiving her minimum share, the corresponding user is considered along with all other users (the second level of classification in our algorithm) to achieve fairness in dividing the resources among the users in the system. On some benchmarks, our algorithm leads to an increase in the Fairness metric. However, because of the importance of the users with nonzero minimum shares, this side effect may be considered acceptable. Generally, the minimum shares of the users are assigned based on business rules, which have high priority for most companies. As a result, a small increase in Fairness may be acceptable for most Hadoop systems if it results in better satisfaction of the users' minimum shares and a significant reduction in the mean completion time of the jobs.

Benchmark   FIFO             FAIR             MyALG
BM_{1,1}    (96.60, 98.03)   (98.12, 99.08)   (98.62, 99.98)
BM_{1,2}    (47.39, 57.81)   (89.84, 91.76)   (93.82, 95.38)
BM_{1,3}    (62.93, 65.07)   (71.43, 74.57)   (66.44, 71.55)
BM_{2,1}    (90.38, 94.42)   (97.12, 98.08)   (98.56, 99.87)
BM_{2,2}    (68.65, 82.15)   (93.93, 96.87)   (91.78, 95.42)
BM_{2,3}    (78.73, 84.07)   (94.14, 97.86)   (93.78, 97.42)
BM_{3,1}    (73.48, 86.92)   (78.77, 83.63)   (99.12, 100.00)
BM_{3,2}    (92.36, 95.24)   (81.27, 87.13)   (95.11, 99.69)
BM_{3,3}    (79.23, 88.37)   (78.02, 86.37)   (66.86, 76.73)

Table 7: Locality performance metric of the algorithms for all workloads

Table 7 presents the Locality metric of the algorithms on the various defined benchmarks. For each benchmark, the table shows the 95% confidence interval for Locality under the corresponding scheduling algorithm. The locality of our proposed algorithm is close to, and in some cases better than, that of the Fair-sharing algorithm. This can be explained by the fact that our algorithm chooses replication locations based on the suggested classes for each resource.

Another significant feature of our proposed algorithm is that although it uses sophisticated approaches to solve the scheduling problem, it does not add considerable overhead. First, we limit the number of times classification is performed by considering aggregate job features (i.e., mean execution time and arrival rate); this results in each class covering a group of job types, rather than just one job type. Also, since some jobs in a Hadoop system are submitted multiple times by users, these jobs do not require changing the classification each time they are submitted to the system.

8 Conclusion and Future Work

The primary Hadoop scheduling algorithms do not consider the heterogeneity of the Hadoop system in making scheduling decisions. In order to keep the algorithms simple, they use minimal system information, which in some cases can result in poor performance. Growing interest in applying the MapReduce programming model in various applications gives rise to greater heterogeneity, whose impact on performance must therefore be considered. It has been shown that it is possible to estimate system parameters in a Hadoop system. Using this system information, we designed a scheduling algorithm which classifies jobs based on their requirements and finds an appropriate matching of resources and jobs in the system. Our algorithm adapts to variations in the system parameters: the classification part detects changes and adapts the classes to the new system parameters, and the mean job execution times are estimated when a new job is submitted to the system, which makes the scheduler adaptable to changes in job execution times. We have received permission to use the workload of a high profile company, and we are currently defining benchmarks based on this workload to evaluate our algorithm. Finally, we aim to implement and evaluate our algorithm in a real Hadoop cluster.

ACKNOWLEDGEMENTS

This work was supported by the Natural Sciences and Engineering Research Council of Canada. A major part of this work was done while both authors were visiting UC Berkeley. In particular, the first author would like to thank Ion Stoica and Sameer Agarwal for sharing the Task Scheduler component, and for their comments on our proposed algorithm.

References

[1] A. Aboulnaga, Z. Wang, and Z. Y. Zhang. Packing the most onto your cloud. In Proceedings of the First International Workshop on Cloud Data Management, 2009.

[2] S. Agarwal and G. Ananthanarayanan. Think global, act local: analyzing the trade-off between queue delays and locality in MapReduce jobs. Technical report, EECS Department, University of California, Berkeley, 2010.

[3] Apache. Hadoop on Demand documentation, 2007. [Online; accessed 30-November-2010].

[4] R. Bodkin. Yahoo! updates from Hadoop Summit 2010. http://www.infoq.com/news/2010/07/yahoo-hadoop-summit, July 2010.

[5] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51:107-113, January 2008.

[6] S. Hammoud, M. Li, Y. Liu, N. K. Alham, and Z. Liu. MRSim: a discrete event based MapReduce simulator. In Proceedings of the 7th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2010), pages 2993-2997. IEEE, 2010.

[7] K. Morton, M. Balazinska, and D. Grossman. ParaTimer: a progress indicator for MapReduce DAGs. In Proceedings of the International Conference on Management of Data, pages 507-518, 2010.

[8] T. Sandholm and K. Lai. Dynamic proportional share scheduling in Hadoop. In Proceedings of the 15th Workshop on Job Scheduling Strategies for Parallel Processing, pages 110-131. Springer, Heidelberg, 2010.

[9] M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Job scheduling for multi-user MapReduce clusters. Technical Report UCB/EECS-2009-55, EECS Department, University of California, Berkeley, April 2009.

[10] M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In Proceedings of the European Conference on Computer Systems (EuroSys), Paris, France, 2010.