RESOURCE-AWARE SCHEDULING FOR HADOOP

Lu Wei
Project No: H064420
Supervisor: Professor Tan Kian-Lee

National University of Singapore, School of Computing
Department of Information Systems

May 25, 2015
MapReduce & Hadoop
MapReduce
• Distributed data processing framework by Google
• Job
  – Map function
  – Reduce function (see the example below)
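For illustration, here is the canonical MapReduce example, word count, written against Hadoop's Java API; a minimal sketch, not code from this project. The framework sorts and groups the map output by key before the reduce function runs.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Map function: emit (word, 1) for every word in an input line.
  public static class WordMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(line.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        ctx.write(word, ONE);
      }
    }
  }

  // Reduce function: sum the counts emitted for each distinct word.
  public static class WordReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      ctx.write(word, new IntWritable(sum));
    }
  }
}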
Hadoop Architecture
[Architecture diagram]
Existing Schedulers
Early Schedulers
• FIFO: MapReduce default, by Google
  – Priority level & submission time
  – Data locality
  – Problem: starvation of other jobs in the presence of a long-running job
• Hadoop On Demand (HOD): by Yahoo!
  – Fairness: static node allocation using the Torque Resource Manager
  – Problem: poor data locality & underutilization
Mainstream Schedulers
• Fair Scheduler: by Facebook
  – Fairness: dynamic resource redistribution
  – Challenges:
    • Data locality: solved with delay scheduling
    • Reduce/map dependence: solved with copy-compute splitting
• Capacity Scheduler: by Yahoo!
  – Similar to Fair Scheduler
  – Special support for memory-intensive jobs
Alternative Schedulers
• Adaptive Scheduler (2010-2011)
  – Goal/deadline oriented
  – Adaptively establishes predictions by job matching
  – Problem: strong assumptions & questionable performance
• Machine Learning Approach (2010)
  – Naïve Bayes & Perceptron with the aid of user hints
  – Better performance than FIFO
  – Problem: underutilization during the learning phase & overhead
Existing Schedulers

Scheduler          | Pro                                        | Con                                                | Resource-Awareness
FIFO               | High throughput                            | Starvation of short jobs                           | Data locality
HOD                | Sharing of cluster                         | Poor data locality & underutilization              | -
Fair Scheduler     | Fairness & dynamic resource re-allocation  | Complicated configuration                          | Data locality; copy-compute splitting
Capacity Scheduler | Similar to FS                              | Similar to FS                                      | Special support for memory-intensive jobs
Adaptive Scheduler | Adaptive approach                          | Strong assumptions & questionable performance      | Resource utilization control using job matching
Machine Learning   | Reported better performance than FIFO      | Underutilization during learning phase & overhead  | Resource utilization control using pattern classification
Motivations
• Heterogeneity by configuration
  – Hardware capacity differences among the nodes of a cluster
• Heterogeneity by usage
  – All task slots are treated equally, with no consideration of the resource status of the current node or the resource demands of queuing jobs
  – It is therefore possible for a CPU-busy node to be assigned a CPU-intensive job, and an I/O-busy node an I/O-intensive job
Resource-Aware Scheduler
Design Overview
1. Capture
   – the job's resource demand characteristics
   – the TaskTracker's static capability & runtime usage status
2. Combine and transform these into quantified measurements
3. Predict how fast a given TaskTracker is expected to finish a given task
4. Apply the scheduling policy of choice
Design Details
• TaskTracker Profiling
  – Resource scores: represent availability (see the sketch below)
  – Sampled every second (at every heartbeat) for each TaskTracker
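A minimal sketch of what per-heartbeat resource scores could look like; the class and field names and the 0-to-1 normalization are our assumptions, not the project's exact representation.

// Hypothetical per-TaskTracker availability scores, refreshed at each heartbeat.
class ResourceScores {
  // Each score lies in [0, 1]: 1.0 = fully available, 0.0 = saturated.
  double cpu;
  double diskRead;
  double diskWrite;
  double network;

  // Update from raw utilization (fraction of time busy) observed since
  // the previous heartbeat.
  void update(double cpuBusy, double diskReadBusy,
              double diskWriteBusy, double networkBusy) {
    cpu = 1.0 - cpuBusy;
    diskRead = 1.0 - diskReadBusy;
    diskWrite = 1.0 - diskWriteBusy;
    network = 1.0 - networkBusy;
  }
}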
Design Details
• Task-Based Job Sampling
  – Assumption: a sampled task is representative of the resource demands of the job's other tasks
  – Target measurements: the task's resource demand and the TaskTracker's resource statuses
  – Technique:
    • Periodical re-sampling: avoids over-reliance on a single job sample
  – The sample task's processing time decomposes by resource:

$$t_{sample} = t_{s-cpu} + t_{s-disk} + t_{s-network}$$
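In code form, the sample measurements could be carried in a per-job report along the following lines; the field names are hypothetical, loosely modeled on the MapSampleReport that appears on the implementation slide.

// Hypothetical per-job sample report recorded after a sample map task runs.
class MapSampleReport {
  // Time the sample task spent on each resource, in milliseconds.
  long tSampleCpuMs;
  long tSampleDiskMs;
  long tSampleNetworkMs;

  // Data sizes observed for the sample task, used later for scaling.
  long sampleInBytes;
  long sampleSpillBytes;
  long sampleOutBytes;

  // t_sample = t_{s-cpu} + t_{s-disk} + t_{s-network}
  long totalMs() {
    return tSampleCpuMs + tSampleDiskMs + tSampleNetworkMs;
  }
}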
Design Details
• Task Processing Time Estimation

$$t_{estimate} = t_{e-cpu} + t_{e-disk} + t_{e-network}$$

Expanding against the sample measurements (c denotes resource scores, s data sizes; an s- prefix in a subscript marks the sampled run):

$$t_{estimate} = t_{s-cpu} \times \frac{c_{s-cpu}}{c_{cpu}} + t_{e-disk-in} + t_{e-disk-out} + t_{e-disk-spill} + t_{e-network-in} + t_{e-network-out}$$

with, for example,

$$t_{e-disk-in} = t_{s-disk-in} \times \frac{c_{s-disk-read}}{c_{disk-read}} \times \frac{s_{disk-in}}{s_{s-disk-in}}$$

$$s_{disk-spill} = \frac{s_{s-disk-spill}}{s_{s-in}} \times s_{in}$$

$$s_{network-out} = \frac{s_{out}}{N_{total-reduce}} = \beta_{s-oi-ratio} \times \frac{s_{in}}{N_{total-reduce}}$$
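A sketch of the estimation step, translating the formulas above directly into code. The class, helper, and parameter names are ours: c values are resource scores, s values are data sizes, and "sample" marks measurements from the sampled run.

class TaskTimeEstimator {
  // Estimated CPU time: scale the sampled CPU time by the ratio of the
  // sample node's CPU score to the candidate node's CPU score.
  static double estCpuMs(double tSampleCpuMs, double cSampleCpu, double cCpu) {
    return tSampleCpuMs * (cSampleCpu / cCpu);
  }

  // Estimated disk-in time: scale by the disk-read score ratio and by the
  // input-size ratio (t_{e-disk-in} above).
  static double estDiskInMs(double tSampleDiskInMs,
                            double cSampleDiskRead, double cDiskRead,
                            double sDiskIn, double sSampleDiskIn) {
    return tSampleDiskInMs * (cSampleDiskRead / cDiskRead)
                           * (sDiskIn / sSampleDiskIn);
  }

  // Predicted spill size: grows proportionally with input size (s_{disk-spill}).
  static double spillBytes(double sSampleSpillBytes, double sSampleInBytes,
                           double sInBytes) {
    return (sSampleSpillBytes / sSampleInBytes) * sInBytes;
  }
}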
Design Details
• Scheduling policies
  – Map tasks
    • Shortest Job First (SJF)
    • Starvation of long-running jobs: addressed by periodical re-sampling
  – Reduce tasks
    • Naïve I/O biasing (see the sketch below)
      – Do not schedule an I/O-intensive job on an I/O-busy node when there are other reduce slots with higher disk I/O availability
      – I/O-intensive job: judged using the map-phase sample
      – I/O-busy node: disk I/O scores below the cluster average
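The naive I/O biasing rule can be stated compactly in code; the class, method, and parameter names below are ours.

class ReducePlacementPolicy {
  // Naive I/O biasing for reduce tasks: decline to place an I/O-intensive
  // job's reduce task on an I/O-busy node while a node with higher disk I/O
  // availability still has a free reduce slot.
  static boolean skipReduceOnThisNode(boolean jobIsIoIntensive,
                                      double nodeDiskIoScore,
                                      double clusterAvgDiskIoScore,
                                      boolean freerReduceSlotElsewhere) {
    boolean nodeIsIoBusy = nodeDiskIoScore < clusterAvgDiskIoScore;
    return jobIsIoIntensive && nodeIsIoBusy && freerReduceSlotElsewhere;
  }
}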
Implementation
• New components: ResourceScheduler, MapTaskFinishTimeEstimator, MapSampleReportLogger (backed by a HashMap<JobID, MapSampleReport>), MyJobInProgress (alongside JobInProgress), SampleTaskStatus (alongside TaskStatus)
• Hadoop classes involved: JobTracker, TaskTracker, TaskTrackerStatus, ResourceStatus, ResourceCalculatorPlugin, TaskInProgress, Task
• Data flow: resource profiles yield resource scores; job profiles yield sample task processing times & data sizes; the estimator combines the two into an estimated task processing time
• Source code: https://github.com/weilu/Hadoop-Resource-Aware-Scheduler
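A sketch of how one diagrammed piece could look, reusing the MapSampleReport sketch from the sampling slide. Only the class name and the HashMap keying come from the slide; the method shapes are our assumption, and JobID is simplified to a String.

import java.util.HashMap;
import java.util.Map;

// Keeps the latest sample report per job, keyed by job ID as in the diagram.
class MapSampleReportLogger {
  private final Map<String, MapSampleReport> reports = new HashMap<>();

  void log(String jobId, MapSampleReport report) {
    reports.put(jobId, report);
  }

  MapSampleReport get(String jobId) {
    return reports.get(jobId);
  }
}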
Evaluation & Results
Estimation Accuracy
• Cluster Configuration I
  – Shared with other users and other applications
  – 1 master, 10 slave nodes
  – 1 Gbps network, same rack
  – Each node:
    • 4 processors: Intel Xeon E5607 quad-core CPU (2.26 GHz)
    • 32 GB memory
    • 1 TB hard disk
• Hadoop Configuration
  – HDFS block size: 64 MB
  – Data replication: 1
  – Each node:
    • Map slots: 1
    • Reduce slots: 2
  – Speculative map & reduce tasks: off
  – Completed maps required before scheduling reduces: 1 out of 1000 total maps
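To our reading, these settings map onto standard Hadoop 1.x configuration properties as in the sketch below; shown on a single Configuration object for brevity, although the slot maximums are normally per-TaskTracker settings in mapred-site.xml.

import org.apache.hadoop.conf.Configuration;

public class ClusterIConf {
  public static Configuration build() {
    Configuration conf = new Configuration();
    conf.setLong("dfs.block.size", 64L * 1024 * 1024);         // HDFS block size: 64 MB
    conf.setInt("dfs.replication", 1);                          // data replication: 1
    conf.setInt("mapred.tasktracker.map.tasks.maximum", 1);     // 1 map slot per node
    conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 2);  // 2 reduce slots per node
    conf.setBoolean("mapred.map.tasks.speculative.execution", false);
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
    // Start scheduling reduces once 1 in 1000 maps has completed.
    conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.001f);
    return conf;
  }
}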
Estimation Accuracy
• Workload description:
  – I/O workload: wordcount
    • Counts the occurrences of each word in the given input files
    • Mapper: scans through the input; outputs each word with the word itself as the key and 1 as the value, sorted on the key
    • Reducer: collects records with the same key, adding up the values; outputs the key and the total occurrence count
  – CPU workload: pi estimation
    • Approximates the value of pi by counting the number of points that fall within the unit quarter circle (sketched below)
    • Mapper: reads coordinates of points; counts points inside/outside of the inscribed circle of the square
    • Reducer: accumulates the inside/outside counts from the mappers
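The core of the pi map logic as a plain Monte Carlo loop; a sketch only, since Hadoop's bundled pi example generates points with a Halton sequence rather than a PRNG.

import java.util.Random;

public class PiCore {
  // Count how many of numPoints random points in the unit square fall
  // inside the unit quarter circle; each mapper would run one such loop.
  static long countInside(long numPoints, long seed) {
    Random rng = new Random(seed);
    long inside = 0;
    for (long i = 0; i < numPoints; i++) {
      double x = rng.nextDouble();
      double y = rng.nextDouble();
      if (x * x + y * y <= 1.0) inside++;
    }
    return inside;
  }

  public static void main(String[] args) {
    long n = 100_000_000L; // 10^8 points, as in CPU Workload 1
    long inside = countInside(n, 42);
    // The reducer's role: aggregate the counts, then pi ~ 4 * inside / total.
    System.out.println("pi ~= " + (4.0 * inside / n));
  }
}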
Estimation Accuracy
• I/O Workload 1
[Chart: Estimated vs. actual task execution time (Resource Scheduler, wordcount, 10 nodes, 5 GB input data, single job)]
Estimation Accuracy
• I/O Workload 2
[Chart: Estimated vs. actual task execution time (Resource Scheduler, wordcount, 10 nodes, 5 GB input data, single job)]
Estimation Accuracy
• CPU Workload 1
[Chart: Estimated vs. actual task execution time, Resource Scheduler pi (10 nodes, 100 maps, 10^8 points each, single job)]
Estimation Accuracy
• CPU Workload 2
[Chart: Estimated vs. actual task execution time, Resource Scheduler pi (10 nodes, 100 maps, 10^9 points each, single job)]
Performance Benchmark: Resource Scheduler vs. FIFO Scheduler
Overhead Evaluation: baseline establishment (reality test)
• Cluster Configuration II (differences from Configuration I)
  – Reserved and unshared
  – 1 master, 5 slave nodes
• Workload Description
  – Single I/O job: wordcount
  – Single CPU job: pi estimation
  – Simultaneous submission of an I/O job and a CPU job
Performance Benchmark: Resource Scheduler vs. FIFO Scheduler
Resource-Homogeneous Environment
• Overhead Evaluation
[Table 9: wordcount in a resource-homogeneous environment, 3 runs (summary)]
[Table 10: pi estimation in a resource-homogeneous environment, 3 runs (summary)]
Performance Benchmark: Resource Scheduler vs. FIFO Scheduler
• FIFO vs. Resource Scheduler in a resource-homogeneous environment
Performance Benchmark: Resource Scheduler vs. FIFO Scheduler
• Analysis
  – Negligible overhead
  – The Resource Scheduler performs worse: a slowdown in all measured dimensions and cases
  – Reason: the Resource Scheduler has more concurrently running reducers competing for resources
  – Expectation: the same performance in a busy cluster (all reduce slots constantly filled with running tasks)
[Chart: FIFO vs. Resource Scheduler in a resource-homogeneous environment (simultaneous submission of an I/O job and a CPU job); total map time and total job time in seconds, worst/average/best runs]
Performance Benchmark: Resource Scheduler vs. FIFO Scheduler
Resource-Heterogeneous Environment
• Environment Simulation
  – CPU intervention: non-MapReduce pi estimation
  – Disk I/O intervention: dd 50 GB write-read
• Simulated Environment
  – 3 CPU-busy nodes + 2 disk-I/O-busy nodes
Performance Benchmark: Resource Scheduler vs. FIFO Scheduler
• FIFO vs. Resource Scheduler in a resource-heterogeneous environment (sequential submission of 2 jobs)
Performance Benchmark: Resource Scheduler vs. FIFO Scheduler
• FIFO vs. Resource Scheduler in a resource-heterogeneous environment (concurrent submission of 2 jobs)
Performance Benchmark: Resource Scheduler vs. FIFO Scheduler
[Chart: FIFO vs. Resource Scheduler in a resource-heterogeneous environment (simultaneous submission of an I/O job and a CPU job); total map time and total job time in seconds, worst/average/best runs]
[Chart: total map time percentage slowdown of the Resource Scheduler relative to the FIFO scheduler (best/average/worst), homogeneous vs. heterogeneous environment]
[Chart: total job time percentage slowdown of the Resource Scheduler relative to the FIFO scheduler (best/average/worst), homogeneous vs. heterogeneous environment]
Conclusion
• Resource-based map task processing time estimation is satisfactory
• The Resource Scheduler did not manage to outperform the FIFO scheduler in the resource-homogeneous environment, or in most cases of the resource-heterogeneous environment, due to the extra concurrent reduce tasks
• However, we verified that the Resource Scheduler is indeed resource-aware: it performs better when moved from a resource-homogeneous environment to a resource-heterogeneous one:
  – A smaller percentage slowdown compared to FIFO in all cases and all measured dimensions
  – An observed speedup compared to FIFO in the worst cases, due to I/O-biasing scheduling during the reduce stage
Recommendations for Future Work
• Evaluation
  – Heavier workload & busy cluster
    • Observe overhead
    • Benchmark performance
• Scheduling policy
  – Map tasks
    • Highest Response Ratio Next (HRRN), with priority (see the sketch after this slide):

$$priority = \frac{t_{estimated} + t_{waiting}}{t_{estimated}} = 1 + \frac{t_{waiting}}{t_{estimated}}$$

  – Reduce tasks
    • CPU biasing for CPU-intensive jobs
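A one-line sketch of the HRRN priority above; the class and method names are ours. Short estimated jobs start with an advantage, as under SJF, while a growing wait steadily promotes long jobs and prevents starvation.

class HrrnPolicy {
  // HRRN priority: 1 + t_waiting / t_estimated; schedule the highest value next.
  static double priority(double tEstimatedMs, double tWaitingMs) {
    return 1.0 + tWaitingMs / tEstimatedMs;
  }
}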