HFSP: Size-based Scheduling for Hadoop
Mario Pastorelli∗ Antonio Barbuzzi∗ Matteo Dell’Amico∗
Damiano Carra† Pietro Michiardi∗
∗EURECOM, France
†University of Verona, Italy
IEEE BigData 2013
Mario Pastorelli et al. (EURECOM) HFSP: Size-based Scheduling for Hadoop IEEE BigData 2013 1 / 15
Why a new scheduler?
- Focus on short system response times
  - heterogeneous workloads [VLDB12, VLDB13, SOCC13]
  - big differences in job sizes: data exploration, preliminary analyses, algorithm tuning, orchestration jobs, ...
- Current schedulers need manual setup
  - fine-tuning of the scheduler parameters
  - configuration of pools of jobs
  - complex, error-prone, and difficult to adapt to workload/cluster changes
Size-based schedulers
Size-based schedulers are more efficient than other schedulers:
- job priority is based on the job size
- resources are focused on a few jobs instead of being split among many jobs
- ... but the job size is required

MapReduce is suitable for size-based scheduling:
- we don't have the job size, but we do have the time to estimate it
- no perfect estimation is required ...
- ... as long as jobs with very different sizes are sorted correctly
Size-based schedulers: example
Job     Arrival Time    Size
job1    0s              30s
job2    10s             10s
job3    15s             10s
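To see what a size-based scheduler does with jobs like these, here is a minimal single-machine SRPT (Shortest Remaining Processing Time) simulation of the example above. This is an illustrative sketch, not HFSP itself: the two small jobs finish quickly at a modest cost to the large one.

```python
def srpt_sojourn_times(jobs):
    """Simulate preemptive SRPT on one machine.
    jobs: list of (name, arrival, size). Returns {name: sojourn_time}."""
    remaining = {}                               # name -> remaining work
    done = {}                                    # name -> completion time
    pending = sorted(jobs, key=lambda j: j[1])   # jobs ordered by arrival
    t = 0.0
    i = 0
    while len(done) < len(jobs):
        # admit every job that has arrived by time t
        while i < len(pending) and pending[i][1] <= t:
            name, _, size = pending[i]
            remaining[name] = size
            i += 1
        if not remaining:                        # idle until the next arrival
            t = pending[i][1]
            continue
        # run the job with the shortest remaining time,
        # until it finishes or the next job arrives (preemption point)
        name = min(remaining, key=remaining.get)
        next_arrival = pending[i][1] if i < len(pending) else float("inf")
        run = min(remaining[name], next_arrival - t)
        t += run
        remaining[name] -= run
        if remaining[name] == 0:
            done[name] = t
            del remaining[name]
    arrivals = {name: arr for name, arr, _ in jobs}
    return {name: done[name] - arrivals[name] for name in done}

jobs = [("job1", 0, 30), ("job2", 10, 10), ("job3", 15, 10)]
print(srpt_sojourn_times(jobs))
# {'job2': 10.0, 'job3': 15.0, 'job1': 50.0}
```

With these three jobs, SRPT yields sojourn times of 10 s, 15 s and 50 s (mean 25 s), versus 30 s, 30 s and 35 s (mean about 31.7 s) under FIFO. Pure SRPT, however, can starve a large job under a stream of small ones, which is exactly what HFSP addresses next.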
Hadoop Fair Sojourn Protocol
Like SRPT, HFSP aims to be efficient, but it avoids starvation.
How: Shortest Remaining Virtual Time first (SRVT)
- each job has a virtual size based on the real one
- the virtual size decreases with time
- jobs are scheduled by ascending virtual size
Hadoop Fair Sojourn Protocol: challenges
Job size estimation
Virtual size and aging
Task scheduling policy
Job size estimation (1/2)
Two ways to estimate a job size:
- Offline: based on the information available a priori (number of tasks, block size, past history, ...):
  - available since job submission
  - not very precise
- Online: based on the performance of a subset of tasks:
  - needs time for training
  - more precise

We need both:
- offline estimation for the initial size, because jobs need a size from the moment they are submitted
- online estimation because it is more precise: when it completes, the job size is updated
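The two-stage estimate above can be sketched as follows. Class and field names are illustrative, not the actual HFSP code: start from a coarse offline guess, then replace it with the online estimate once the training tasks finish.

```python
class JobSizeEstimate:
    """Illustrative two-stage size estimate: offline first, online later."""

    def __init__(self, num_tasks, avg_task_time_guess):
        self.num_tasks = num_tasks
        # offline: a priori information only (e.g. task count, past history)
        self.size = num_tasks * avg_task_time_guess
        self.online_done = False

    def update_online(self, sampled_task_times):
        # online: mean duration of a few executed tasks, scaled to the job
        mean = sum(sampled_task_times) / len(sampled_task_times)
        self.size = mean * self.num_tasks
        self.online_done = True

est = JobSizeEstimate(num_tasks=100, avg_task_time_guess=8.0)
print(est.size)                       # 800.0  (offline: available immediately)
est.update_online([12.0, 10.0, 14.0])
print(est.size)                       # 1200.0 (online: more precise)
```

The offline value lets the scheduler rank the job from the instant it is submitted; the online update corrects that ranking once real measurements exist.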
Job size estimation (2/2)
Implementation details:
- Online estimation is done while the job progresses, so no work is wasted
- Estimation technique: first-order statistics are good enough
- The Map and Reduce phases of a job are treated as independent
Further details in the paper . . .
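As a hedged illustration of those two points, one could estimate each phase separately with a first-order statistic: the sample mean of the tasks executed so far, scaled to the phase's task count. The exact formula used in the paper may differ.

```python
def phase_sizes(map_times, num_maps, reduce_times, num_reduces):
    """Estimate Map and Reduce phase sizes independently from the mean
    duration of the sampled tasks of each phase (illustrative formula)."""
    def phase(sampled_times, num_tasks):
        return sum(sampled_times) / len(sampled_times) * num_tasks

    return phase(map_times, num_maps), phase(reduce_times, num_reduces)

# e.g. 2 sampled map tasks out of 100, 1 sampled reduce task out of 10
print(phase_sizes([2.0, 4.0], 100, [10.0], 10))  # (300.0, 100.0)
```

Treating the phases independently matters because map and reduce tasks of the same job can have very different durations, so a single per-job statistic would blur the estimate.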
Virtual size and aging
Like SRPT, HFSP wants to be efficient but it avoids starvation
How:
- Each job has a “virtual” size
- A “virtual” Fair Scheduler lets each job make virtual progress
- We use the virtual job sizes to take scheduling decisions in the real cluster

→ priority to small jobs
→ every job eventually gets small, hence no starvation
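A toy sketch of why this avoids starvation (illustrative only, not the actual implementation): assume the simulated Fair Scheduler grants all queued jobs the same rate of virtual progress, and the real cluster always serves the job with the smallest virtual remaining size. A big job then keeps shrinking virtually while it waits, so it cannot be postponed forever.

```python
def virtual_completion_order(virtual_sizes):
    """Order in which jobs finish in the simulated fair scheduler, where all
    jobs accrue virtual progress at the same rate. Under SRVT this is also
    the order in which the real cluster prioritizes them."""
    remaining = dict(virtual_sizes)
    order = []
    while remaining:
        # the job with the smallest virtual remaining size finishes
        # (virtually) first ...
        nxt = min(remaining, key=remaining.get)
        quantum = remaining[nxt]
        order.append(nxt)
        del remaining[nxt]
        # ... and every other job has aged by the same virtual quantum
        for job in remaining:
            remaining[job] -= quantum
    return order

print(virtual_completion_order({"big": 30, "small": 10, "medium": 20}))
# ['small', 'medium', 'big']
```

Small jobs go first, yet the big job's virtual remaining size keeps dropping, so it is guaranteed to reach the front of the queue eventually.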
Task scheduling policy
When a task slot becomes free:
- schedule a task for online estimation, if any
- otherwise, schedule a task from the highest-priority job
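The policy on this slide can be sketched as follows; the data structures are assumed for illustration (HFSP's real implementation plugs into Hadoop's scheduler interface).

```python
def next_task(training_tasks, jobs):
    """Pick the task to run when a slot frees up.
    training_tasks: queue of tasks needed for online size estimation.
    jobs: list of (virtual_size, pending_tasks) pairs.
    Training tasks go first; otherwise serve the job with the smallest
    virtual size (i.e. the highest priority under SRVT)."""
    if training_tasks:
        return training_tasks.pop(0)
    runnable = [(vsize, tasks) for vsize, tasks in jobs if tasks]
    if not runnable:
        return None                      # nothing to run
    _, tasks = min(runnable, key=lambda jt: jt[0])
    return tasks.pop(0)

training = ["est-1"]
jobs = [(30, ["a1", "a2"]), (10, ["b1"])]
print(next_task(training, jobs))  # est-1  (estimation first)
print(next_task(training, jobs))  # b1     (smallest virtual size)
print(next_task(training, jobs))  # a1
```

Giving estimation tasks priority keeps the size estimates fresh, which in turn keeps the SRVT ordering accurate.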