Page 1
Google Confidential and Proprietary
Omega: flexible, scalable schedulers for large compute clusters
Malte Schwarzkopf (University of Cambridge Computer Lab)
Andy Konwinski (UC Berkeley)
Michael Abd-El-Malek (Google)
John Wilkes (Google)
original: April 17th, 2013
this version: Oct 2013
EuroSys 2013
Page 2
http://www.google.com/about/datacenters/inside/locations/
We own and operate data centers around the world
Page 6
the scheduling problem
[Figure labels: Job, Tasks, Machines]
Page 7
trends observed
● Diverse workloads
● Increasing cluster sizes
● Growing job arrival rates
Page 8
why is this a problem?
[Figure: arriving jobs and tasks (1,000s), cluster scheduler (scheduling logic), cluster machines (10,000s)]
60+ seconds!
Page 9
why is this a problem?
[Figure: arriving jobs and tasks (1,000s), cluster scheduler (scheduling logic), cluster machines (10,000s)]
Increasing complexity!
Hence: Break up into independent schedulers.
But: How do we arbitrate resources between schedulers?
Page 10
existing approaches
static partitioning
[Figure: schedulers S0, S1, S2]
● poor utilization
● inflexible
monolithic scheduler
[Figure: a single SCHEDULER]
● hard to diversify
● code growth
● scalability bottleneck
Page 11
existing approaches
two-level, e.g. UCB Mesos [NSDI 2011]
[Figure: schedulers S0, S1, S2 above a RESOURCE MANAGER]
● hoarding
● information hiding
shared-state
[Figure: schedulers S0, S1, S2 above the CLUSTER STATE]
Page 12
how does omega work?
[Figure: schedulers S0 and S1]
Page 13
how does omega work?
[Figure: schedulers S0 and S1]
Page 14
how does omega work?
[Figure: schedulers S0 and S1]
Page 15
how does omega work?
[Figure: schedulers S0 and S1]
Conflict!
Page 16
how does omega work?
[Figure: schedulers S0 and S1]
failure! success!
Page 17
overview
1) intro & motivation
2) workload characterization
3) comparison of approaches
4) trace-based simulation
5) flexibility case study
Page 18
workload: batch/service split
[Figure legend: Batch, Service]
Page 19
workload: batch/service split
[Figure: batch/service split per cluster. Jobs/tasks: counts. CPU/RAM: resource seconds (i.e. resource × job runtime in sec.)]
Cluster A: medium size, medium utilization
Cluster B: large size, medium utilization
Cluster C: medium size (12k mach.), high utilization, public trace
Page 20
workload: batch/service split
[Figure: batch/service split per cluster. Jobs/tasks: counts. CPU/RAM: resource seconds (i.e. resource × job runtime in sec.)]
Cluster A: medium size, medium utilization
Cluster B: large size, medium utilization
Cluster C: medium size, high utilization, public trace
TAKEAWAY
Most jobs are batch, but most resources are consumed by service jobs.
Page 21
workload: job runtime distributions
[Figure: y-axis: fraction of jobs running for less than X; series: Batch, Service]
Page 22
workload: inter-arrival time distributions
[Figure: y-axis: fraction of inter-arrival gaps less than X; series: Batch, Service]
Page 23
workload: batch/service split
                               Batch jobs    Service jobs
80th %ile runtime              12-20 min.    29 days
80th %ile inter-arrival time   4-7 sec.      2-15 min.
Page 24
overview
1) intro & motivation
2) workload characterization
3) comparison of approaches
4) trace-based simulation
5) flexibility case study
Page 25
methodology: simulation
simulation using empirical workload parameter distributions
Code available:
http://code.google.com/p/cluster-scheduler-simulator
Page 26
parameters
Scheduler decision time
decision time = tjob + ttask × n
tjob: per-job (usually 0.1s per job)
ttask: per-task (usually 0.005s per task)
n: num. tasks
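The parameters on this slide suggest a linear cost model: a fixed per-job cost plus a per-task cost multiplied by the number of tasks. The following is a minimal sketch under that assumption; the function name `decision_time` and its argument names are illustrative and not taken from the simulator code. The defaults are the "usual" values quoted on the slide.

```python
def decision_time(num_tasks, t_job=0.1, t_task=0.005):
    """Seconds a scheduler spends deciding on one job.

    Assumes a linear model: a fixed per-job overhead (t_job) plus a
    per-task cost (t_task) for each of the job's num_tasks tasks.
    """
    if num_tasks < 0:
        raise ValueError("num_tasks must be non-negative")
    return t_job + t_task * num_tasks


if __name__ == "__main__":
    # Compare a small job with progressively larger ones.
    for n in (10, 1000, 12000):
        print(f"{n:>6} tasks -> {decision_time(n):.2f} s")
```

With the default values, a 1,000-task job costs about 5.1 seconds of scheduler time and a 12,000-task job about 60 seconds, so under this model large jobs dominate the time the scheduler is busy.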
Page 27
scheduling policies
Why might scheduling take 60 seconds?
● Large jobs (tens of thousands of tasks)
● Optimization algorithms (constraints, bin packing with knock-on preemption)
● Picky jobs in a full cluster
● Monte Carlo simulations (fault tolerance)
Page 28
Open issues: failure tolerance
Topology-aware scheduling for concurrent outages
● a fault tree
Page 29
Open issues: failure tolerance
Topology-aware scheduling for concurrent outages
● a fault tree
● a fault DAG
Page 30
Open issues: failure tolerance
Topology-aware scheduling for concurrent outages
● a fault tree
● a partially redundant fault DAG
Page 31
Open issues: failure tolerance
● real fault, or lost touch?
● time to detect vs. false positives?
● multiple information sources for correlated failures?
Page 32
Experiment 1:
How does the shared-state design compare with other architectures?
Experiment details:
● all clusters, 7 simulated days
● 2 schedulers
● varying Service scheduler
Page 33
monolithic, uniform decision time (single logic)
[Figure axes: tjob for ALL jobs, ttask for ALL jobs, scheduler busyness]
scheduler busyness = time spent scheduling / total time
blue => all jobs were scheduled
red => unscheduled jobs remained
Page 34
monolithic, fast-path batch decision time
[Figure axes: tjob for service jobs, ttask for service jobs, scheduler busyness]
head-of-line blocking
Page 35
mesos v0.9 (of May 2012)
[Figure axes: tjob for service jobs, ttask for service jobs, scheduler busyness]
Ooops...
Page 36
mesos
[Figure: schedulers S1 and S2 above a RESOURCE MANAGER]
1. Green receives offer of all available resources.
2. Blue's task finishes.
3. Blue receives tiny offer.
4. Blue cannot use it.
[repeat many times]
5. Green finishes scheduling.
6. Blue receives large offer. By now, it has given up.
Page 37
omega, no optimizations
[Figure axes: tjob for service jobs, ttask for service jobs, scheduler busyness]
Page 38
omega, optimized
[Figure axes: tjob for service jobs, ttask for service jobs, scheduler busyness]
Page 39
omega, optimized
[Figure panels: Monolithic, Mesos, Omega]
TAKEAWAY
The Omega shared-state model performs as well as a (complex) monolithic multi-path scheduler.
Page 40
Experiment 2:
Does the shared-state design scale to many schedulers?
Experiment details:
● cluster B, 7 simulated days
● 2 schedulers
● varying job arrival rate and number of schedulers
Page 41
scaling to many schedulers
Page 42
overview
1) intro & motivation
2) workload characterization
3) comparison of approaches
4) trace-based simulation
5) flexibility case study
Page 43
simulator comparison
                       lightweight simulator     high-fidelity simulator
machines               homogeneous               real-world
job parameters         empirical distribution    workload trace
constraints            not supported             supported
scheduling algorithm   random first fit          Google algorithm
runtime                fast (24h ≃ 5min)         slow (24h ≃ 2h)
Page 44
Experiment 3:
How much scheduler interference do we see with real Google workloads?
Experiment details:
● cluster C, 29 days
● 2 schedulers, non-uniform decision time
● varying Service scheduler
Page 45
conflict fraction
conflict fraction = num. conflicts / total num. transactions
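The two metrics used in these experiments, conflict fraction (conflicts over total transactions) and scheduler busyness (time spent scheduling over total time), can be tracked with a small counter object. This is a sketch only: the class `SchedulerStats` and its method names are hypothetical, and it assumes that time spent on a conflicted attempt still counts as busy time.

```python
from dataclasses import dataclass


@dataclass
class SchedulerStats:
    """Running counters for one scheduler (hypothetical helper)."""
    transactions: int = 0
    conflicts: int = 0
    busy_seconds: float = 0.0

    def record(self, decision_seconds, conflicted):
        """Account for one attempted transaction."""
        self.transactions += 1
        # Assumption: work that ends in a conflict is still busy time.
        self.busy_seconds += decision_seconds
        if conflicted:
            self.conflicts += 1

    def conflict_fraction(self):
        """num. conflicts / total num. transactions."""
        if self.transactions == 0:
            return 0.0
        return self.conflicts / self.transactions

    def busyness(self, total_seconds):
        """time spent scheduling / total time."""
        if total_seconds <= 0:
            raise ValueError("total_seconds must be positive")
        return self.busy_seconds / total_seconds


if __name__ == "__main__":
    stats = SchedulerStats()
    stats.record(2.0, conflicted=False)
    stats.record(2.0, conflicted=True)   # wasted attempt
    stats.record(2.0, conflicted=False)  # successful retry
    print(stats.conflict_fraction(), stats.busyness(60.0))
```

Because a conflicted attempt has to be retried, a higher conflict fraction shows up as extra busyness, which is the overhead due to conflicts that the later slides plot.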
Page 46
scheduler busyness
overhead due to conflicts
[Figure: y-axis: scheduler busyness]
Page 47
scheduler busyness
overhead due to conflicts
TAKEAWAY
Interference is higher for real-world settings.
Page 48
optimizations
1. Fine-grained conflict detection
2. Incremental commits
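One way to read the two optimizations is as switches on the commit step. The sketch below makes several assumptions that are not on the slide: machines are modelled only by free CPUs and a version counter, "coarse" detection treats any version change since the snapshot as a conflict, "fine-grained" detection rejects a claim only if it no longer fits, and an "incremental" commit applies the non-conflicting claims instead of failing the whole transaction. The function `try_commit` and its parameters are hypothetical.

```python
def try_commit(free, versions, claims, seen, *, fine_grained, incremental):
    """Apply a scheduler's claims to shared state.

    free:     free CPUs per machine (mutated on success)
    versions: per-machine change counters (mutated on success)
    claims:   list of (machine_index, cpus) the scheduler wants
    seen:     per-machine versions from the scheduler's snapshot
    Returns (accepted, rejected) lists of claims.
    """
    trial = list(free)  # scratch copy so several claims on one machine add up
    accepted, rejected = [], []
    for idx, cpus in claims:
        no_longer_fits = cpus > trial[idx]
        changed = versions[idx] != seen[idx]
        # Coarse detection: any change to the machine is a conflict.
        # Fine-grained detection: only a claim that no longer fits is.
        conflict = no_longer_fits if fine_grained else (changed or no_longer_fits)
        if conflict:
            rejected.append((idx, cpus))
        else:
            trial[idx] -= cpus
            accepted.append((idx, cpus))

    if rejected and not incremental:
        # All-or-nothing commit: one conflict fails the whole transaction.
        return [], list(claims)

    for idx, cpus in accepted:
        free[idx] -= cpus
        versions[idx] += 1
    return accepted, rejected


if __name__ == "__main__":
    # Machine 0 was changed by another scheduler after the snapshot was taken.
    for fg, inc in ((False, False), (False, True), (True, False)):
        free, versions = [4.0, 4.0], [1, 0]
        result = try_commit(free, versions, [(0, 2.0), (1, 2.0)], [0, 0],
                            fine_grained=fg, incremental=inc)
        print(fg, inc, result, free)
```

In this toy setup, the coarse all-or-nothing commit rejects both claims, the incremental commit salvages the claim on the untouched machine, and fine-grained detection accepts both because the changed machine still has room.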
Page 49
Experiment 4:
How do the optimizations affect performance?
Experiment details:
● cluster C, 29 days
● 2 schedulers, non-uniform decision time
● varying Service scheduler
Page 50
impact on scheduler utilization
Page 51
practical implications – scheduler utilization
overhead due to conflicts
[Figure: y-axis: scheduler busyness]
Page 52
practical implications – scheduler utilization
overhead due to conflicts
TAKEAWAY
We can make simple improvements that significantly improve scalability.
Page 53
Case study
MapReduce scheduler with opportunistic extra resources
Page 54
workers in MR jobs
[Figure: x-axis: number of workers [log10]; y-axis: count of jobs with X workers; snapshot over 29 days; value labels in figure: 3, 5, 11, 50, 100, 200, 450, 1000, 8000]
Page 55
case study: a MapReduce scheduler
cluster C, 29 days
[Figure: x-axis: relative speedup [log10]; y-axis: fraction of MR jobs with speedup < X]
60% of MapReduces
3-4x speedup!
Page 56
case study: a MapReduce scheduler
cluster C, 29 days
[Figure: x-axis: relative speedup [log10]; y-axis: fraction of MR jobs with speedup < X]
60% of MapReduces
3-4x speedup!
TAKEAWAY
The Omega approach gives us the flexibility to easily support custom policies.
Page 57
conclusion
TAKEAWAYS
Flexibility and scale require parallelism,
parallel scheduling works if you do it right, and
using shared state is the way to do it right!
Page 58
BACKUP SLIDES
Page 59
YARN
● centralized resource-allocator (not fault-tolerant)
● per-job “application manager” (MapReduce calls it a “controller”)
Apache Hadoop YARN: Yet Another Resource Negotiator. ACM SoCC’13
Page 60
methodology: simulation
[Figure labels: Event-driven simulator; Workload (Batch, Service, MapReduce, ...) from empirical distribution; Experiment configuration; Initial cluster state; Cluster state]
Code available:
http://code.google.com/p/cluster-scheduler-simulator
Page 61
workload: job runtime distributions
[Figure: y-axis: fraction of jobs running for less than X; series: Batch, Service]
TAKEAWAY
Service jobs, once scheduled, run for much longer than batch jobs do.
Page 62
workload: inter-arrival time distributions
[Figure: y-axis: fraction of inter-arrival gaps less than X; series: Batch, Service]
TAKEAWAY
Service jobs arrive much less frequently than batch jobs do.
Page 63
the omega approach
Shared state, optimistic concurrency
[Figure: schedulers S0, S1, S2 above the CLUSTER STATE]
● Deltas against shared state
● Easy to develop & maintain
● Heterogeneous schedulers OK
● No explicit coordination required
● Interference resolution (not prevention)
● Scales well
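The combination of shared state and optimistic concurrency listed above can be illustrated with a toy model. This is a sketch, not Omega's implementation: the classes `Machine` and `CellState`, the per-machine version counter, and the CPU-only resource model are all assumptions made for the example. Each scheduler plans against a private snapshot and then tries to commit its delta; a commit fails if any machine it touches has changed since the snapshot was taken.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    free_cpus: float
    version: int = 0  # bumped on every committed change (assumed mechanism)


class CellState:
    """Shared cluster state that all schedulers read and update."""

    def __init__(self, free_cpus_per_machine):
        self.machines = [Machine(c) for c in free_cpus_per_machine]

    def snapshot(self):
        """Private copy that a scheduler plans against without locking."""
        return [Machine(m.free_cpus, m.version) for m in self.machines]

    def commit(self, claims, snapshot):
        """Try to apply a delta: {machine_index: cpus_to_claim}.

        Returns True if applied. Returns False on a conflict, i.e. a claimed
        machine changed since the snapshot or no longer has enough free CPUs.
        """
        for idx, cpus in claims.items():
            machine = self.machines[idx]
            if machine.version != snapshot[idx].version or cpus > machine.free_cpus:
                return False
        for idx, cpus in claims.items():
            machine = self.machines[idx]
            machine.free_cpus -= cpus
            machine.version += 1
        return True


if __name__ == "__main__":
    cell = CellState([4.0, 4.0])
    snap0, snap1 = cell.snapshot(), cell.snapshot()  # S0 and S1 plan in parallel
    print("S0:", cell.commit({0: 2.0}, snap0))       # first writer wins
    print("S1:", cell.commit({0: 2.0}, snap1))       # conflict on machine 0
    print("S1 retry:", cell.commit({0: 2.0}, cell.snapshot()))
```

No scheduler ever waits for another: interference is detected at commit time and resolved by retrying against a fresh snapshot, rather than prevented up front.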
Page 64
impact on conflict fraction
Page 65
case study: a MapReduce scheduler
cluster A, 29 days
[Figure: x-axis: relative speedup [log10]; y-axis: fraction of MR jobs with speedup < X]
50% of MapReduces
4.5x speedup
Page 66
caveats, or when this won't work well
Possible problems...
● aggressive, systematically adverse workloads or schedulers
● small clusters with high overcommit
Deal with these using out-of-band or post-facto enforcement mechanisms.