NEPTUNE: Scheduling Suspendable Tasks for Unified Stream/Batch Applications
SoCC, Santa Cruz, California, November 2019
Panagiotis Garefalakis (Imperial College London), Konstantinos Karanasos (Microsoft), Peter Pietzuch (Imperial College London)
Unified application example
Panagiotis Garefalakis - Imperial College London 2
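[Figure: a unified stream/batch application. A training (batch) job iterates over historical data to produce a trained model; an inference (stream) job applies the trained model to real-time data and returns low-latency responses.]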
Evolution of analytics frameworks
[Figure: timeline with ticks at 2010, 2014, and 2018 showing batch frameworks, stream frameworks, frameworks with hybrid stream/batch applications, and unified stream/batch frameworks such as Structured Streaming.]
Stream/Batch application requirements
> Latency: execute the inference job with minimum delay
> Throughput: batch jobs should not be compromised
> Efficiency: achieve high cluster resource utilization
Challenge: schedule stream/batch jobs to satisfy their diverse requirements
Stream/Batch application scheduling
[Figure: the application code submits the app to the driver, whose context runs each job through the DAG scheduler. The inference (stream) job has two stages of 2 tasks each, every task taking time T. The training (batch) job has a first stage of 4 tasks taking 3T each and a second stage of 3 tasks taking T each.]
> Static allocation: dedicate resources to each job
Resources cannot be shared across jobs
[Figure: cores over time (ticks at 2T, 4T, 6T, 8T) for executor 1 and executor 2, each dedicated to one job; idle cores are marked as wasted resources.]
> FIFO: first job runs to completion
Long batch jobs increase stream job latency
[Figure: cores over time (ticks at 2T, 4T, 6T, 8T) on shared executors; the inference (stream) job's tasks run only after the training (batch) job's tasks.]
> FAIR: weighted sharing of resources across jobs
Better packing with non-optimal latency
[Figure: cores over time (ticks at 2T, 4T, 6T, 8T) on shared executors; inference (stream) tasks are interleaved with training (batch) tasks but still incur queuing.]
> KILL: avoid queuing by preempting batch tasks
Better latency at the expense of extra work
[Figure: cores over time (ticks at 2T, 4T, 6T, 8T) on shared executors; training (batch) tasks killed to make room for inference (stream) tasks are later rerun from the start.]
> NEPTUNE: minimize queuing and wasted work!
[Figure: cores over time (ticks at 2T, 4T, 6T, 8T) on shared executors; training (batch) tasks are suspended while inference (stream) tasks run and then resume their remaining work.]
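The difference between the FIFO, KILL, and NEPTUNE timelines can be checked with a little arithmetic. The sketch below is a toy model and is not taken from the slides: it assumes a single core, one batch task of length 3T that starts at time 0, and one stream task of length T that arrives at time T. The `simulate` function, the `Outcome` fields, and the policy names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    stream_latency: int  # stream finish time minus stream arrival time, in units of T
    batch_finish: int    # time at which the batch task completes, in units of T
    wasted_work: int     # batch work that has to be redone, in units of T


def simulate(policy: str, batch_len: int = 3, stream_len: int = 1,
             arrival: int = 1) -> Outcome:
    """Toy single-core model: the batch task starts at 0, the stream task arrives later."""
    if policy == "fifo":
        # The stream task queues behind the batch task.
        stream_start = max(arrival, batch_len)
        return Outcome(stream_start + stream_len - arrival, batch_len, 0)

    done = min(arrival, batch_len)      # batch progress when the stream task arrives
    stream_end = arrival + stream_len   # the stream task runs immediately
    if done == batch_len:
        # The batch task had already finished, so nothing is preempted.
        return Outcome(stream_len, batch_len, 0)
    if policy == "kill":
        # Progress is lost: the batch task restarts from scratch.
        return Outcome(stream_len, stream_end + batch_len, done)
    if policy == "suspend":
        # Progress is kept: only the remaining work runs after the stream task.
        return Outcome(stream_len, stream_end + (batch_len - done), 0)
    raise ValueError(f"unknown policy: {policy}")


for name in ("fifo", "kill", "suspend"):
    print(name, simulate(name))
```

Under these assumptions FIFO gives a stream latency of 3T, killing reduces it to T but repeats T of batch work, and suspension keeps the latency at T without repeating any work.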
Challenges
> How to minimize queuing for latency-sensitive jobs and wasted work?
  Implement suspendable tasks
> How to natively support stream/batch applications?
  Provide a unified execution framework
> How to satisfy different stream/batch application requirements and high-level objectives?
  Introduce custom scheduling policies
NEPTUNE: execution framework for Stream/Batch applications
Support suspendable tasks
Introduce pluggable scheduling policies
Unified execution framework on top of Structured Streaming
Typical tasks
[Figure: the executor stack during a task run, holding the task's function, context, iterator, and value.]
> Tasks: apply a function to a partition of data
> Subroutines that run in an executor to completion
> Preemption problem:
  > Loss of progress (kill)
  > Unpredictable preemption times (checkpointing)
Suspendable tasks
[Figure: the task's function, context, and iterator live on a separate coroutine stack; control passes between the executor and the task through call and yield.]
> Idea: use coroutines
> Separate stacks to store task state
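The coroutine idea can be illustrated with Python generators, which serve here only as a stand-in for coroutines and are an assumption of this sketch, not the mechanism used by NEPTUNE. The task yields after every record, so the executor can stop driving it at a yield point and resume it later with its progress intact. The names `suspendable_task` and `run_until_suspended` are hypothetical.

```python
from typing import Callable, Generator, Iterable, List, Optional, Tuple


def suspendable_task(func: Callable[[int], int],
                     partition: Iterable[int]) -> Generator[None, None, List[int]]:
    """Apply func to every record, yielding control after each one."""
    results: List[int] = []
    for record in partition:
        results.append(func(record))
        yield  # yield point: the executor may suspend the task here
    return results


def run_until_suspended(task: Generator[None, None, List[int]],
                        should_suspend: Callable[[], bool]
                        ) -> Tuple[bool, Optional[List[int]]]:
    """Drive a task until it finishes or should_suspend() returns True.

    Returns (finished, result); result is None while the task is only suspended.
    """
    while True:
        try:
            next(task)
        except StopIteration as stop:
            return True, stop.value
        if should_suspend():
            return False, None


# A batch task is suspended after two records, a stream task runs,
# and the batch task then resumes from where it stopped.
batch = suspendable_task(lambda x: x * x, [1, 2, 3, 4])
signals = iter([False, True])
batch_finished, _ = run_until_suspended(batch, lambda: next(signals))

stream = suspendable_task(lambda x: x + 1, [10])
stream_finished, stream_result = run_until_suspended(stream, lambda: False)

batch_finished_later, batch_result = run_until_suspended(batch, lambda: False)
```

Because the generator keeps its local variables between calls, the records processed before suspension are not recomputed on resume, which is the property that killing a task lacks.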
> Idea: centralized scheduler with pluggable policies
> Problem: not just assign but also suspend and resume
[Figure: an application with low- and high-priority jobs passes through the incrementalizer and optimizer to the DAG scheduler. The task scheduler consults the scheduling policy and executor metrics to launch tasks on executors, suspending a running low-priority task (shown as paused) so that a high-priority task can run.]
Scheduling policies
> Idea: policies trigger task suspension and resumption
  > Guarantee that stream tasks bypass batch tasks
  > Satisfy higher-level objectives, e.g. balance cluster load
  > Avoid starvation by suspending a task at most a bounded number of times
> Load-balancing (LB): takes into account executors’ memory conditions and equalizes the number of tasks per node
> Locality- and memory-aware (LMA): respects task locality preferences in addition to load balancing
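A minimal sketch of how a load-balancing policy with a suspension cap might place a stream task follows. It is an illustration under stated assumptions, not NEPTUNE's implementation: the `Task` and `Executor` classes, the `place_stream_task` function, the `MAX_SUSPENSIONS` value, and the `min_free_memory_mb` threshold are all hypothetical, and locality preferences (the LMA policy) are not modeled.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_SUSPENSIONS = 2  # hypothetical cap used to avoid starving batch tasks


@dataclass
class Task:
    name: str
    is_stream: bool
    suspensions: int = 0


@dataclass
class Executor:
    name: str
    cores: int
    free_memory_mb: int
    running: List[Task] = field(default_factory=list)
    paused: List[Task] = field(default_factory=list)


def place_stream_task(task: Task, executors: List[Executor],
                      min_free_memory_mb: int = 512) -> Optional[Executor]:
    """Place a stream task, suspending a batch task if no core is free."""
    # Load balancing: skip executors that are low on memory, least loaded first.
    candidates = sorted(
        (e for e in executors if e.free_memory_mb >= min_free_memory_mb),
        key=lambda e: len(e.running),
    )
    # First choice: an executor with a free core, so nothing is suspended.
    for ex in candidates:
        if len(ex.running) < ex.cores:
            ex.running.append(task)
            return ex
    # Otherwise suspend a batch task that has not reached the suspension cap.
    for ex in candidates:
        victim = next(
            (t for t in ex.running
             if not t.is_stream and t.suspensions < MAX_SUSPENSIONS),
            None,
        )
        if victim is not None:
            ex.running.remove(victim)
            victim.suspensions += 1
            ex.paused.append(victim)
            ex.running.append(task)
            return ex
    return None  # no eligible executor: the stream task has to wait
```

The cap means a batch task that has already been suspended `MAX_SUSPENSIONS` times is left running, so a steady flow of stream tasks cannot postpone it indefinitely.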
Implementation
> Built as an extension to Apache Spark 2.4.0 (https://github.com/lsds/Neptune)
> Ported all ResultTask and ShuffleMapTask functionality across programming interfaces to coroutines
> Extended Spark’s DAG Scheduler to allow job stages with different requirements (priorities)
> Added Executor performance metrics to the heartbeat mechanism