Dec 26, 2015
Talk outline
• Why is dataflow so useful?
• What is Dryad?
• An engineering sweet spot
• Beyond Dryad
• Conclusions
Computation on large datasets
• Performance is mostly about efficient resource use
• Locality
  • Data placed correctly in the memory hierarchy
• Scheduling
  • Get enough work done before being interrupted
  • Decompose into independent batches
• Parallel computation
  • Control communication and synchronization
• Distributed computation
  • Writes must be explicitly shared
Computational model
• Vertices are independent
  • State and scheduling
• Dataflow is very powerful
  • Explicit batching and communication
[Figure: dataflow graph — processing vertices connected by channels, with inputs and outputs]
Why dataflow now?
• Collection-oriented programming model
  • Operations on collections of objects
  • Turn spurious (unordered) for into foreach
  • Not every for is a foreach
  • Aggregation (sum, count, max, etc.)
  • Grouping
  • Join, Zip
  • Iteration
• LINQ since ca. 2008; now Spark via Scala, Java
int SortKey(KeyValuePair<string,int> x)
{
    return x.count;
}

int SortKey(void* x)
{
    return ((KeyValuePair<string,int>*)x)->count;
}
Given some lines of text, find the most commonly occurring words.
1. Read the lines from a file
2. Split each line into its constituent words
3. Count how many times each word appears
4. Find the words with the highest counts
1. var lines = FS.ReadAsLines(inputFileName);
2. var words = lines.SelectMany(x => x.Split(' '));
3. var counts = words.CountInGroups();
4. var highest = counts.OrderByDescending(x => x.count).Take(10);
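The four steps above can be sketched in plain Python (a minimal illustration: `Counter` stands in for the slide's `CountInGroups`, and the function name `top_words` is hypothetical):

```python
from collections import Counter

def top_words(lines, k=10):
    # Step 2: split each line into words; step 3: count occurrences
    # of each word; step 4: take the k words with the highest counts.
    words = (w for line in lines for w in line.split(' ') if w)
    return Counter(words).most_common(k)

lines = ["red red blue", "blue blue blue yellow", "yellow yellow"]
print(top_words(lines, 2))  # → [('blue', 4), ('yellow', 3)]
```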
Well-chosen syntactic sugar
• Type inference
  Collection<KeyValuePair<string,int>>
• Lambda expressions
• Generics and extension methods
  FooCollection FooTake(FooCollection c, int count) { … }
[Figure: word-count dataflow — input words “red red blue blue blue blue yellow yellow yellow” yield counts red,2 blue,4 yellow,3]
Collection<T> Take(this Collection<T> c, int count) { … }
Collections compile to dataflow
• Each operator specifies a single data-parallel step
• Communication between steps is explicit
• Collections reference collections, not individual objects!
  • Communication under control of the system
  • Partition, pipeline, exchange automatically
• LINQ innovation: embedded user-defined functions
  var words = lines.SelectMany(x => x.Split(' '));
• Very expressive
  • Programmer ‘naturally’ writes pure functions
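The "exchange" step the system inserts between data-parallel operators can be sketched as a hash partitioner (a single-process illustration; `hash_exchange` and its parameters are hypothetical names, not a real Dryad API):

```python
def hash_exchange(partitions, key, n_out):
    # Route every record from the input partitions to an output
    # partition chosen by hashing its key, so all records with the
    # same key land in the same output partition.
    out = [[] for _ in range(n_out)]
    for part in partitions:
        for record in part:
            out[hash(key(record)) % n_out].append(record)
    return out

words = [["red", "blue"], ["blue", "yellow", "red"]]
shuffled = hash_exchange(words, key=lambda w: w, n_out=2)
# every copy of a given word is now in exactly one partition
```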
Distributed sorting
var sorted = set.OrderBy(x => x.key)
[Figure: distributed sort plan — sample the input set, compute a histogram, range-partition by key, sort each partition locally to produce the sorted output]
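The sample / histogram / range-partition / local-sort plan above can be sketched in one process (an illustrative sketch; the function name and sampling parameters are arbitrary choices, not the system's):

```python
import random

def distributed_sort(partitions, key=lambda x: x):
    # 1. Sample keys from every input partition.
    # 2. Pick splitter keys from the sorted sample (the "histogram").
    # 3. Route each record to a range partition by comparing with the
    #    splitters.
    # 4. Sort each range locally; concatenating the ranges then yields
    #    a globally sorted result.
    n = len(partitions)
    sample = [key(x) for part in partitions
              for x in random.sample(part, min(8, len(part)))]
    sample.sort()
    splitters = [sample[(i + 1) * len(sample) // n] for i in range(n - 1)]

    ranges = [[] for _ in range(n)]
    for part in partitions:
        for x in part:
            dest = sum(1 for s in splitters if key(x) >= s)
            ranges[dest].append(x)

    for r in ranges:
        r.sort(key=key)
    return [x for r in ranges for x in r]

data = [[5, 3, 9], [1, 8, 2], [7, 4, 6]]
print(distributed_sort(data))  # → [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The output is fully sorted whatever splitters the sample produces; sampling only affects how evenly work is balanced across the ranges.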
Quiet revolution in parallelism
• Programming model is more attractive
  • Simpler, more concise, readable, maintainable
• Program is easier to optimize
  • Programmer separates computation and communication
  • System can re-order, distribute, batch, etc.
Talk outline
• Why is dataflow so useful?
• What is Dryad?
• An engineering sweet spot
• Beyond Dryad
• Conclusions
What is Dryad?
• General-purpose DAG execution engine, ca. 2005
  • Cited as inspiration for e.g. Hyracks, Tez
• Engine behind Microsoft Cosmos/SCOPE
  • Initially MSN Search/Bing, now used throughout MSFT
• Core of research batch cluster environment, ca. 2009
  • DryadLINQ
  • Quincy scheduler
  • TidyFS
What Dryad does
• Abstracts cluster resources
  • Set of computers, network topology, etc.
• Recovers from transient failures
  • Rerun computations on machine or network fault
  • Speculatively execute duplicates of slow computations
• Schedules a local DAG of work at each vertex
Scheduling and fault tolerance
• DAG makes things easy
  • Schedule from source to sink in any order
  • Re-execute subgraph on failure
  • Execute “duplicates” for slow vertices
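The schedule-in-any-order, re-execute-on-failure behavior can be sketched over an in-memory DAG (the `dag`/`compute` interface here is hypothetical, not Dryad's actual API):

```python
import graphlib

def run_dag(dag, compute, max_retries=3):
    # `dag` maps each vertex to the list of vertices it reads from;
    # `compute(v, inputs)` may raise on a transient fault. Because a
    # vertex's inputs are the persisted outputs of upstream vertices,
    # a failed vertex can simply be run again.
    results = {}
    for v in graphlib.TopologicalSorter(dag).static_order():
        inputs = [results[u] for u in dag[v]]
        for attempt in range(max_retries):
            try:
                results[v] = compute(v, inputs)
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise  # persistent failure: give up
    return results

dag = {'src': [], 'a': ['src'], 'b': ['src'], 'sink': ['a', 'b']}
flaky = {'b': 1}  # vertex b fails once, then succeeds

def compute(v, inputs):
    if flaky.get(v, 0):
        flaky[v] -= 1
        raise RuntimeError("transient fault")
    return 1 + sum(inputs)

print(run_dag(dag, compute)['sink'])  # → 5
```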
Resources are virtualized
• Each graph vertex is a process
  • Writes outputs to disk (usually)
  • Reads inputs from upstream vertices’ output files
• Graph is generally larger than cluster RAM
  • 1TB partitioned input, 250MB part size, 4000 parts
• Cluster is shared
  • Don’t size the program for the exact cluster
  • Use whatever share of resources is available
Integrated system
• Collection-oriented programming model (LINQ)
• Partitioned file system (TidyFS)
  • Manages replication and distribution of large data
• Cluster scheduler (Quincy)
  • Jointly schedules multiple jobs at a time
  • Fine-grain multiplexing between jobs
  • Balances locality and fairness
• Monitoring and debugging (Artemis)
  • Within job and across jobs
Dryad Cluster Scheduling
[Figure: scheduler placing job vertices (R) on cluster machines]
[Figures: Quincy scheduling without preemption vs. with preemption]
Dryad features
• Well-tested at scales up to 15k cluster computers
  • In heavy production use for 8 years
• Dataflow graph is mutable at runtime
  • Repartition to avoid skew
  • Specialize matrices as dense or sparse
  • Harden fault tolerance
Talk outline
• Why is dataflow so useful?
• What is Dryad?
• An engineering sweet spot
• Beyond Dryad
• Conclusions
Stateless DAG dataflow
• MapReduce, Dryad, Spark, …
• Stateless vertex constraint hampers performance
  • Iteration and streaming overheads
• Why does this design keep repeating?
Software engineering
• Fault tolerance is well understood
  • E.g., Chandy-Lamport, rollback recovery, etc.
• Basic mechanism: checkpoint plus log
  • Stateless DAG: no checkpoint!
• Programming model “tricked” the user
  • All communication is on typed channels
  • Only channel data needs to be persisted
  • Fault tolerance comes without programmer effort
  • Even with UDFs
Talk outline
• Why is dataflow so useful?
• What is Dryad?
• An engineering sweet spot
• Beyond Dryad
• Conclusions
What about stateful dataflow?
• Naiad
  • Add state to vertices
  • Support streaming and iteration
• Opportunities
  • Much lower latency
  • Can model mutable state with dataflow
• Challenges
  • Scheduling
  • Coordination
  • Fault tolerance
[Figure: batch processing, stream processing, and graph processing, all expressed as timely dataflow]
Batching vs. streaming
• Batching (synchronous)
  • Requires coordination
  • Supports aggregation
• Streaming (asynchronous)
  • No coordination needed
  • Aggregation is difficult
Batch DAG execution
[Figure: central coordinator]
Streaming DAG execution
[Figure: inline coordination]
Batch iteration
[Figure: central coordinator]
Streaming iteration
[Figure]
Messages
[Figure: vertices B → C → D connected by edges]
B.SENDBY(edge, message, time)
C.ONRECV(edge, message, time)
Messages are delivered asynchronously
Notifications
[Figure: vertices B → C → D connected by edges]
D.NOTIFYAT(time)
D.ONNOTIFY(time)
Notifications support batching
C.SENDBY(_, _, time)
D.ONRECV(_, _, time)
ONNOTIFY(time) is delivered only once no more messages at time or earlier can arrive
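A vertex using these two callbacks can be sketched as follows (the OnRecv/OnNotify names follow the slide's API, but this single-process harness and the counting logic are illustrative assumptions):

```python
from collections import defaultdict

class CountingVertex:
    # Buffers per-timestamp counts in on_recv and releases them in
    # on_notify, which the runtime calls only when no messages at
    # `time` or earlier can still arrive -- so batching is safe.
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.emitted = []

    def on_recv(self, edge, message, time):
        self.counts[time][message] += 1  # just buffer; don't emit yet

    def on_notify(self, time):
        # All messages at `time` have been delivered: emit the batch.
        self.emitted.append((time, dict(self.counts.pop(time))))

v = CountingVertex()
for word in ["red", "blue", "red"]:
    v.on_recv(None, word, time=0)
v.on_notify(time=0)
print(v.emitted)  # → [(0, {'red': 2, 'blue': 1})]
```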
Coordination in timely dataflow
• Local scheduling with global progress tracking
• Coordination via a shared counter, not a scheduler
• Efficient, scalable implementation
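The shared-counter idea can be sketched as a per-timestamp count of outstanding messages (a hypothetical single-process version; the real implementation is distributed and more subtle):

```python
from collections import defaultdict

class ProgressTracker:
    # Senders increment a timestamp's counter when producing a message;
    # receivers decrement it on delivery. A timestamp is complete --
    # and notifications for it may fire -- once no messages at that
    # time or earlier remain outstanding.
    def __init__(self):
        self.outstanding = defaultdict(int)

    def sent(self, time):
        self.outstanding[time] += 1

    def received(self, time):
        self.outstanding[time] -= 1
        if self.outstanding[time] == 0:
            del self.outstanding[time]

    def complete(self, time):
        # True when no messages at `time` or earlier can still arrive.
        return all(t > time for t in self.outstanding)

pt = ProgressTracker()
pt.sent(0); pt.sent(0); pt.sent(1)
pt.received(0)
assert not pt.complete(0)  # one message at time 0 still in flight
pt.received(0)
assert pt.complete(0)      # time 0 done; time 1 still outstanding
assert not pt.complete(1)
```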
Interactive graph analysis
• Workload: 32K tweets/s, 10 queries/s
[Figure: query dataflow over inputs In, with joins (⋈) on #x, @y, z?, and a max aggregation]
[Figure: query latency (ms, log scale 1–1000) vs. experiment time (s, 30000–50000)]
• 32 servers: 8-core 2.1 GHz AMD Opteron, 16 GB RAM per server, Gigabit Ethernet
• Query latency — max: 140 ms; 99th percentile: 70 ms; median: 5.2 ms
Mutable state
• In batch DAG systems, collections are immutable
  • Functional definition in terms of the preceding subgraph
• Adding streaming or iteration introduces mutability
  • Collection varies as a function of epoch or loop iteration
Key-value store as dataflow
var lookup = data.join(query, d => d.key, q => q.key)
• Modeled random access with dataflow…
  • Add/remove key is a streaming update to data
  • Look up key is a streaming update to query
• High throughput requires batching
  • But that was true anyway, in general
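The modeling trick above can be sketched in a few lines (illustrative only: `kv_as_join` and its update format are invented names, standing in for the streaming join of the `data` and `query` collections):

```python
def kv_as_join(data_updates, queries):
    # The `data` stream carries ('add'|'remove', key, value) updates;
    # the `query` stream carries keys. Each lookup is answered by
    # joining the query key against the current contents of data.
    store = {}
    for op, key, value in data_updates:
        if op == 'add':
            store[key] = value      # add/overwrite key
        else:
            store.pop(key, None)    # remove key if present
    return [(k, store[k]) for k in queries if k in store]

updates = [('add', 'blue', 4), ('add', 'red', 2), ('remove', 'red', None)]
print(kv_as_join(updates, ['blue', 'red']))  # → [('blue', 4)]
```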
What can’t dataflow do?
• Programming model for mutable state?
  • Not as intuitive as functional collection manipulation
• Policies for placement are still primitive
  • Hash everything and hope
• Great research opportunities
  • Intersection of OS, network, runtime, language
Talk outline
• Why is dataflow so useful?
• What is Dryad?
• An engineering sweet spot
• Beyond Dryad
• Conclusions
Conclusions
• Dataflow is a great structuring principle
  • We know good programming models
  • We know how to write high-performance systems
• Dataflow is the status quo for batch processing
• Mutable state is the current research frontier
Apache 2.0 licensed source on GitHub
http://research.microsoft.com/en-us/um/siliconvalley/projects/BigDataDev/