Page 1
Dataflow: A Unified Model for Batch and Streaming Data Processing
Vadim Solovey, Google Developer Expert & [email protected]
Shahar Frank, Cloud Solutions [email protected]
Page 2
Goals:
Write interesting computations
Run in both batch & streaming
Use custom timestamps
Handle late data
https://commons.wikimedia.org/wiki/File:Globe_centered_in_the_Atlantic_Ocean_(green_and_grey_globe_scheme).svg
Page 3
Agenda
1 Data Shapes
2 Google’s Data Processing Story
3 The Dataflow Model
4 Demo: Google Cloud Dataflow
Page 7
...really, really big...
(Figure labels: Tuesday, Wednesday, Thursday)
Page 8
...maybe even infinitely big...
(Figure: timeline from 1:00 to 14:00)
Page 9
… with unknown delays.
(Figure: timeline from 8:00 to 14:00, with elements stamped 8:00 appearing at several points along it)
Page 10
Data Processing Tradeoffs
Completeness (1 + 1 = 2), Latency, Cost ($$$)
Page 11
Requirements: Billing Pipeline
(Figure: Completeness, Low Latency, and Low Cost, each rated on a scale from Important to Not Important)
Page 12
Requirements: Live Cost Estimate Pipeline
(Figure: Completeness, Low Latency, and Low Cost, each rated on a scale from Important to Not Important)
Page 13
Requirements: Abuse Detection Pipeline
(Figure: Completeness, Low Latency, and Low Cost, each rated on a scale from Important to Not Important)
Page 14
Requirements: Abuse Detection Backfill Pipeline
(Figure: Completeness, Low Latency, and Low Cost, each rated on a scale from Important to Not Important)
Page 15
2 Google’s Data Processing Story
Page 16
Google’s Data-Related Papers
(Figure: timeline from 2002 to 2016)
MapReduce, GFS, Bigtable, Dremel, Pregel, FlumeJava, Colossus, Spanner, MillWheel, Dataflow
Page 17
MapReduce: Batch Processing
(Prepare) -> Map -> (Shuffle) -> Reduce -> (Produce)
Page 18
FlumeJava: Easy and Efficient MapReduce Pipelines
● Higher-level API with simple data processing abstractions.
  ○ Focus on what you want to do to your data, not what the underlying system supports.
● A graph of transformations is automatically transformed into an optimized series of MapReduces.
Page 19
Batch Patterns: Creating Structured Data
(Figure: MapReduce)
Page 20
Batch Patterns: Repetitive Runs
(Figure: MapReduce over daily data sets labeled Tuesday, Wednesday, Thursday)
Page 21
Batch Patterns: Time Based Windows
(Figure: MapReduce over Tuesday's data, split into hourly windows from [11:00 - 12:00) to [23:00 - 0:00))
Page 22
Batch Patterns: Sessions
(Figure: MapReduce over Tuesday and Wednesday data, grouped into per-user sessions for Jose, Lisa, Ingo, Asha, Cheryl, and Ari)
Page 23
MillWheel: Streaming Computations
● Framework for building low-latency data-processing applications
● User provides a DAG of computations to be performed
● System manages state and persistent flow of elements
Page 24
Streaming Patterns: Element-wise transformations
(Figure: Processing Time axis from 8:00 to 14:00)
Page 25
Streaming Patterns: Aggregating Time Based Windows
(Figure: Processing Time axis from 8:00 to 14:00)
Page 26
Streaming Patterns: Event Time Based Windows
(Figure: Input and Output shown against Event Time and Processing Time axes, each running from 10:00 to 15:00)
Page 27
Streaming Patterns: Session Windows
(Figure: Input and Output shown against Event Time and Processing Time axes, each running from 10:00 to 15:00)
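Session windows group one key's elements into bursts of activity separated by gaps of inactivity. The following is a minimal sketch in plain Java, not the Dataflow SDK. It assumes the usual merge rule: each element starts a proto-window [t, t + gap), and overlapping proto-windows are merged. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: merges one key's event times into session windows.
final class SessionWindowSketch {

    // Returns sessions as {start, end} pairs, where end = last event time + gap.
    static List<long[]> sessions(long[] eventTimes, long gap) {
        long[] sorted = eventTimes.clone();
        Arrays.sort(sorted); // arrival order does not matter, only event time
        List<long[]> result = new ArrayList<>();
        for (long t : sorted) {
            long[] last = result.isEmpty() ? null : result.get(result.size() - 1);
            if (last != null && t < last[1]) {
                // Proto-window [t, t + gap) overlaps the open session: extend it.
                last[1] = Math.max(last[1], t + gap);
            } else {
                // Gap of inactivity reached: start a new session.
                result.add(new long[] {t, t + gap});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Event times in minutes, deliberately out of order; gap of 5 minutes.
        for (long[] s : sessions(new long[] {10, 12, 30, 11}, 5)) {
            System.out.println(Arrays.toString(s)); // [10, 17] then [30, 35]
        }
    }
}
```

Because the input is sorted by event time before merging, a late-arriving element (11 above) still lands in the session it belongs to.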
Page 28
Event-Time Skew
(Figure: Processing Time plotted against Event Time, showing the Watermark and the Skew)
Watermarks describe event time progress.
"No timestamp earlier than the watermark will be seen"
Often heuristic-based.
Too slow? Results are delayed.
Too fast? Some data is late.
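To make the slow/fast tradeoff concrete, here is a minimal sketch of one possible heuristic watermark in plain Java. It assumes, purely for illustration, that the watermark trails the largest event time seen so far by a fixed skew allowance. This is not the heuristic the Dataflow service uses, and the names `HeuristicWatermark` and `allowedSkewMillis` are hypothetical.

```java
// Hypothetical heuristic: watermark = max event time seen - allowed skew.
final class HeuristicWatermark {
    private final long allowedSkewMillis;
    private long maxEventTimeMillis = Long.MIN_VALUE;

    HeuristicWatermark(long allowedSkewMillis) {
        this.allowedSkewMillis = allowedSkewMillis;
    }

    // Current watermark estimate; Long.MIN_VALUE until the first element arrives.
    long current() {
        return maxEventTimeMillis == Long.MIN_VALUE
            ? Long.MIN_VALUE
            : maxEventTimeMillis - allowedSkewMillis;
    }

    // Observes one element and reports whether it is late, i.e. stamped
    // earlier than the watermark as it stood before this element arrived.
    boolean observe(long eventTimeMillis) {
        boolean late = eventTimeMillis < current();
        maxEventTimeMillis = Math.max(maxEventTimeMillis, eventTimeMillis);
        return late;
    }

    public static void main(String[] args) {
        HeuristicWatermark wm = new HeuristicWatermark(60_000L); // 1 minute of skew
        long[] arrivals = {0L, 30_000L, 120_000L, 50_000L, 70_000L};
        for (long t : arrivals) {
            boolean late = wm.observe(t);
            System.out.println("event=" + t + " late=" + late + " watermark=" + wm.current());
        }
    }
}
```

In this sketch a larger skew allowance gives a slower watermark (fewer late elements, more delayed results), and a smaller one gives a faster watermark (earlier results, more late data).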
Page 29
Streaming or Batch?
Completeness (1 + 1 = 2), Latency, Cost ($$$)
Why not both?
Page 30
3 The Dataflow Model
Page 31
What are you computing?
Where in event time?
When in processing time?
How do refinements relate?
Page 32
What are you computing?
• A Pipeline represents a graph of data processing transformations
• PCollections flow through the pipeline
• Optimized and executed as a unit for efficiency
Page 33
What are you computing?
• A PCollection<T> is a collection of data of type T
• May be bounded or unbounded in size
• Each element has an implicit timestamp
• Initially created from backing data stores
Page 34
What are you computing?
PTransforms transform PCollections into other PCollections.
What Where When How
(Figure panels: Element-Wise, Aggregating, Composite)
Page 35
Example: Computing Integer Sums
// Collection of raw log lines
PCollection<String> raw = ...;

// Element-wise transformation into team/score pairs
PCollection<KV<String, Integer>> input =
    raw.apply(ParDo.of(new ParseFn()));

// Composite transformation containing an aggregation
PCollection<KV<String, Integer>> output =
    input.apply(Sum.integersPerKey());
Page 36
Example: Computing Integer Sums
Page 37
Example: Computing Integer Sums
Page 38
Where in Event Time?
• Windowing divides data into event-time-based finite chunks.
• Required when doing aggregations over unbounded data.
(Figure: three panels, Fixed, Sliding, and Sessions, each showing numbered windows for Key 1, Key 2, and Key 3)
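Window assignment for the fixed and sliding cases can be sketched in plain Java without the SDK. This is an illustrative sketch, not how any runner implements it. Windows are half-open intervals [start, start + size), timestamps are in minutes, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: assigns event timestamps to fixed and sliding windows.
final class WindowAssignmentSketch {

    // Start of the single fixed window [start, start + size) containing the timestamp.
    static long fixedWindowStart(long timestamp, long size) {
        return Math.floorDiv(timestamp, size) * size;
    }

    // Starts of every sliding window of length `size`, one beginning each
    // `period`, that contains the timestamp (latest window first).
    static List<Long> slidingWindowStarts(long timestamp, long size, long period) {
        List<Long> starts = new ArrayList<>();
        long lastStart = Math.floorDiv(timestamp, period) * period;
        for (long start = lastStart; start > timestamp - size; start -= period) {
            starts.add(start);
        }
        return starts;
    }

    // Sums values per fixed window, keyed by window start.
    static Map<Long, Integer> sumPerFixedWindow(long[] timestamps, int[] values, long size) {
        Map<Long, Integer> sums = new TreeMap<>();
        for (int i = 0; i < timestamps.length; i++) {
            sums.merge(fixedWindowStart(timestamps[i], size), values[i], Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        long[] eventMinutes = {0, 1, 2, 3, 5};
        int[] scores = {5, 9, 7, 8, 3};
        System.out.println(sumPerFixedWindow(eventMinutes, scores, 2)); // {0=14, 2=15, 4=3}
        System.out.println(slidingWindowStarts(5, 4, 2));               // [4, 2]
    }
}
```

A fixed window places each element in exactly one chunk, while a sliding window can place the same element in several overlapping chunks.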
Page 39
Example: Fixed 2-minute Windows
PCollection<KV<String, Integer>> output = input
    .apply(Window.into(FixedWindows.of(Minutes(2))))
    .apply(Sum.integersPerKey());
Page 40
Example: Fixed 2-minute Windows
Page 41
When in Processing Time?
• Triggers control when results are emitted.
• Triggers are often relative to the watermark.
(Figure: Processing Time plotted against Event Time, showing the Watermark)
Page 42
Example: Triggering at the Watermark
PCollection<KV<String, Integer>> output = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
                 .trigger(AtWatermark()))
    .apply(Sum.integersPerKey());
Page 43
Example: Triggering at the Watermark
Page 44
Example: Triggering for Speculative & Late Data
PCollection<KV<String, Integer>> output = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
                 .trigger(AtWatermark()
                     .withEarlyFirings(AtPeriod(Minutes(1)))
                     .withLateFirings(AtCount(1))))
    .apply(Sum.integersPerKey());
Page 45
Example: Triggering for Speculative & Late Data
Page 46
How do Refinements Relate?
• How should multiple outputs per window accumulate?
• Appropriate choice depends on consumer.

Firing           Elements   Discarding   Accumulating   Acc. & Retracting
Speculative      3          3            3              3
Watermark        5, 1       6            9              9, -3
Late             2          2            11             11, -9
Total Observed   11         11           23             11
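The three accumulation modes can be reproduced with a few lines of plain Java. The sketch below is illustrative only and uses no SDK. It takes the elements seen at each firing (3, then 5 and 1, then 2) and computes what each mode would emit. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: what each accumulation mode emits per firing of one window.
final class AccumulationModesSketch {

    private static int sum(int[] values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // Discarding: each firing emits only the elements that arrived since the last firing.
    static List<Integer> discarding(int[][] panes) {
        List<Integer> out = new ArrayList<>();
        for (int[] pane : panes) {
            out.add(sum(pane));
        }
        return out;
    }

    // Accumulating: each firing emits the running total for the window so far.
    static List<Integer> accumulating(int[][] panes) {
        List<Integer> out = new ArrayList<>();
        int total = 0;
        for (int[] pane : panes) {
            total += sum(pane);
            out.add(total);
        }
        return out;
    }

    // Accumulating & retracting: each firing after the first emits the new
    // running total together with a retraction (negation) of the previous one.
    static List<List<Integer>> accumulatingAndRetracting(int[][] panes) {
        List<List<Integer>> out = new ArrayList<>();
        int total = 0;
        boolean first = true;
        for (int[] pane : panes) {
            int previous = total;
            total += sum(pane);
            List<Integer> firing = new ArrayList<>();
            firing.add(total);
            if (!first) {
                firing.add(-previous);
            }
            out.add(firing);
            first = false;
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] panes = {{3}, {5, 1}, {2}}; // speculative, watermark, late
        System.out.println(discarding(panes));                // [3, 6, 2]
        System.out.println(accumulating(panes));              // [3, 9, 11]
        System.out.println(accumulatingAndRetracting(panes)); // [[3], [9, -3], [11, -9]]
    }
}
```

A downstream consumer that simply adds up everything it receives sees 11 under discarding, 23 under accumulating, and 11 under accumulating and retracting, which is why the right mode depends on the consumer.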
Page 47
Example: Add Newest, Remove Previous
PCollection<KV<String, Integer>> output = input
    .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
                 .trigger(AtWatermark()
                     .withEarlyFirings(AtPeriod(Minutes(1)))
                     .withLateFirings(AtCount(1)))
                 .accumulatingAndRetracting())
    .apply(new Sum());
Page 48
Example: Add Newest, Remove Previous
Page 49
Customizing What Where When How
1. Classic Batch
2. Batch with Fixed Windows
3. Streaming
4. Streaming with Speculative + Late Data
5. Streaming with Retractions
Page 50
4 Demo: Google Cloud Dataflow
Page 51
Google Cloud Dataflow
A fully managed cloud service and programming model for batch and streaming big data processing.
Page 52
Open Source SDKs
● Used to construct a Dataflow pipeline.
● Java available now. Python is in open beta.
● Pipelines can run…
  ○ On your development machine
  ○ On the Dataflow Service on Google Cloud Platform
  ○ On third-party environments like Spark or Flink.
Page 53
Fully Managed Dataflow Service
Runs the pipeline on Google Cloud Platform. Includes:
● Graph optimization: Modular code, efficient execution
● Smart Workers: Lifecycle management, autoscaling, and smart task rebalancing
● Easy Monitoring: Dataflow UI, RESTful API and CLI, integration with Cloud Logging, etc.
Page 55
Data
user11_BisqueEmu,BisqueEmu,11,1460980305000,2016-04-18 04:59:27.936
user0_BisqueQuokka,BisqueQuokka,15,1460980768000,2016-04-18 04:59:28.473
user11_ArmyGreenNumbat,ArmyGreenNumbat,5,1460980768000,2016-04-18 04:59:28.473
user4_AquaEmu,AquaEmu,19,1460980768000,2016-04-18 04:59:28.473
user5_BeigeDingo,BeigeDingo,16,1460980768000,2016-04-18 04:59:28.473
user1_ArmyGreenKookaburra,ArmyGreenKookaburra,9,1460980768000,2016-04-18 04:59:28.473
user7_AquaDingo,AquaDingo,18,1460980768000,2016-04-18 04:59:28.473
user7_AmberEmu,AmberEmu,11,1460980768000,2016-04-18 04:59:28.473
user1_BeigeDingo,BeigeDingo,6,1460980768000,2016-04-18 04:59:28.473
user2_ArmyGreenKookaburra,ArmyGreenKookaburra,8,1460980768000,2016-04-18 04:59:28.473
user10_AzureBandicoot,AzureBandicoot,12,1460980768000,2016-04-18 04:59:28.473
user1_AzureBandicoot,AzureBandicoot,12,1460980768000,2016-04-18 04:59:28.473
Robot-12,BeigeQuokka,12,1460980768000,2016-04-18 04:59:28.473
user3_MagentaAntechinus,MagentaAntechinus,5,1460980768000,2016-04-18 04:59:28.473
user6_AmberEmu,AmberEmu,15,1460980768000,2016-04-18 04:59:28.473
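The slides do not label the columns of these records. The sketch below assumes they are user, team, score, an event timestamp in epoch milliseconds, and a human-readable time string. Under that assumption, it shows in plain Java (no SDK) how a parse step could turn one line into a team/score pair and how scores could then be summed per team. `GameEventSketch` and its members are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only. Column meanings are an assumption, not stated in the slides:
// user, team, score, event timestamp (epoch millis), readable time string.
final class GameEventSketch {
    final String user;
    final String team;
    final int score;
    final long timestampMillis;
    final String readableTime;

    private GameEventSketch(String user, String team, int score,
                            long timestampMillis, String readableTime) {
        this.user = user;
        this.team = team;
        this.score = score;
        this.timestampMillis = timestampMillis;
        this.readableTime = readableTime;
    }

    // Parses one comma-separated record; rejects lines without exactly 5 fields.
    static GameEventSketch parse(String line) {
        String[] parts = line.split(",", -1);
        if (parts.length != 5) {
            throw new IllegalArgumentException("expected 5 fields: " + line);
        }
        return new GameEventSketch(
            parts[0].trim(),
            parts[1].trim(),
            Integer.parseInt(parts[2].trim()),
            Long.parseLong(parts[3].trim()),
            parts[4].trim());
    }

    // Sums scores per team, the plain-Java analogue of a per-key integer sum.
    static Map<String, Integer> sumPerTeam(List<String> lines) {
        Map<String, Integer> sums = new TreeMap<>();
        for (String line : lines) {
            GameEventSketch event = parse(line);
            sums.merge(event.team, event.score, Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "user7_AmberEmu,AmberEmu,11,1460980768000,2016-04-18 04:59:28.473",
            "user6_AmberEmu,AmberEmu,15,1460980768000,2016-04-18 04:59:28.473",
            "user1_BeigeDingo,BeigeDingo,6,1460980768000,2016-04-18 04:59:28.473");
        System.out.println(sumPerTeam(lines)); // {AmberEmu=26, BeigeDingo=6}
    }
}
```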
Page 56
Wrote Reusable PTransforms
Ran the PTransform in streaming mode
Used event time, not processing time
Easily adjusted windowing and late data settings
Page 57
Learn More!
● The Dataflow Model @VLDB 2015: http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
● Dataflow SDK for Java: https://github.com/GoogleCloudPlatform/DataflowJavaSDK
● Google Cloud Dataflow on Google Cloud Platform: http://cloud.google.com/dataflow (Free Trial!)