Dataflow - A Unified Model for Batch and Streaming Data Processing

Apr 16, 2017

Page 1: Dataflow - A Unified Model for Batch and Streaming Data Processing

Dataflow: A Unified Model for Batch and Streaming Data Processing

Vadim Solovey, Google Developer Expert & Trainer

Shahar Frank, Cloud Solutions Architect

Page 2: Dataflow - A Unified Model for Batch and Streaming Data Processing

Goals:

Write interesting computations

Run in both batch & streaming

Use custom timestamps

Handle late data

https://commons.wikimedia.org/wiki/File:Globe_centered_in_the_Atlantic_Ocean_(green_and_grey_globe_scheme).svg

Page 3: Dataflow - A Unified Model for Batch and Streaming Data Processing

Agenda

1. Data Shapes
2. Google’s Data Processing Story
3. The Dataflow Model
4. Demo: Google Cloud Dataflow

Page 4: Dataflow - A Unified Model for Batch and Streaming Data Processing

1. Data Shapes

Page 5: Dataflow - A Unified Model for Batch and Streaming Data Processing

Data...

Page 6: Dataflow - A Unified Model for Batch and Streaming Data Processing

...can be big...

Page 7: Dataflow - A Unified Model for Batch and Streaming Data Processing

...really, really big...

[Figure labels: Tuesday, Wednesday, Thursday]

Page 8: Dataflow - A Unified Model for Batch and Streaming Data Processing

...maybe even infinitely big...

[Figure: timeline with hourly ticks from 1:00 to 14:00]

Page 9: Dataflow - A Unified Model for Batch and Streaming Data Processing

… with unknown delays.

[Figure: timeline with hourly ticks from 8:00 to 14:00 and four elements labeled 8:00]

Page 10: Dataflow - A Unified Model for Batch and Streaming Data Processing

Data Processing Tradeoffs

Completeness (1 + 1 = 2), Latency, Cost ($$$)

Page 11: Dataflow - A Unified Model for Batch and Streaming Data Processing

Requirements: Billing Pipeline

[Figure: Completeness, Low Latency, and Low Cost, each rated between Important and Not Important]

Page 12: Dataflow - A Unified Model for Batch and Streaming Data Processing

Requirements: Live Cost Estimate Pipeline

[Figure: Completeness, Low Latency, and Low Cost, each rated between Important and Not Important]

Page 13: Dataflow - A Unified Model for Batch and Streaming Data Processing

Requirements: Abuse Detection Pipeline

[Figure: Completeness, Low Latency, and Low Cost, each rated between Important and Not Important]

Page 14: Dataflow - A Unified Model for Batch and Streaming Data Processing

Requirements: Abuse Detection Backfill Pipeline

[Figure: Completeness, Low Latency, and Low Cost, each rated between Important and Not Important]

Page 15: Dataflow - A Unified Model for Batch and Streaming Data Processing

2. Google’s Data Processing Story

Page 16: Dataflow - A Unified Model for Batch and Streaming Data Processing

Google’s Data-Related Papers

[Figure: timeline from 2002 to 2016 showing MapReduce, GFS, Bigtable, Dremel, Pregel, FlumeJava, Colossus, Spanner, MillWheel, and Dataflow]

Page 17: Dataflow - A Unified Model for Batch and Streaming Data Processing

MapReduce: Batch Processing

(Prepare) → Map → (Shuffle) → Reduce → (Produce)
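The Map, Shuffle, and Reduce stages named on this slide can be illustrated with a small in-memory sketch. It is not how MapReduce itself is implemented; the class name `MapReduceSketch`, the `team,score` record format, and the sum-per-key reduction are assumptions chosen for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory sketch of the Map, Shuffle, and Reduce stages (hypothetical, not distributed). */
final class MapReduceSketch {

    /** Map: parse one "team,score" record into a key/value pair. */
    static Map.Entry<String, Integer> map(String record) {
        String[] parts = record.split(",");
        return new AbstractMap.SimpleEntry<>(parts[0].trim(), Integer.parseInt(parts[1].trim()));
    }

    /** Shuffle: group all values that share a key. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce: collapse each key's values into a single sum. */
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            sums.put(entry.getKey(), sum);
        }
        return sums;
    }

    /** Runs the three stages in order over an in-memory list of records. */
    static Map<String, Integer> run(List<String> records) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String record : records) {
            pairs.add(map(record));
        }
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("red,3", "blue,5", "red,4"))); // {blue=5, red=7}
    }
}
```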

Page 18: Dataflow - A Unified Model for Batch and Streaming Data Processing

FlumeJava: Easy and Efficient MapReduce Pipelines

● Higher-level API with simple data processing abstractions.
  ○ Focus on what you want to do to your data, not what the underlying system supports.

● A graph of transformations is automatically transformed into an optimized series of MapReduces.

Page 19: Dataflow - A Unified Model for Batch and Streaming Data Processing

MapReduce

Batch Patterns: Creating Structured Data

Page 20: Dataflow - A Unified Model for Batch and Streaming Data Processing

MapReduce

Batch Patterns: Repetitive Runs

[Figure labels: Tuesday, Wednesday, Thursday]

Page 21: Dataflow - A Unified Model for Batch and Streaming Data Processing

MapReduce

Tuesday: [11:00 - 12:00), [12:00 - 13:00), [13:00 - 14:00), [14:00 - 15:00), [15:00 - 16:00), [16:00 - 17:00), [18:00 - 19:00), [19:00 - 20:00), [21:00 - 22:00), [22:00 - 23:00), [23:00 - 0:00)

Batch Patterns: Time Based Windows

Page 22: Dataflow - A Unified Model for Batch and Streaming Data Processing

MapReduce

Batch Patterns: Sessions

[Figure: sessions per user (Jose, Lisa, Ingo, Asha, Cheryl, Ari) across Tuesday and Wednesday]

Page 23: Dataflow - A Unified Model for Batch and Streaming Data Processing

MillWheel: Streaming Computations

● Framework for building low-latency data-processing applications

● User provides a DAG of computations to be performed

● System manages state and persistent flow of elements

Page 24: Dataflow - A Unified Model for Batch and Streaming Data Processing

Streaming Patterns: Element-wise transformations

[Figure: processing-time axis with hourly ticks from 8:00 to 14:00]

Page 25: Dataflow - A Unified Model for Batch and Streaming Data Processing

Streaming Patterns: Aggregating Time Based Windows

[Figure: processing-time axis with hourly ticks from 8:00 to 14:00]

Page 26: Dataflow - A Unified Model for Batch and Streaming Data Processing

[Figure: Input and Output rows over event-time and processing-time axes, each with hourly ticks from 10:00 to 15:00]

Streaming Patterns: Event Time Based Windows

Page 27: Dataflow - A Unified Model for Batch and Streaming Data Processing

Streaming Patterns: Session Windows

[Figure: Input and Output rows over event-time and processing-time axes, each with hourly ticks from 10:00 to 15:00]

Page 28: Dataflow - A Unified Model for Batch and Streaming Data Processing

Event-Time Skew

[Figure: processing time plotted against event time, with the skew and the watermark marked]

Watermarks describe event time progress.

"No timestamp earlier than the watermark will be seen"

Often heuristic-based.

Too slow? Results are delayed.

Too fast? Some data is late.
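To make the slow/fast tradeoff concrete, here is a minimal sketch of one possible watermark heuristic: the largest event time seen so far minus a fixed skew allowance. The heuristic, the class name `WatermarkSketch`, and the parameter `allowedSkewMillis` are assumptions for illustration only; the slides do not say which heuristic any real system uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical heuristic watermark: max event time seen so far minus a fixed skew allowance. */
final class WatermarkSketch {
    private final long allowedSkewMillis;
    private long watermark = Long.MIN_VALUE;

    WatermarkSketch(long allowedSkewMillis) {
        this.allowedSkewMillis = allowedSkewMillis;
    }

    /** Observes one element; returns true if its event time is behind the current watermark (late). */
    boolean observe(long eventTimeMillis) {
        boolean late = eventTimeMillis < watermark;
        // The watermark only ever moves forward.
        watermark = Math.max(watermark, eventTimeMillis - allowedSkewMillis);
        return late;
    }

    /** Returns the event times classified as late, in arrival order. */
    static List<Long> lateElements(long allowedSkewMillis, List<Long> arrivals) {
        WatermarkSketch sketch = new WatermarkSketch(allowedSkewMillis);
        List<Long> late = new ArrayList<>();
        for (long eventTime : arrivals) {
            if (sketch.observe(eventTime)) {
                late.add(eventTime);
            }
        }
        return late;
    }

    public static void main(String[] args) {
        // Event times listed in the order the elements arrive.
        List<Long> arrivals = Arrays.asList(1000L, 2000L, 9000L, 3000L, 8500L);
        System.out.println(lateElements(1000, arrivals));  // fast watermark: [3000]
        System.out.println(lateElements(10000, arrivals)); // slow watermark: []
    }
}
```

With the small allowance the watermark advances quickly and the element stamped 3000 is classified as late. With the large allowance nothing is late, but any result gated on the watermark would have to wait longer.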

Page 29: Dataflow - A Unified Model for Batch and Streaming Data Processing

Streaming or Batch?

Completeness (1 + 1 = 2), Latency, Cost ($$$)

Why not both?

Page 30: Dataflow - A Unified Model for Batch and Streaming Data Processing

3. The Dataflow Model

Page 31: Dataflow - A Unified Model for Batch and Streaming Data Processing

What are you computing?

Where in event time?

When in processing time?

How do refinements relate?

Page 32: Dataflow - A Unified Model for Batch and Streaming Data Processing

What are you computing?

• A Pipeline represents a graph of data processing transformations

• PCollections flow through the pipeline

• Optimized and executed as a unit for efficiency

Page 33: Dataflow - A Unified Model for Batch and Streaming Data Processing

What are you computing?

• A PCollection<T> is a collection of data of type T

• May be bounded or unbounded in size

• Each element has an implicit timestamp

• Initially created from backing data stores

Page 34: Dataflow - A Unified Model for Batch and Streaming Data Processing

What are you computing?

PTransforms transform PCollections into other PCollections.

What Where When How

Element-Wise Aggregating Composite

Page 35: Dataflow - A Unified Model for Batch and Streaming Data Processing

Example: Computing Integer Sums

// Collection of raw log lines
PCollection<String> raw = ...;

// Element-wise transformation into team/score pairs
PCollection<KV<String, Integer>> input =
    raw.apply(ParDo.of(new ParseFn()));

// Composite transformation containing an aggregation
PCollection<KV<String, Integer>> output =
    input.apply(Sum.integersPerKey());

What Where When How

Page 36: Dataflow - A Unified Model for Batch and Streaming Data Processing

Example: Computing Integer Sums

What Where When How

Page 37: Dataflow - A Unified Model for Batch and Streaming Data Processing

What Where When How

Example: Computing Integer Sums

Page 38: Dataflow - A Unified Model for Batch and Streaming Data Processing

[Figure: Fixed, Sliding, and Sessions windows, each shown for Key 1, Key 2, and Key 3]

Where in Event Time?

• Windowing divides data into event-time-based finite chunks.

• Required when doing aggregations over unbounded data.
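As a rough illustration of how a timestamp could be mapped to fixed and sliding windows, the sketch below treats windows as half-open intervals `[start, start + size)` aligned to multiples of the size (fixed) or the period (sliding). The alignment rule and the names `WindowAssignSketch`, `fixedWindowStart`, and `slidingWindowStarts` are assumptions, not the SDK's API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical event-time window assignment; windows are half-open intervals [start, start + size). */
final class WindowAssignSketch {

    /** Fixed windows: each timestamp falls into exactly one window. Returns that window's start. */
    static long fixedWindowStart(long timestamp, long size) {
        return timestamp - Math.floorMod(timestamp, size);
    }

    /**
     * Sliding windows: a timestamp falls into every window of length size that starts
     * on a multiple of period and covers it. Returns the starts, newest window first.
     */
    static List<Long> slidingWindowStarts(long timestamp, long size, long period) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - Math.floorMod(timestamp, period);
        for (long start = lastStart; start > timestamp - size; start -= period) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // Timestamps are in minutes for readability.
        System.out.println(fixedWindowStart(7, 2));       // 6, i.e. window [6, 8)
        System.out.println(slidingWindowStarts(7, 4, 2)); // [6, 4], i.e. windows [6, 10) and [4, 8)
    }
}
```

Session windows do not fit this per-element scheme because their boundaries depend on neighboring elements of the same key.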

What Where When How

Page 39: Dataflow - A Unified Model for Batch and Streaming Data Processing

PCollection<KV<String, Integer>> output = input
    .apply(Window.into(FixedWindows.of(Minutes(2))))
    .apply(Sum.integersPerKey());

What Where When How

Example: Fixed 2-minute Windows

Page 40: Dataflow - A Unified Model for Batch and Streaming Data Processing

What Where When How

Example: Fixed 2-minute Windows

Page 41: Dataflow - A Unified Model for Batch and Streaming Data Processing

What Where When How

When in Processing Time?

• Triggers control when results are emitted.

• Triggers are often relative to the watermark.

[Figure: processing time plotted against event time, with the watermark marked]

Page 42: Dataflow - A Unified Model for Batch and Streaming Data Processing

PCollection<KV<String, Integer>> output = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .trigger(AtWatermark()))
    .apply(Sum.integersPerKey());

What Where When How

Example: Triggering at the Watermark

Page 43: Dataflow - A Unified Model for Batch and Streaming Data Processing

What Where When How

Example: Triggering at the Watermark

Page 44: Dataflow - A Unified Model for Batch and Streaming Data Processing

What Where When How

Example: Triggering for Speculative & Late Data

PCollection<KV<String, Integer>> output = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .trigger(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1))))
    .apply(Sum.integersPerKey());

Page 45: Dataflow - A Unified Model for Batch and Streaming Data Processing

What Where When How

Example: Triggering for Speculative & Late Data

Page 46: Dataflow - A Unified Model for Batch and Streaming Data Processing

What Where When How

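How do Refinements Relate?

• How should multiple outputs per window accumulate?

• Appropriate choice depends on consumer.

| Firing         | Elements | Discarding | Accumulating | Acc. & Retracting |
|----------------|----------|------------|--------------|-------------------|
| Speculative    | 3        | 3          | 3            | 3                 |
| Watermark      | 5, 1     | 6          | 9            | 9, -3             |
| Late           | 2        | 2          | 11           | 11, -9            |
| Total Observed | 11       | 11         | 23           | 11                |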
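The figures on this slide can be reproduced with a short sketch that sums the elements of each firing under the three accumulation modes. The class and method names are hypothetical, and the firings are the slide's own: 3, then 5 and 1, then 2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the three accumulation modes over the firings of a single window. */
final class AccumulationSketch {

    /** Discarding: each firing reports only the sum of elements that arrived since the previous firing. */
    static List<Integer> discarding(List<List<Integer>> firings) {
        List<Integer> out = new ArrayList<>();
        for (List<Integer> firing : firings) {
            out.add(sum(firing));
        }
        return out;
    }

    /** Accumulating: each firing reports the running sum of everything seen so far. */
    static List<Integer> accumulating(List<List<Integer>> firings) {
        List<Integer> out = new ArrayList<>();
        int running = 0;
        for (List<Integer> firing : firings) {
            running += sum(firing);
            out.add(running);
        }
        return out;
    }

    /** Accumulating and retracting: each firing after the first also retracts the previous running sum. */
    static List<Integer> accumulatingAndRetracting(List<List<Integer>> firings) {
        List<Integer> out = new ArrayList<>();
        int running = 0;
        boolean first = true;
        for (List<Integer> firing : firings) {
            int previous = running;
            running += sum(firing);
            out.add(running);
            if (!first) {
                out.add(-previous);
            }
            first = false;
        }
        return out;
    }

    static int sum(List<Integer> values) {
        int total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<Integer>> firings = Arrays.asList(
                Arrays.asList(3),     // speculative firing
                Arrays.asList(5, 1),  // watermark firing
                Arrays.asList(2));    // late firing
        System.out.println(discarding(firings));                // [3, 6, 2], total observed 11
        System.out.println(accumulating(firings));              // [3, 9, 11], total observed 23
        System.out.println(accumulatingAndRetracting(firings)); // [3, 9, -3, 11, -9], total observed 11
    }
}
```

A consumer that adds up every output it receives gets the correct total of 11 under the discarding and retracting modes, but overcounts (23) under plain accumulation, which is why the right choice depends on the consumer.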

Page 47: Dataflow - A Unified Model for Batch and Streaming Data Processing

PCollection<KV<String, Integer>> output = input
    .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
        .trigger(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingAndRetracting())
    .apply(new Sum());

What Where When How

Example: Add Newest, Remove Previous
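For intuition about what a session gap duration does, the following sketch groups one key's event timestamps into sessions. It assumes the rule that each element opens a window `[t, t + gap)` and that overlapping windows merge; the class name `SessionSketch` and the sample timestamps are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of session windowing for the events of a single key. */
final class SessionSketch {

    /**
     * Groups event timestamps into sessions. Each event opens the window [t, t + gap);
     * windows that overlap are merged. Returns {start, end} pairs in time order.
     */
    static List<long[]> sessions(List<Long> eventTimes, long gap) {
        List<Long> sorted = new ArrayList<>(eventTimes);
        Collections.sort(sorted);
        List<long[]> result = new ArrayList<>();
        for (long t : sorted) {
            long[] last = result.isEmpty() ? null : result.get(result.size() - 1);
            if (last != null && t < last[1]) {
                last[1] = t + gap;                      // event falls inside the open session: extend it
            } else {
                result.add(new long[] {t, t + gap});    // gap exceeded: start a new session
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Timestamps in seconds, deliberately out of order; gap duration of 60 seconds.
        for (long[] s : sessions(Arrays.asList(0L, 30L, 200L, 50L, 230L), 60)) {
            System.out.println("[" + s[0] + ", " + s[1] + ")"); // prints [0, 110) then [200, 290)
        }
    }
}
```

Because a late element can bridge two existing sessions, previously emitted results may need to be replaced, which is the situation retractions are meant to handle.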

Page 48: Dataflow - A Unified Model for Batch and Streaming Data Processing

What Where When How

Example: Add Newest, Remove Previous

Page 49: Dataflow - A Unified Model for Batch and Streaming Data Processing

Customizing What Where When How

1. Classic Batch
2. Batch with Fixed Windows
3. Streaming
4. Streaming with Speculative + Late Data
5. Streaming with Retractions

What Where When How

Page 50: Dataflow - A Unified Model for Batch and Streaming Data Processing

4. Demo: Google Cloud Dataflow

Page 51: Dataflow - A Unified Model for Batch and Streaming Data Processing

Google Cloud Dataflow

Cloud Dataflow

A fully-managed cloud service and programming model for batch and streaming big data processing.

Page 52: Dataflow - A Unified Model for Batch and Streaming Data Processing

Open Source SDKs

● Used to construct a Dataflow pipeline.

● Java available now. Python in open beta.

● Pipelines can run…
  ○ On your development machine
  ○ On the Dataflow Service on Google Cloud Platform
  ○ On third-party environments like Spark or Flink.

Page 53: Dataflow - A Unified Model for Batch and Streaming Data Processing

Fully Managed Dataflow Service

Runs the pipeline on Google Cloud Platform. Includes:

● Graph optimization: Modular code, efficient execution

● Smart Workers: Lifecycle management, autoscaling, and smart task rebalancing

● Easy Monitoring: Dataflow UI, RESTful API and CLI, integration with Cloud Logging, etc.

Page 54: Dataflow - A Unified Model for Batch and Streaming Data Processing

Demo Time!

Page 55: Dataflow - A Unified Model for Batch and Streaming Data Processing

user11_BisqueEmu,BisqueEmu,11,1460980305000,2016-04-18 04:59:27.936
user0_BisqueQuokka,BisqueQuokka,15,1460980768000,2016-04-18 04:59:28.473
user11_ArmyGreenNumbat,ArmyGreenNumbat,5,1460980768000,2016-04-18 04:59:28.473
user4_AquaEmu,AquaEmu,19,1460980768000,2016-04-18 04:59:28.473
user5_BeigeDingo,BeigeDingo,16,1460980768000,2016-04-18 04:59:28.473
user1_ArmyGreenKookaburra,ArmyGreenKookaburra,9,1460980768000,2016-04-18 04:59:28.473
user7_AquaDingo,AquaDingo,18,1460980768000,2016-04-18 04:59:28.473
user7_AmberEmu,AmberEmu,11,1460980768000,2016-04-18 04:59:28.473
user1_BeigeDingo,BeigeDingo,6,1460980768000,2016-04-18 04:59:28.473
user2_ArmyGreenKookaburra,ArmyGreenKookaburra,8,1460980768000,2016-04-18 04:59:28.473
user10_AzureBandicoot,AzureBandicoot,12,1460980768000,2016-04-18 04:59:28.473
user1_AzureBandicoot,AzureBandicoot,12,1460980768000,2016-04-18 04:59:28.473
Robot-12,BeigeQuokka,12,1460980768000,2016-04-18 04:59:28.473
user3_MagentaAntechinus,MagentaAntechinus,5,1460980768000,2016-04-18 04:59:28.473
user6_AmberEmu,AmberEmu,15,1460980768000,2016-04-18 04:59:28.473

Data
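A record like the ones above could be parsed along the following lines. The field meanings are an assumption inferred from the sample (user, team, score, event timestamp in epoch milliseconds, then a human-readable time that is ignored here), and `GameEventSketch` is a hypothetical class, not the demo's actual `ParseFn`.

```java
/** Hypothetical parser for one comma-separated demo record. */
final class GameEventSketch {
    final String user;
    final String team;
    final int score;
    final long eventTimeMillis;

    GameEventSketch(String user, String team, int score, long eventTimeMillis) {
        this.user = user;
        this.team = team;
        this.score = score;
        this.eventTimeMillis = eventTimeMillis;
    }

    /** Parses the first four fields; any trailing fields are ignored. */
    static GameEventSketch parse(String line) {
        String[] fields = line.split(",");
        if (fields.length < 4) {
            throw new IllegalArgumentException("Expected at least 4 fields: " + line);
        }
        return new GameEventSketch(
                fields[0].trim(),
                fields[1].trim(),
                Integer.parseInt(fields[2].trim()),
                Long.parseLong(fields[3].trim()));
    }

    public static void main(String[] args) {
        GameEventSketch event =
                parse("user11_BisqueEmu,BisqueEmu,11,1460980305000,2016-04-18 04:59:27.936");
        System.out.println(event.team + " scored " + event.score + " at " + event.eventTimeMillis);
    }
}
```

Keeping the epoch-millisecond field as the element's timestamp is what would let a pipeline window by event time rather than by arrival time.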

Page 56: Dataflow - A Unified Model for Batch and Streaming Data Processing

Wrote Reusable PTransforms

Ran the PTransform in streaming mode

Used event time, not processing time

Easily adjusted windowing and late data settings

Page 57: Dataflow - A Unified Model for Batch and Streaming Data Processing

Learn More!

● The Dataflow Model @VLDB 2015: http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf

● Dataflow SDK for Java: https://github.com/GoogleCloudPlatform/DataflowJavaSDK

● Google Cloud Dataflow on Google Cloud Platform: http://cloud.google.com/dataflow (Free Trial!)