Frances Perry & Tyler Akidau @francesjperry, @takidau Apache Beam Committers & Google Engineers Fundamentals of Stream Processing with Apache Beam (incubating) QCon San Francisco -- November 2016 Google Docs version of slides (including animations): https://goo.gl/yzvLXe
Page 1

Frances Perry & Tyler Akidau
@francesjperry, @takidau
Apache Beam Committers & Google Engineers

Fundamentals of Stream Processing with Apache Beam (incubating)

QCon San Francisco -- November 2016

Google Docs version of slides (including animations): https://goo.gl/yzvLXe

Page 2

Agenda

1. Infinite, Out-of-Order Data Sets
2. What, Where, When, How
3. Reasons This is Awesome
4. Apache Beam (incubating)

Page 3

1. Infinite, Out-of-Order Data Sets

Page 4

Data...

Page 5

...can be big...

Page 6

...really, really big...

[Figure: data grouped by day, labeled Tuesday, Wednesday, Thursday]

Page 7

...maybe infinitely big...

[Figure: timeline from 8:00 to 14:00]

Page 8

...with unknown delays.

[Figure: timeline from 8:00 to 14:00, with three elements labeled 8:00]

Page 9

Element-wise transformations

[Figure: Processing Time axis, 8:00 to 14:00]

Page 10

Aggregating via Processing-Time Windows

[Figure: Processing Time axis, 8:00 to 14:00]

Page 11

Aggregating via Event-Time Windows

[Figure: Input and Output rows on Event Time and Processing Time axes, 10:00 to 15:00]

Page 12

Formalizing Event-Time Skew

[Figure: Processing Time plotted against Event Time, with labels Ideal, Reality, and Skew]

Page 13

Formalizing Event-Time Skew

Watermarks describe event time progress.

"No timestamp earlier than the watermark will be seen"

Often heuristic-based.

Too Slow? Results are delayed.
Too Fast? Some data is late.

[Figure: Processing Time plotted against Event Time, with labels Ideal, ~Watermark, and Skew]
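The trade-off on this slide can be sketched in plain Java. This is not how any Beam runner computes its watermark; it is one simple heuristic, assumed here for illustration: the watermark trails the newest event time seen by a fixed skew allowance. The class name `WatermarkSketch` and the parameter `assumedMaxSkewMillis` are hypothetical.

```java
// A minimal heuristic watermark, assuming event-time skew rarely exceeds a fixed bound.
final class WatermarkSketch {
    private final long assumedMaxSkewMillis;      // hypothetical tuning knob
    private long maxEventTimeSeen = Long.MIN_VALUE;

    WatermarkSketch(long assumedMaxSkewMillis) {
        this.assumedMaxSkewMillis = assumedMaxSkewMillis;
    }

    // Current watermark estimate: newest event time seen minus the assumed skew.
    long watermark() {
        return maxEventTimeSeen == Long.MIN_VALUE
                ? Long.MIN_VALUE
                : maxEventTimeSeen - assumedMaxSkewMillis;
    }

    // Observes one element; returns true if it arrived behind the watermark (late data).
    boolean observe(long eventTimeMillis) {
        boolean late = eventTimeMillis < watermark();
        maxEventTimeSeen = Math.max(maxEventTimeSeen, eventTimeMillis);
        return late;
    }
}
```

A large allowance makes the watermark slow, so results wait longer; an allowance of zero makes it fast, so any out-of-order element counts as late.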

Page 14

2. What, Where, When, How

Page 15

What are you computing?

Where in event time?

When in processing time?

How do refinements relate?

Page 16

What are you computing?

What Where When How

[Figure panels: Element-Wise, Aggregating, Composite]

Page 17

What: Computing Integer Sums

// Collection of raw log lines
PCollection<String> raw = IO.read(...);

// Element-wise transformation into team/score pairs
PCollection<KV<String, Integer>> input =
    raw.apply(ParDo.of(new ParseFn()));

// Composite transformation containing an aggregation
PCollection<KV<String, Integer>> scores =
    input.apply(Sum.integersPerKey());
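To make the two steps concrete without a Beam dependency, here is a plain-Java sketch of the same computation over an in-memory list. The `team,score` line format and the names `IntegerSumsSketch`, `parse`, and `sumPerKey` are assumptions for illustration; the slides do not show what `ParseFn` actually parses.

```java
import java.util.AbstractMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class IntegerSumsSketch {
    // Element-wise step: turn one "team,score" line into a key/value pair.
    static Map.Entry<String, Integer> parse(String line) {
        String[] parts = line.split(",");
        return new AbstractMap.SimpleEntry<>(parts[0].trim(), Integer.valueOf(parts[1].trim()));
    }

    // Aggregating step: sum the scores for each team key.
    static Map<String, Integer> sumPerKey(List<String> rawLines) {
        Map<String, Integer> sums = new TreeMap<>();
        for (String line : rawLines) {
            Map.Entry<String, Integer> kv = parse(line);
            sums.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }
        return sums;
    }
}
```

Calling `sumPerKey` on the lines `red,3`, `blue,5`, `red,4` yields one total per team, which is the role `Sum.integersPerKey()` plays in the pipeline above.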

Page 20

Where in event time?

Windowing divides data into event-time-based finite chunks.

Often required when doing aggregations over unbounded data.

[Figure: Fixed, Sliding, and Sessions windows drawn along a Time axis, with rows Key 1, Key 2, Key 3]
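Window assignment for the fixed and sliding cases reduces to integer arithmetic on the event timestamp. The sketch below is plain Java, not Beam's `WindowFn` machinery; the names are hypothetical, windows are half-open intervals `[start, start + size)`, and any consistent time unit works.

```java
import java.util.ArrayList;
import java.util.List;

final class WindowAssignSketch {
    // Start of the single fixed window [start, start + size) containing the timestamp.
    static long fixedWindowStart(long timestamp, long size) {
        return timestamp - Math.floorMod(timestamp, size);
    }

    // Starts of every sliding window (given size and period) containing the timestamp,
    // listed from the most recent window start backwards.
    static List<Long> slidingWindowStarts(long timestamp, long size, long period) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - Math.floorMod(timestamp, period);
        for (long start = lastStart; start > timestamp - size; start -= period) {
            starts.add(start);
        }
        return starts;
    }
}
```

With a size of 120 and a period of 60, a timestamp of 125 belongs to the windows starting at 120 and at 60, so a sliding-window element is counted more than once. Sessions need merging rather than arithmetic, because their bounds depend on the data.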

Page 21

Where: Fixed 2-minute Windows

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2))))
    .apply(Sum.integersPerKey());

Page 23

When in processing time?

• Triggers control when results are emitted.

• Triggers are often relative to the watermark.

[Figure: Processing Time plotted against Event Time, with labels Ideal, ~Watermark, and Skew]

Page 24

When: Triggering at the Watermark

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());

Page 26

When: Early and Late Firings

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1))))
    .apply(Sum.integersPerKey());
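One way to read this trigger is as a small state machine for a single window: speculative panes on a periodic timer, one on-time pane when the watermark reaches the end of the window, then a late pane per late element. The plain-Java sketch below follows that reading; the class and method names are hypothetical and it is not Beam's trigger implementation.

```java
import java.util.Optional;

final class TriggerSketch {
    enum Pane { EARLY, ON_TIME, LATE }

    private final long windowEnd;        // exclusive end of the window, in event time
    private boolean onTimeFired = false;

    TriggerSketch(long windowEnd) {
        this.windowEnd = windowEnd;
    }

    // Periodic processing-time timer: emit a speculative pane until the on-time pane has fired.
    Optional<Pane> onPeriodicTimer() {
        return onTimeFired ? Optional.empty() : Optional.of(Pane.EARLY);
    }

    // Watermark advance: emit the on-time pane once, when the watermark reaches the window end.
    Optional<Pane> onWatermark(long watermark) {
        if (!onTimeFired && watermark >= windowEnd) {
            onTimeFired = true;
            return Optional.of(Pane.ON_TIME);
        }
        return Optional.empty();
    }

    // Element arrival: after the on-time pane, every element triggers a late pane (count of 1).
    Optional<Pane> onElement() {
        return onTimeFired ? Optional.of(Pane.LATE) : Optional.empty();
    }
}
```

Early firings trade completeness for latency, and late firings repair results when the watermark heuristic turned out to be too fast.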

Page 28

How do refinements relate?

• How should multiple outputs per window accumulate?
• Appropriate choice depends on consumer.

Firing           Elements   Discarding   Accumulating   Acc. & Retracting
Speculative      [3]        3            3              3
Watermark        [5, 1]     6            9              9, -3
Late             [2]        2            11             11, -9
Last Observed               2            11             11
Total Observed              11           23             11

(Accumulating & Retracting not yet implemented.)
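The table's numbers can be reproduced with a few lines of plain Java. The sketch below takes the elements that arrived before each firing and computes what each mode would emit; the class and method names are hypothetical, and this only illustrates the arithmetic, not a runner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AccumulationSketch {
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // Discarding: each firing reports only the elements new to that pane.
    static List<Integer> discarding(List<List<Integer>> panes) {
        List<Integer> out = new ArrayList<>();
        for (List<Integer> pane : panes) {
            out.add(sum(pane));
        }
        return out;
    }

    // Accumulating: each firing reports the running total for the window.
    static List<Integer> accumulating(List<List<Integer>> panes) {
        List<Integer> out = new ArrayList<>();
        int running = 0;
        for (List<Integer> pane : panes) {
            running += sum(pane);
            out.add(running);
        }
        return out;
    }

    // Accumulating & retracting: the running total plus a retraction of the previous total.
    static List<List<Integer>> accumulatingAndRetracting(List<List<Integer>> panes) {
        List<List<Integer>> out = new ArrayList<>();
        int running = 0;
        boolean first = true;
        for (List<Integer> pane : panes) {
            int previous = running;
            running += sum(pane);
            out.add(first ? Arrays.asList(running) : Arrays.asList(running, -previous));
            first = false;
        }
        return out;
    }
}
```

For the panes `[3]`, `[5, 1]`, `[2]`, a consumer that adds up everything it receives gets 11 under discarding, 23 under accumulating, and 11 again once retractions are included, which is why the right mode depends on the consumer.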

Page 29

How: Add Newest, Remove Previous

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingAndRetractingFiredPanes())
    .apply(Sum.integersPerKey());

Page 31

3. Reasons This is Awesome

Page 32

What / Where / When / How

• Correctness
• Power
• Composability
• Flexibility
• Modularity

Page 34

Distributed Systems are Distributed

Page 35

Processing Time Results Differ

Page 36

Event Time Results are Stable
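A small plain-Java experiment illustrates the contrast between these two slides. Each element carries an event time, an arrival (processing) time, and a value; the same events are summed per fixed window twice, once keyed by event time and once by arrival time. The data layout and names are assumptions made for this sketch.

```java
import java.util.Map;
import java.util.TreeMap;

final class TimeDomainSketch {
    // Each element is {eventTime, arrivalTime, value}. Sums values per fixed window,
    // keyed by window start, using the time field selected by timeField (0 or 1).
    static Map<Long, Long> sumByWindow(long[][] elements, int timeField, long windowSize) {
        Map<Long, Long> sums = new TreeMap<>();
        for (long[] e : elements) {
            long start = e[timeField] - Math.floorMod(e[timeField], windowSize);
            sums.merge(start, e[2], Long::sum);
        }
        return sums;
    }
}
```

If two runs see the same events with different network delays, `sumByWindow(run, 0, size)` is identical across runs, while `sumByWindow(run, 1, size)` can change whenever an element lands in a different arrival-time window.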

Page 38

Identifying Bursts of User Activity

Page 39

Sessions

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingAndRetractingFiredPanes())
    .apply(Sum.integersPerKey());
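Session windows can be understood as merging: each element opens a proto-window `[t, t + gap)`, and overlapping proto-windows for the same key are merged. The plain-Java sketch below does this for one key over a complete, sorted batch of timestamps; the names are hypothetical, and a streaming runner would instead merge incrementally as elements arrive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SessionSketch {
    // Groups one key's event timestamps into sessions, returned as {start, end} pairs
    // where end is exclusive and equals the last timestamp in the session plus the gap.
    static List<long[]> sessions(long[] timestamps, long gap) {
        long[] sorted = timestamps.clone();
        Arrays.sort(sorted);
        List<long[]> result = new ArrayList<>();
        for (long t : sorted) {
            long[] last = result.isEmpty() ? null : result.get(result.size() - 1);
            if (last != null && t < last[1]) {
                last[1] = t + gap;                      // overlaps: extend the open session
            } else {
                result.add(new long[] {t, t + gap});    // gap exceeded: start a new session
            }
        }
        return result;
    }
}
```

Because a late element can bridge two previously separate sessions, earlier per-session results may become wrong, which is one motivation for the accumulating and retracting mode used above.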

Page 42

Calculating Session Lengths

input
    .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
        .triggering(AtWatermark())
        .discardingFiredPanes())
    .apply(CalculateWindowLength());

Page 44

Calculating the Average Session Length

input
    .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
        .triggering(AtWatermark())
        .discardingFiredPanes())
    .apply(CalculateWindowLength())
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1))))
        .accumulatingFiredPanes())
    .apply(Mean.globally());
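The second stage of this pipeline can be sketched in plain Java: take each session's length, re-window the lengths into fixed windows, and average within each. The slides do not say which timestamp a session length carries, so this sketch assumes it is stamped with the last instant covered by the session (`end - 1`); that choice and all names here are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

final class SessionLengthSketch {
    // sessions: {start, end} pairs with exclusive end. Returns the mean session length
    // per fixed window, keyed by window start.
    static Map<Long, Double> meanLengthPerFixedWindow(long[][] sessions, long windowSize) {
        Map<Long, long[]> sumAndCount = new TreeMap<>();
        for (long[] s : sessions) {
            long length = s[1] - s[0];
            long stamp = s[1] - 1;   // assumed output timestamp for the session's length
            long windowStart = stamp - Math.floorMod(stamp, windowSize);
            long[] acc = sumAndCount.computeIfAbsent(windowStart, k -> new long[2]);
            acc[0] += length;
            acc[1] += 1;
        }
        Map<Long, Double> means = new TreeMap<>();
        for (Map.Entry<Long, long[]> e : sumAndCount.entrySet()) {
            means.put(e.getKey(), (double) e.getValue()[0] / e.getValue()[1]);
        }
        return means;
    }
}
```

The point of the slide is composability: the output of one windowed aggregation is an ordinary collection that can be windowed and aggregated again with a different strategy.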

Page 47

1. Classic Batch
2. Batch with Fixed Windows
3. Streaming
4. Streaming with Speculative + Late Data
5. Streaming With Retractions
6. Sessions

Page 49

1. Classic Batch

PCollection<KV<String, Integer>> scores = input
    .apply(Sum.integersPerKey());

2. Batch with Fixed Windows

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2))))
    .apply(Sum.integersPerKey());

3. Streaming

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());

4. Streaming with Speculative + Late Data

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1))))
    .apply(Sum.integersPerKey());

5. Streaming With Retractions

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingAndRetractingFiredPanes())
    .apply(Sum.integersPerKey());

6. Sessions

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(Sessions.withGapDuration(Minutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingAndRetractingFiredPanes())
    .apply(Sum.integersPerKey());

Page 51

4. Apache Beam (incubating)

Page 52

The Evolution of Beam

[Figure: Google data systems leading to Google Cloud Dataflow and Apache Beam: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel]

Page 53

What is Part of Apache Beam?

1. The Beam Model: What / Where / When / How

2. SDKs for writing Beam pipelines -- Java and Python

3. Runners for Existing Distributed Processing Backends
   • Apache Flink
   • Apache Spark
   • Google Cloud Dataflow
   • Direct runner for local development and testing
   • In development: Apache Gearpump and Apache Apex

Page 54

Apache Beam Technical Vision

1. End users: who want to write pipelines or transform libraries in a language that’s familiar.

2. SDK writers: who want to make Beam concepts available in new languages.

3. Runner writers: who have a distributed processing environment and want to support Beam pipelines.

[Figure: SDKs (Beam Java, Beam Python, Other Languages) feed Beam Model: Pipeline Construction; Beam Model: Fn Runners connects to Runner A, Runner B, and Cloud Dataflow, each with an Execution layer]

Page 55

Visions are a Journey

Early 2016: Internal API redesign and chaos
Mid 2016: API Stabilization
Late 2016: Multiple runners execute Beam pipelines

2016-02-01: Enter Apache Incubator
2016-02-25: 1st commit to ASF repository
2016-06-08: 0.1.0-incubating release
2016-07-28: 0.2.0-incubating release
2016-10-21: Three new committers
2016-10-31: 0.3.0-incubating release

Page 56

Categorizing Runner Capabilities

http://beam.incubator.apache.org/documentation/runners/capability-matrix/

Page 57

Learn More!

Streaming Fundamentals: The World Beyond Batch 101 & 102
http://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
http://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

Apache Beam (incubating)
http://beam.incubator.apache.org

Join the Beam [email protected]@beam.incubator.apache.org

Slides for this talk
http://goo.gl/yzvLXe

Follow @ApacheBeam on Twitter

Page 58

Thank you!