Fundamentals of Stream Processing with Apache Beam (incubating)

Frances Perry & Tyler Akidau
@francesjperry, @takidau
Apache Beam Committers & Google Engineers

QCon San Francisco -- November 2016

Google Docs version of slides (including animations): https://goo.gl/yzvLXe

May 28, 2020



Page 2

Agenda

1. Infinite, Out-of-Order Data Sets

2. What, Where, When, How

3. Reasons This is Awesome

4. Apache Beam (incubating)

Page 3

1. Infinite, Out-of-Order Data Sets

Page 4

Data...

Page 5

...can be big...

Page 6

...really, really big...

[Figure: data grouped by day: Tuesday, Wednesday, Thursday]

Page 7

...maybe infinitely big...

[Figure: timeline from 8:00 to 14:00]

Page 8

...with unknown delays.

[Figure: timeline from 8:00 to 14:00, with three elements labeled 8:00]

Page 9

Element-wise transformations

[Figure: axis: Processing Time, 8:00 to 14:00]

Page 10

Aggregating via Processing-Time Windows

[Figure: axis: Processing Time, 8:00 to 14:00]

Page 11

Aggregating via Event-Time Windows

[Figure: Input and Output rows; axes: Event Time and Processing Time, each 10:00 to 15:00]

Page 12

Formalizing Event-Time Skew

[Figure: Processing Time (y-axis) vs. Event Time (x-axis); labels: Ideal, Reality, Skew]

Page 13

Formalizing Event-Time Skew

Watermarks describe event time progress.

"No timestamp earlier than the watermark will be seen"

Often heuristic-based.

Too Slow? Results are delayed.

Too Fast? Some data is late.

[Figure: Processing Time (y-axis) vs. Event Time (x-axis); labels: Ideal, ~Watermark, Skew]
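One simple heuristic in the spirit of the watermark slide can be sketched in plain Java, with no Beam dependency. The class `WatermarkSketch` and the parameter `maxSkewMillis` are hypothetical names, and the rule used here (watermark = largest event time seen so far, minus a fixed skew bound) is an assumption chosen for illustration; it is not how any particular Beam runner computes its watermark.

```java
/** Toy heuristic watermark (illustrative assumption, not a Beam API). */
final class WatermarkSketch {
    private final long maxSkewMillis;              // assumed bound on out-of-orderness
    private long maxEventTimeSeen = Long.MIN_VALUE;

    WatermarkSketch(long maxSkewMillis) {
        this.maxSkewMillis = maxSkewMillis;
    }

    /** Current estimate: no earlier timestamp is expected to arrive after this point. */
    long currentWatermark() {
        if (maxEventTimeSeen == Long.MIN_VALUE) {
            return Long.MIN_VALUE;                 // nothing observed yet
        }
        return maxEventTimeSeen - maxSkewMillis;
    }

    /** Observes one element and reports whether it arrived behind the watermark (late). */
    boolean observe(long eventTimeMillis) {
        boolean late = eventTimeMillis < currentWatermark();
        maxEventTimeSeen = Math.max(maxEventTimeSeen, eventTimeMillis);
        return late;
    }

    public static void main(String[] args) {
        WatermarkSketch watermark = new WatermarkSketch(5_000L);
        long[] arrivals = {10_000L, 7_000L, 20_000L, 12_000L};
        for (long eventTime : arrivals) {
            boolean late = watermark.observe(eventTime);
            System.out.println(eventTime + " late=" + late
                + " watermark=" + watermark.currentWatermark());
        }
    }
}
```

In this sketch a small `maxSkewMillis` lets the watermark advance quickly, so more elements are flagged late; a large value holds the watermark back, which delays any result that waits on it.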

Page 14

2. What, Where, When, How

Page 15

What are you computing?

Where in event time?

When in processing time?

How do refinements relate?

Page 16

What are you computing?

What / Where / When / How

Element-Wise, Aggregating, Composite

Page 17

What: Computing Integer Sums

    // Collection of raw log lines
    PCollection<String> raw = IO.read(...);

    // Element-wise transformation into team/score pairs
    PCollection<KV<String, Integer>> input =
        raw.apply(ParDo.of(new ParseFn()));

    // Composite transformation containing an aggregation
    PCollection<KV<String, Integer>> scores =
        input.apply(Sum.integersPerKey());

Page 18

What: Computing Integer Sums

Page 19

What: Computing Integer Sums

Page 20

Where in event time?

Windowing divides data into event-time-based finite chunks.

Often required when doing aggregations over unbounded data.

[Figure: Fixed, Sliding, and Sessions windows shown for Key 1, Key 2, and Key 3 along a Time axis]
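As a rough illustration of how an element's event timestamp maps to windows, the following plain-Java sketch (no Beam dependency) finds the fixed window an element falls into and the sliding windows that contain it. `WindowAssignSketch` and its method names are hypothetical; windows are assumed to be half-open intervals [start, start + size) aligned to timestamp 0, which may differ from a real runner's conventions.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy window assignment by event timestamp (illustrative assumption, not a Beam API). */
final class WindowAssignSketch {

    /** Start of the fixed window [start, start + size) that contains ts. */
    static long fixedWindowStart(long ts, long size) {
        return ts - Math.floorMod(ts, size);
    }

    /** Starts of every sliding window [start, start + size) that contains ts; a new window begins each period. */
    static List<Long> slidingWindowStarts(long ts, long size, long period) {
        List<Long> starts = new ArrayList<>();
        long latestStart = ts - Math.floorMod(ts, period);
        // Walk backwards while the window still covers ts.
        for (long start = latestStart; start > ts - size; start -= period) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // Timestamps in seconds: 2-minute fixed windows, then 2-minute windows sliding every minute.
        System.out.println(fixedWindowStart(125L, 120L));          // prints 120
        System.out.println(slidingWindowStarts(125L, 120L, 60L));  // prints [120, 60]
    }
}
```

An element belongs to exactly one fixed window but to several sliding windows whenever the period is shorter than the window size.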

Page 21

Where: Fixed 2-minute Windows

    PCollection<KV<String, Integer>> scores = input
        .apply(Window.into(FixedWindows.of(Minutes(2))))
        .apply(Sum.integersPerKey());

Page 22

Where: Fixed 2-minute Windows

Page 23

When in processing time?

• Triggers control when results are emitted.

• Triggers are often relative to the watermark.

[Figure: Processing Time (y-axis) vs. Event Time (x-axis); labels: Ideal, ~Watermark, Skew]

Page 24

When: Triggering at the Watermark

    PCollection<KV<String, Integer>> scores = input
        .apply(Window.into(FixedWindows.of(Minutes(2)))
            .triggering(AtWatermark()))
        .apply(Sum.integersPerKey());

Page 25

When: Triggering at the Watermark

Page 26

When: Early and Late Firings

    PCollection<KV<String, Integer>> scores = input
        .apply(Window.into(FixedWindows.of(Minutes(2)))
            .triggering(AtWatermark()
                .withEarlyFirings(AtPeriod(Minutes(1)))
                .withLateFirings(AtCount(1))))
        .apply(Sum.integersPerKey());

Page 27

When: Early and Late Firings

Page 28

How do refinements relate?

• How should multiple outputs per window accumulate?
• Appropriate choice depends on consumer.

Firing           Elements   Discarding   Accumulating   Acc. & Retracting
Speculative      [3]        3            3              3
Watermark        [5, 1]     6            9              9, -3
Late             [2]        2            11             11, -9
Last Observed               2            11             11
Total Observed              11           23             11

(Accumulating & Retracting not yet implemented.)
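The three accumulation modes can be reproduced for the firings in the table ([3], then [5, 1], then [2]) with a small plain-Java sketch. `AccumulationSketch` and its method names are hypothetical and use no Beam APIs; the retraction is modeled here simply as the negated previous total, which is an assumption about representation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of pane accumulation for an integer sum (illustrative, not a Beam API). */
final class AccumulationSketch {

    private static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    /** Discarding: each pane sums only the elements that arrived since the previous firing. */
    static List<Integer> discarding(List<List<Integer>> firings) {
        List<Integer> panes = new ArrayList<>();
        for (List<Integer> elements : firings) {
            panes.add(sum(elements));
        }
        return panes;
    }

    /** Accumulating: each pane is the running total over all elements seen so far. */
    static List<Integer> accumulating(List<List<Integer>> firings) {
        List<Integer> panes = new ArrayList<>();
        int running = 0;
        for (List<Integer> elements : firings) {
            running += sum(elements);
            panes.add(running);
        }
        return panes;
    }

    /** Accumulating and retracting: the running total, plus a negated copy of the previous pane. */
    static List<List<Integer>> accumulatingAndRetracting(List<List<Integer>> firings) {
        List<List<Integer>> panes = new ArrayList<>();
        int running = 0;
        boolean first = true;
        for (List<Integer> elements : firings) {
            int previous = running;
            running += sum(elements);
            List<Integer> pane = new ArrayList<>();
            pane.add(running);
            if (!first) {
                pane.add(-previous);   // retract what was emitted last time
            }
            panes.add(pane);
            first = false;
        }
        return panes;
    }

    public static void main(String[] args) {
        List<List<Integer>> firings = Arrays.asList(
            Arrays.asList(3), Arrays.asList(5, 1), Arrays.asList(2));
        System.out.println(discarding(firings));                 // prints [3, 6, 2]
        System.out.println(accumulating(firings));               // prints [3, 9, 11]
        System.out.println(accumulatingAndRetracting(firings));  // prints [[3], [9, -3], [11, -9]]
    }
}
```

Summing every emitted value gives 11 for discarding, 23 for accumulating, and 11 for accumulating and retracting, which matches the "Total Observed" row: a consumer that adds panes together gets the right answer only with discarding or with retractions.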

Page 29

How: Add Newest, Remove Previous

    PCollection<KV<String, Integer>> scores = input
        .apply(Window.into(FixedWindows.of(Minutes(2)))
            .triggering(AtWatermark()
                .withEarlyFirings(AtPeriod(Minutes(1)))
                .withLateFirings(AtCount(1)))
            .accumulatingAndRetractingFiredPanes())
        .apply(Sum.integersPerKey());

Page 30

How: Add Newest, Remove Previous

Page 31

3. Reasons This is Awesome

Page 32

Correctness, Power, Composability, Flexibility, Modularity

What / Where / When / How

Page 33

Correctness, Power, Composability, Flexibility, Modularity

What / Where / When / How

Page 34

Distributed Systems are Distributed

Page 35

Processing Time Results Differ

Page 36

Event Time Results are Stable

Page 37

Correctness, Power, Composability, Flexibility, Modularity

What / Where / When / How

Page 38

Identifying Bursts of User Activity

Page 39

Sessions

    PCollection<KV<String, Integer>> scores = input
        .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
            .triggering(AtWatermark()
                .withEarlyFirings(AtPeriod(Minutes(1)))
                .withLateFirings(AtCount(1)))
            .accumulatingAndRetractingFiredPanes())
        .apply(Sum.integersPerKey());
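The idea behind session windows, grouping one key's elements into bursts separated by a gap of inactivity, can be sketched in plain Java without Beam. `SessionSketch` is a hypothetical name; the sketch assumes each element starts a provisional window [ts, ts + gap) and that overlapping windows merge, and it processes one key's timestamps in a batch rather than incrementally as a streaming runner would.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy session builder for a single key (illustrative assumption, not a Beam API). */
final class SessionSketch {

    /** Returns sessions as {start, end} pairs, where end = last timestamp in the session + gap. */
    static List<long[]> sessions(long[] timestamps, long gap) {
        long[] sorted = timestamps.clone();
        Arrays.sort(sorted);                       // arrival order does not matter
        List<long[]> result = new ArrayList<>();
        for (long ts : sorted) {
            long[] current = result.isEmpty() ? null : result.get(result.size() - 1);
            if (current != null && ts < current[1]) {
                current[1] = ts + gap;             // within the gap: extend the open session
            } else {
                result.add(new long[] {ts, ts + gap});  // gap exceeded: start a new session
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Out-of-order timestamps in seconds, 1-minute gap.
        for (long[] s : sessions(new long[] {200L, 0L, 50L, 30L, 230L}, 60L)) {
            System.out.println("[" + s[0] + ", " + s[1] + ")");
        }
    }
}
```

Because sessions are defined over event time, the result is the same whatever order the elements arrive in; a late element can also bridge two existing sessions into one, which is one reason retractions are useful with this windowing strategy.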

Page 40

Identifying Bursts of User Activity

Page 41

Correctness, Power, Composability, Flexibility, Modularity

What / Where / When / How

Page 42

Calculating Session Lengths

    input
        .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
            .triggering(AtWatermark())
            .discardingFiredPanes())
        .apply(CalculateWindowLength());

Page 43

Page 44

Calculating the Average Session Length

    input
        .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
            .triggering(AtWatermark())
            .discardingFiredPanes())
        .apply(CalculateWindowLength())
        .apply(Window.into(FixedWindows.of(Minutes(2)))
            .triggering(AtWatermark()
                .withEarlyFirings(AtPeriod(Minutes(1))))
            .accumulatingFiredPanes())
        .apply(Mean.globally());

Page 45

Page 46

Correctness, Power, Composability, Flexibility, Modularity

What / Where / When / How

Page 47

1. Classic Batch

2. Batch with Fixed Windows

3. Streaming

4. Streaming with Speculative + Late Data

5. Streaming With Retractions

6. Sessions

Page 48

Correctness, Power, Composability, Flexibility, Modularity

What / Where / When / How

Page 49

1. Classic Batch

    PCollection<KV<String, Integer>> scores = input
        .apply(Sum.integersPerKey());

2. Batch with Fixed Windows

    PCollection<KV<String, Integer>> scores = input
        .apply(Window.into(FixedWindows.of(Minutes(2))))
        .apply(Sum.integersPerKey());

3. Streaming

    PCollection<KV<String, Integer>> scores = input
        .apply(Window.into(FixedWindows.of(Minutes(2)))
            .triggering(AtWatermark()))
        .apply(Sum.integersPerKey());

4. Streaming with Speculative + Late Data

    PCollection<KV<String, Integer>> scores = input
        .apply(Window.into(FixedWindows.of(Minutes(2)))
            .triggering(AtWatermark()
                .withEarlyFirings(AtPeriod(Minutes(1)))
                .withLateFirings(AtCount(1))))
        .apply(Sum.integersPerKey());

5. Streaming With Retractions

    PCollection<KV<String, Integer>> scores = input
        .apply(Window.into(FixedWindows.of(Minutes(2)))
            .triggering(AtWatermark()
                .withEarlyFirings(AtPeriod(Minutes(1)))
                .withLateFirings(AtCount(1)))
            .accumulatingAndRetractingFiredPanes())
        .apply(Sum.integersPerKey());

6. Sessions

    PCollection<KV<String, Integer>> scores = input
        .apply(Window.into(Sessions.withGapDuration(Minutes(2)))
            .triggering(AtWatermark()
                .withEarlyFirings(AtPeriod(Minutes(1)))
                .withLateFirings(AtCount(1)))
            .accumulatingAndRetractingFiredPanes())
        .apply(Sum.integersPerKey());

Page 50

Correctness, Power, Composability, Flexibility, Modularity

What / Where / When / How

Page 51

4. Apache Beam (incubating)

Page 52

The Evolution of Beam

[Figure: labels: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel, Google Cloud Dataflow, Apache Beam]

Page 53

What is Part of Apache Beam?

1. The Beam Model: What / Where / When / How

2. SDKs for writing Beam pipelines -- Java and Python

3. Runners for Existing Distributed Processing Backends
• Apache Flink
• Apache Spark
• Google Cloud Dataflow
• Direct runner for local development and testing
• In development: Apache Gearpump and Apache Apex

Page 54

Apache Beam Technical Vision

1. End users: who want to write pipelines or transform libraries in a language that’s familiar.

2. SDK writers: who want to make Beam concepts available in new languages.

3. Runner writers: who have a distributed processing environment and want to support Beam pipelines.

[Figure: labels: Beam Java, Beam Python, Other Languages; Beam Model: Pipeline Construction; Runner A, Runner B, Cloud Dataflow; Beam Model: Fn Runners; Execution (three times)]

Page 55

Visions are a Journey

• Early 2016: Internal API redesign and chaos
• Mid 2016: API Stabilization
• Late 2016: Multiple runners execute Beam pipelines

• 2016-02-01: Enter Apache Incubator
• 2016-02-25: 1st commit to ASF repository
• 2016-06-08: 0.1.0-incubating release
• 2016-07-28: 0.2.0-incubating release
• 2016-10-21: Three new committers
• 2016-10-31: 0.3.0-incubating release

Page 56

Categorizing Runner Capabilities

http://beam.incubator.apache.org/documentation/runners/capability-matrix/

Page 57

Learn More!

Streaming Fundamentals: The World Beyond Batch 101 & 102

http://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
http://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

Apache Beam (incubating)
http://beam.incubator.apache.org

Join the Beam [email protected]@beam.incubator.apache.org

Slides for this talk
http://goo.gl/yzvLXe

Follow @ApacheBeam on Twitter

Page 58

Thank you!