Introduction to Apache Flink™: How Stream Processing is Shaping the Data Engineering Space [email protected] Tzu-Li (Gordon) Tai @tzulitai
Feb 13, 2017

Page 1: Introduction to Apache Flink™: How Stream Processing is Shaping ...

Introduction to Apache Flink™: How Stream Processing is Shaping the Data Engineering Space

[email protected]

Tzu-Li (Gordon) Tai

@tzulitai

Page 2: Introduction to Apache Flink™: How Stream Processing is Shaping ...

Who am I?

● 戴資力 (Gordon)

● Apache Flink Committer

● Co-organizer of Apache Flink Taiwan User Group

● Software Engineer @ VMFive

● Java, Scala

● Enjoys developing distributed systems

Page 3: Introduction to Apache Flink™: How Stream Processing is Shaping ...

Data Streaming is becoming increasingly popular

Page 4: Introduction to Apache Flink™: How Stream Processing is Shaping ...

Stream processing is enabling the obvious: continuous processing on data that is continuously produced

Page 5: Introduction to Apache Flink™: How Stream Processing is Shaping ...

Streaming is the next programming paradigm for data applications, and you need to start thinking in terms of streams

Page 6: Introduction to Apache Flink™: How Stream Processing is Shaping ...

01 The Traditional Batch Way

[Figure: timeline (t) of HDFS files, processed by MapReduce / Spark / Flink jobs]

● Continuously ingesting data

● Periodic batch files

● Periodic batch jobs

Page 7: Introduction to Apache Flink™: How Stream Processing is Shaping ...

01 The Traditional Batch Way

[Figure: timeline (t) with intermediate results that cross a batch boundary]

● Jobs often have “dangling” results near batch boundaries

● These need to be saved and fed into the next batch job

Page 8: Introduction to Apache Flink™: How Stream Processing is Shaping ...

02 Key Observations for Batch

● Way too many moving parts

● Implicit treatment of time (the batch boundaries)

● Treating continuous state as discrete

● Troublesome to get accurate, correct results

Page 9: Introduction to Apache Flink™: How Stream Processing is Shaping ...

03 The “Ideal” Streaming Way

[Figure: event stream along a time axis (t)]

A streaming processor that handles

(1) continuous state
(2) out-of-order events

scalably, robustly, and efficiently

Page 10: Introduction to Apache Flink™: How Stream Processing is Shaping ...

04 Apache Flink

Apache Flink: an open-source platform for distributed stream and batch data processing

● Apache Top-Level Project since Jan. 2015

● Streaming Dataflow Engine at its core
○ Low latency
○ High throughput
○ Stateful
○ Accurate
○ Distributed

Page 11: Introduction to Apache Flink™: How Stream Processing is Shaping ...

04 Apache Flink

Apache Flink: an open-source platform for distributed stream and batch data processing

● ~260 contributors, ~25 committers / PMC members

● Adoption:
○ Alibaba - realtime search optimization
○ Uber - ride request fulfillment marketplace
○ Netflix - Stream Processing as a Service (SPaaS)
○ Kings Gaming - realtime data science dashboard
○ ...

Page 12: Introduction to Apache Flink™: How Stream Processing is Shaping ...

04 Apache Flink

Page 13: Introduction to Apache Flink™: How Stream Processing is Shaping ...

05 Scala Collection-like API

case class Word(word: String, count: Int)

DataSet API

// Batch: word count over a bounded text file
val lines: DataSet[String] = env.readTextFile(...)

lines.flatMap(_.split(" ")).map(word => Word(word, 1)).groupBy("word").sum("count").print()

DataStream API

// Streaming: word count per 5-second window over an unbounded Kafka stream
val lines: DataStream[String] = env.addSource(new KafkaSource(...))

lines.flatMap(_.split(" ")).map(word => Word(word, 1)).keyBy("word").timeWindow(Time.seconds(5)).sum("count").print()

Page 14: Introduction to Apache Flink™: How Stream Processing is Shaping ...

05 Scala Collection-like API

.filter(...).flatMap(...).map(...).groupBy(...).reduce(...)

● Becoming the de facto standard for the new generation of APIs that express data pipelines

● Apache Spark, Apache Flink, Apache Beam ...
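
To make the "collection-like" claim concrete, here is a minimal sketch of the same word count written against plain Scala collections, with no Flink dependency. The object name `CollectionWordCount` and the method `count` are hypothetical names chosen for illustration; only the `Word` case class comes from the slides.

```scala
// Plain Scala collections only; this is not Flink code.
object CollectionWordCount {
  case class Word(word: String, count: Int)

  // Same pipeline shape as the slides: flatMap -> map -> groupBy -> sum.
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" ").toSeq)      // split every line into words
      .filter(_.nonEmpty)               // drop empty tokens from repeated spaces
      .map(word => Word(word, 1))       // pair each word with a count of 1
      .groupBy(_.word)                  // group the pairs by word
      .map { case (word, group) => word -> group.map(_.count).sum }
}
```

Calling `CollectionWordCount.count(Seq("to be or", "not to be"))` yields a map with `to -> 2`, `be -> 2`, `or -> 1`, and `not -> 1`. The distributed APIs keep this shape but run each step in parallel across a cluster.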

Page 15: Introduction to Apache Flink™: How Stream Processing is Shaping ...

06 What does Flink’s Engine do?

[Figure: a stream of records passing through “Your Code”, which processes records one at a time]

● Computation on a never-ending stream of data records

Page 16: Introduction to Apache Flink™: How Stream Processing is Shaping ...

06 What does Flink’s Engine do?

[Figure: three parallel instances of “Your Code”]

● The system distributes the computation across the cluster

Page 17: Introduction to Apache Flink™: How Stream Processing is Shaping ...

07 Streaming Dataflow Runtime

[Figure: Flink runtime architecture]

● Client: Application Code (DataSet / DataStream) → Optimizer / Graph Generator → JobGraph (logical)

● JobManager: ExecutionGraph (parallel)

● TaskManagers: concurrently executed, with distributed queues as push-based data shipping channels

Page 18: Introduction to Apache Flink™: How Stream Processing is Shaping ...

07 Streaming Dataflow Runtime

● A slightly closer look into the transmission of data ...

1. Record “A” enters Task 1, and is processed

2. The record is serialized into an output buffer at Task 1 (taken from an output buffer pool)

3. The buffer is shipped to Task 2’s input buffer (taken from an input buffer pool)

Observation: Buffers need to be available throughout the process (think blocking queues used between threads)

Page 19: Introduction to Apache Flink™: How Stream Processing is Shaping ...

07 Streaming Dataflow Runtime

● Natural, built-in backpressure

● Backpressure is needed when a system receives data at a higher rate than it can process during a temporary load spike
○ e.g. GC stalls at processing tasks
○ e.g. a natural load spike at the data source

[Figure: three panels: normal stable case; temporary load spike; ideal backpressure handling]
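
The "blocking queues used between threads" analogy can be sketched directly. The example below is an assumption-laden illustration, not Flink's network stack: it connects a fast producer and a slow consumer with a bounded `java.util.concurrent.ArrayBlockingQueue`, and the names `BackpressureSketch` and `run` are hypothetical.

```scala
import java.util.concurrent.ArrayBlockingQueue

object BackpressureSketch {
  // Runs a fast producer and a slow consumer connected by a bounded queue.
  // Returns the consumed records and the largest queue size the consumer observed.
  def run(records: Int, capacity: Int): (Vector[Int], Int) = {
    val queue = new ArrayBlockingQueue[Int](capacity)

    val producer = new Thread(new Runnable {
      // put() blocks while the queue is full, which slows the producer down.
      def run(): Unit = (1 to records).foreach(i => queue.put(i))
    })
    producer.start()

    var consumed = Vector.empty[Int]
    var maxSeen = 0
    while (consumed.size < records) {
      maxSeen = math.max(maxSeen, queue.size())
      consumed = consumed :+ queue.take() // take() blocks while the queue is empty
      Thread.sleep(1)                     // simulate a slow downstream task
    }
    producer.join()
    (consumed, maxSeen)
  }
}
```

Because the queue is bounded, the producer can never run more than `capacity` records ahead of the consumer. No record is dropped and no unbounded buffer builds up; the sender simply waits until a buffer becomes free.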

Page 20: Introduction to Apache Flink™: How Stream Processing is Shaping ...

07 Flexible Windows

● Because it processes records one at a time, Flink has very powerful built-in windowing (certainly among the best of the current streaming frameworks)
○ Time-driven: tumbling window, sliding window
○ Data-driven: count window, session window

Page 21: Introduction to Apache Flink™: How Stream Processing is Shaping ...

07 Time Windows

[Figure: two panels: Tumbling Time Window; Sliding Time Window]
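
As a rough sketch of the difference between the two window types, the functions below assign a timestamp to its tumbling window and to its sliding windows. This is plain Scala written for illustration, not Flink's window assigner API. The names `TimeWindows`, `Window`, `tumbling`, and `sliding` are hypothetical, and timestamps are assumed to be non-negative milliseconds.

```scala
object TimeWindows {
  // A window covers the half-open interval [start, end) in milliseconds.
  case class Window(start: Long, end: Long)

  // Tumbling: windows do not overlap, so each timestamp belongs to exactly one.
  def tumbling(timestamp: Long, size: Long): Window = {
    val start = timestamp - (timestamp % size)
    Window(start, start + size)
  }

  // Sliding: a new window starts every `slide` ms, so windows overlap when
  // slide < size. Returns every window containing the timestamp, latest first.
  def sliding(timestamp: Long, size: Long, slide: Long): Seq[Window] = {
    val lastStart = timestamp - (timestamp % slide)
    (lastStart until (timestamp - size) by -slide)
      .map(start => Window(start, start + size))
  }
}
```

For example, with a 10-second window sliding every 5 seconds, an event at 7000 ms falls into the two windows `[5000, 15000)` and `[0, 10000)`, while a 5-second tumbling window places it only in `[5000, 10000)`.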

Page 22: Introduction to Apache Flink™: How Stream Processing is Shaping ...

07 Count-Triggered Windows

Page 23: Introduction to Apache Flink™: How Stream Processing is Shaping ...

07 Session Windows
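
A session window groups events that arrive close together and closes after a period of inactivity. The sketch below shows the idea in plain Scala over a finite list of timestamps. It is an illustration under assumptions, not Flink's implementation: the names `SessionWindows` and `sessions` are hypothetical, and treating a gap exactly equal to `gap` as part of the same session is a choice made for this example.

```scala
object SessionWindows {
  // Groups timestamps into sessions: a new session starts whenever the
  // distance to the previous event is larger than `gap`.
  def sessions(timestamps: Seq[Long], gap: Long): Vector[Vector[Long]] =
    timestamps.sorted.foldLeft(Vector.empty[Vector[Long]]) { (acc, ts) =>
      acc.lastOption match {
        case Some(current) if ts - current.last <= gap =>
          acc.init :+ (current :+ ts) // extend the current session
        case _ =>
          acc :+ Vector(ts)           // start a new session
      }
    }
}
```

With a gap of 5, the timestamps 1, 2, 3, 10, 11, 30 form three sessions: (1, 2, 3), (10, 11), and (30). Unlike time windows, the boundaries come from the data itself rather than from a fixed grid.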

Page 24: Introduction to Apache Flink™: How Stream Processing is Shaping ...

08 What does Flink’s Engine do?

[Figure: “Your Code” processing a stream of records, with attached State]

● Computation and state, e.g.:
○ counters
○ in-progress windows
○ state machines
○ trained ML models

● Results depend on the history of the stream

● A stateful stream processor gives you tools to manage state
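
The simplest case from the list above, a counter, already shows why results depend on the history of the stream. The sketch below keeps one running count per key in a local mutable map. It is plain Scala for illustration only: `StatefulCounter` and `runningCounts` are hypothetical names, and a real stateful stream processor would manage this state for you rather than leave it in a local variable.

```scala
import scala.collection.mutable

object StatefulCounter {
  // Keyed state: one running count per key, updated one record at a time.
  def runningCounts(records: Seq[String]): Seq[(String, Int)] = {
    val state = mutable.Map.empty[String, Int]
    records.map { key =>
      val updated = state.getOrElse(key, 0) + 1
      state(key) = updated
      key -> updated // the emitted value depends on all earlier records with this key
    }
  }
}
```

For the input `a, b, a, a` the output is `(a,1), (b,1), (a,2), (a,3)`: the same record `a` produces a different result each time because the state has changed.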

Page 25: Introduction to Apache Flink™: How Stream Processing is Shaping ...

09 What does Flink’s Engine do?

[Figure: “Your Code” with State; events t1, t2, t3, t4 are grouped into the windows t1 - t2 and t3 - t4]

● Processing depends on the timestamps of when events were generated

● The core mechanism is called watermarks: basically a way to measure and advance the event-time clock, instead of relying on machine time
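
To illustrate how a watermark can drive time forward from the data itself, here is a minimal single-threaded sketch. It makes several assumptions that are not from the slides: the watermark trails the largest timestamp seen so far by a fixed `maxDelay`, windows are tumbling, a window fires once the watermark reaches its end, and events for an already-fired window are dropped. The names `WatermarkSketch`, `Event`, and `process` are hypothetical.

```scala
object WatermarkSketch {
  case class Event(timestamp: Long, value: String)

  // Returns (windowStart, values) pairs in the order the windows fire.
  // Windows still open when the input ends are not emitted.
  def process(events: Seq[Event], windowSize: Long, maxDelay: Long): Vector[(Long, Vector[String])] = {
    var open = Map.empty[Long, Vector[String]] // in-progress windows keyed by start
    var watermark = Long.MinValue
    var fired = Vector.empty[(Long, Vector[String])]

    events.foreach { e =>
      val start = e.timestamp - (e.timestamp % windowSize)
      // An event whose window has already fired is late; this sketch drops it.
      if (start + windowSize > watermark)
        open = open.updated(start, open.getOrElse(start, Vector.empty) :+ e.value)

      // The watermark trails the largest event timestamp seen so far by maxDelay.
      watermark = math.max(watermark, e.timestamp - maxDelay)

      // Fire every window whose end is at or before the watermark.
      val ready = open.keys.filter(_ + windowSize <= watermark).toVector.sorted
      ready.foreach(s => fired = fired :+ (s -> open(s)))
      open = open -- ready
    }
    fired
  }
}
```

With 5-second windows and a 4-second `maxDelay`, an event stamped 3000 that arrives after an event stamped 6000 is still counted in the window starting at 0, because that window only fires once the watermark passes 5000. With `maxDelay = 0` the same event would arrive too late and be dropped, which is the trade-off between latency and completeness.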

Page 26: Introduction to Apache Flink™: How Stream Processing is Shaping ...

09 Different Kinds of “Time”

Page 27: Introduction to Apache Flink™: How Stream Processing is Shaping ...

09 Why Wall Time is Incorrect

● Think of a Twitter hashtag count every 5 minutes
○ We want the result to reflect the number of tweets actually posted within a 5-minute window
○ Not the number of tweet events the stream processor receives within 5 minutes

Page 28: Introduction to Apache Flink™: How Stream Processing is Shaping ...

09 Why Wall Time is Incorrect

● Think of replaying a Kafka topic through a windowed streaming application …
○ If you’re replaying a queue, windows based on a wall clock are definitely wrong
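
The replay problem can be shown with a few lines of plain Scala. In this sketch each recorded event carries both the time it was created and the time the processor received it. The names `ReplaySketch`, `Tweet`, `byEventTime`, and `byWallClock` are hypothetical, and time is measured in minutes for readability.

```scala
object ReplaySketch {
  // eventTime: when the tweet was posted; arrivalTime: when the processor saw it.
  case class Tweet(eventTime: Long, arrivalTime: Long)

  // Counts timestamps per tumbling window, keyed by window start.
  private def windowCounts(times: Seq[Long], windowSize: Long): Map[Long, Int] =
    times.groupBy(t => t - (t % windowSize)).map { case (start, ts) => start -> ts.size }

  def byEventTime(tweets: Seq[Tweet], windowSize: Long): Map[Long, Int] =
    windowCounts(tweets.map(_.eventTime), windowSize)

  def byWallClock(tweets: Seq[Tweet], windowSize: Long): Map[Long, Int] =
    windowCounts(tweets.map(_.arrivalTime), windowSize)
}
```

Take three tweets posted at minutes 0, 1, and 6. Processed live, both methods put two tweets in the window starting at 0 and one in the window starting at 5. Replayed later in a burst at minute 100, `byEventTime` still returns the same result, while `byWallClock` reports all three tweets in the single window starting at 100.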

Page 29: Introduction to Apache Flink™: How Stream Processing is Shaping ...

10 Flink’s Streaming Fault Tolerance

[Figure: a streaming topology of “Your Code” operators, each with its own State]

● Any operator in a Flink streaming topology can be stateful

● How do we ensure that the states are correct after a failure?

Page 30: Introduction to Apache Flink™: How Stream Processing is Shaping ...

10 Flink’s Streaming Fault Tolerance

● First, a recap of some guarantee concepts:
○ At-least-once: records may be processed more than once. Think counting: records may be over-counted, resulting in wrong state
○ Exactly-once “state”: records appear to be processed only once, with respect to the state. Think counting: even on failure, each record is counted exactly once
○ End-to-end exactly-once: records appear to be processed only once, even to external systems. Think counting: for results stored externally, even after failure, the results remain correct

Page 31: Introduction to Apache Flink™: How Stream Processing is Shaping ...

11 Flink’s Streaming Fault Tolerance

[Figure: a streaming topology of “Your Code” operators, each with its own State]

● Flink checkpoints: a combined snapshot of all operator state, with the corresponding position in the source

● Based on the Chandy-Lamport algorithm: does not halt any computation while taking consistent snapshots
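
Why a checkpoint has to combine state with the source position can be seen in a small simulation. The sketch below is single-threaded plain Scala and does not model the distributed, non-halting snapshot algorithm at all. It only models what a checkpoint contains and how recovery uses it. The names `CheckpointSketch`, `Checkpoint`, and `run` are hypothetical, and `failAt` is assumed to be no larger than `totalRecords`.

```scala
object CheckpointSketch {
  // A checkpoint pairs the operator state with the source position it reflects.
  case class Checkpoint(offset: Int, count: Long)

  // Counts records and "crashes" after `failAt` records have been processed.
  // Recovery re-reads the source from the last checkpointed offset.
  // With restoreState = false only the source is rewound, so records between
  // the checkpoint and the crash are counted twice (at-least-once behaviour).
  def run(totalRecords: Int, checkpointEvery: Int, failAt: Int, restoreState: Boolean): Long = {
    var count = 0L
    var last = Checkpoint(0, 0L)

    // First attempt: process records until the simulated failure.
    for (offset <- 0 until failAt) {
      count += 1
      if ((offset + 1) % checkpointEvery == 0) last = Checkpoint(offset + 1, count)
    }

    // Recovery: rewind the source, and optionally roll the state back as well.
    if (restoreState) count = last.count
    for (_ <- last.offset until totalRecords) count += 1
    count
  }
}
```

For 10 records, a checkpoint every 4 records, and a failure after 6 records, `run(10, 4, 6, restoreState = true)` returns 10, while `run(10, 4, 6, restoreState = false)` returns 12. Rolling state and source position back together is what makes every record count exactly once with respect to the state.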

Page 32: Introduction to Apache Flink™: How Stream Processing is Shaping ...

12 Flink’s Savepoints

● Flink checkpoints: consistent snapshots of the whole topology state that the system periodically takes

● Flink savepoints: manually triggered checkpoints that can be persisted, and used to initialize state for a new streaming job

[Figure: timeline (t) with savepoints capturing the state at t1, t2, and t3]

Page 33: Introduction to Apache Flink™: How Stream Processing is Shaping ...

13 So, back to this ...

[Figure: timeline (t) of HDFS files, processed by MapReduce / Spark / Flink jobs]

Page 34: Introduction to Apache Flink™: How Stream Processing is Shaping ...

13 So, back to this ...

[Figure: event stream along a time axis (t)]

A streaming processor that handles

(1) continuous state
(2) out-of-order events

scalably, robustly, and efficiently

Page 35: Introduction to Apache Flink™: How Stream Processing is Shaping ...

14 What Flink provides, in a nutshell example

● No stateless point-in-time

Page 36: Introduction to Apache Flink™: How Stream Processing is Shaping ...

14 What Flink provides, in a nutshell example

● Processing, or re-processing, in the batch way

Page 37: Introduction to Apache Flink™: How Stream Processing is Shaping ...

14 What Flink provides, in a nutshell example

● Batch is inherently unsuitable for the nature of continuously generated data

● State is corrupt at boundaries

Page 38: Introduction to Apache Flink™: How Stream Processing is Shaping ...

14 What Flink provides, in a nutshell example

● Flink’s stateful streaming treats state as continuous: it processes your continuous data, and continuously generates results

Page 39: Introduction to Apache Flink™: How Stream Processing is Shaping ...

14 What Flink provides, in a nutshell example

● On reprocessing: the initial state for the job reflects all previous history in the stream

Page 40: Introduction to Apache Flink™: How Stream Processing is Shaping ...

14 What Flink provides, in a nutshell example

● On reprocessing: event-time processing guarantees correct results, even when fast-forwarding to the head of the stream

Page 41: Introduction to Apache Flink™: How Stream Processing is Shaping ...

15 Final Takeaways

● Stateful streaming correctly embraces the nature of continuously generated data, and is the new programming paradigm for applications built on such data.

● Streaming isn’t only about real time. Real-time results are just a natural advantage of streaming.

Page 42: Introduction to Apache Flink™: How Stream Processing is Shaping ...

15 Final Takeaways

● The choice is all about your data, and your code.

● Think:
○ Is your data unbounded, or bounded?
■ Unbounded: click streams, page visits, impressions …
■ Bounded: (???)

● Think:
○ Does your code change faster than your data?
■ Data exploration, data mining, feature engineering …
■ In this case, it doesn’t really matter whether you use batch or streaming
○ Or does your data change faster than your code?
■ Production ETL pipelines, warehousing, serving, etc.
■ For accuracy and robustness, definitely think and design in terms of streaming

Page 43: Introduction to Apache Flink™: How Stream Processing is Shaping ...

15 Final Takeaways

● Upcoming features in Flink:
○ Dynamic scaling, with stateful streaming
○ Queryable state
○ Incremental state checkpointing
○ Even more savepoint functionality

Page 44: Introduction to Apache Flink™: How Stream Processing is Shaping ...

15 Final Takeaways

● How Flink’s technology covers the application space:

Application | Technology
Realtime applications | Low-latency stateful streaming
Continuous applications | High-latency stateful streaming
Analytics on historical data | Batch as special case of streaming
Request/Response Apps | Queryable state