Apache Flink retrospective, roadmap, and vision

Flink history, roadmap and vision

Jul 17, 2015

Stephan Ewen
Transcript
Page 1: Flink history, roadmap and vision

Apache Flink: retrospective, roadmap, and vision

Page 2: Flink history, roadmap and vision

A trip down memory lane


Page 3: Flink history, roadmap and vision

April 16, 2014


Page 4: Flink history, roadmap and vision

Stratosphere 0.2: Pact API (Java) → Pact Optimizer → Pact Runtime & Nephele

Page 5: Flink history, roadmap and vision

Stratosphere 0.4: Pact API (Java) and DataSet API (Scala) → Stratosphere Optimizer → Stratosphere Runtime → Local / Remote / YARN

Page 6: Flink history, roadmap and vision

Stratosphere 0.5: DataSet API (Java) and DataSet API (Scala) → Stratosphere Optimizer → Stratosphere Runtime → Local / Remote / YARN

Page 7: Flink history, roadmap and vision

Flink 0.6: DataSet API (Java) and DataSet API (Scala) → Flink Optimizer → Flink Runtime → Local / Remote / YARN

Page 8: Flink history, roadmap and vision

Flink 0.7: DataSet (Java/Scala), DataStream (Java), Stream Builder, and Hadoop M/R → Flink Optimizer → Flink Runtime → Local / Remote / YARN / Embedded

Page 9: Flink history, roadmap and vision

Flink 0.8: DataSet (Java/Scala), DataStream (Java/Scala), Stream Builder, and Hadoop M/R → Flink Optimizer → Flink Runtime → Local / Remote / YARN / Embedded

Page 10: Flink history, roadmap and vision

Current master + some outstanding PRs: Python, Gelly, Table, ML, SAMOA, and Dataflow libraries on top of DataSet (Java/Scala), DataStream (Java/Scala), Stream Builder, and Hadoop M/R → Flink Optimizer → New Flink Runtime → Local / Remote / YARN / Tez / Embedded

Page 11: Flink history, roadmap and vision

Summary

Almost complete code rewrite from Stratosphere 0.2 to Flink 0.8

Project diversification

• Real-time data streaming

• Several frontends (targeting different user profiles and use cases)

• Several backends (targeting different production settings)

Integration with open source ecosystem


Page 12: Flink history, roadmap and vision

Community Activity

[Chart: #unique contributors by git commits (without manual de-dup), Aug 2010 – Jul 2015]

Page 13: Flink history, roadmap and vision

Vision for Flink


Page 14: Flink history, roadmap and vision

What are we building?

A "use-case complete" framework to unify batch & stream processing

[Diagram: Flink consumes event logs and historic data, serving ETL, relational queries, graph analysis, ML, and streaming aggregations]

Page 15: Flink history, roadmap and vision

What are we building?

Via an engine that puts equal emphasis on streaming and batch processing

[Diagram: real-time data streams (Kafka, RabbitMQ, ...) and historic data (HDFS, JDBC, ...) flow into Flink, which serves ETL, graphs, machine learning, relational workloads, and low-latency windowing and aggregations]

Page 16: Flink history, roadmap and vision

[Diagram: the Flink stack on current master: Python, Gelly, Table, ML, SAMOA, and Dataflow libraries on top of the DataSet (Java/Scala), DataStream (Java/Scala), Stream Builder, and Hadoop M/R APIs; the Flink Optimizer; the Flink Runtime; deployable Local, Remote, on YARN, on Tez, or Embedded]

This talk focuses on stream processing with Flink

Batch processing with Flink is better understood and has a clear roadmap

Page 17: Flink history, roadmap and vision

Life of data streams

Create: create streams from event sources (machines, databases, logs, sensors, …)

Collect: collect and make streams available for consumption (e.g., Apache Kafka)

Process: process streams, possibly generating derived streams (e.g., Apache Flink)


Page 18: Flink history, roadmap and vision

Lambda architecture

"Speed layer" can be a stream processing system

"Picks up" after the batch layer


Page 19: Flink history, roadmap and vision

Kappa architecture

The need for separate batch & speed layers is not fundamental, but practical with current technology

Idea: use a stream processing system for all data processing

They are all dataflows anyway

http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
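The Kappa idea can be sketched in a toy way: a batch job is just a streaming job whose input happens to be bounded, so one dataflow definition covers both. This is illustrative Python, not Flink code; `word_count` and `live_source` are hypothetical names.

```python
# Toy illustration (not Flink code): in the Kappa view, a batch job is a
# streaming job over a bounded, replayable stream, so the same dataflow
# logic can serve both the "batch" and the "speed" layer.

def word_count(stream):
    """One dataflow definition, applied to any iterable of lines."""
    counts = {}
    for line in stream:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

# "Batch": a bounded, historic dataset.
historic = ["to be", "or not to be"]

# "Streaming": the same logic over a generator-style source.
def live_source(lines):
    for line in lines:        # in a real system this source would not terminate
        yield line

assert word_count(historic) == word_count(live_source(historic))
```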

Page 20: Flink history, roadmap and vision

Data streaming with Flink

Flink is building a proper stream processing system

• that can execute both batch and stream jobs natively

• batch-only jobs pass through a different optimization code path

Flink is building libraries and DSLs on top of both batch and streaming

• e.g., see recent Table API


Page 21: Flink history, roadmap and vision

Additions to Kappa

Dataflow systems are good, but they are the bottom-most layer

In addition to a streaming dataflow system, we need
• Different APIs (e.g., window definitions)

• Different optimization code paths

• Different management of local memory and disk

Our approach: build these on top of a common distributed streaming dataflow system


Page 22: Flink history, roadmap and vision

Building blocks for streaming

Pipelining

Replay

Operator state

State backup

High-level language(s)

Integration with static sources

High availability


See also:
• Stonebraker et al., "The 8 requirements of real-time stream processing."
• https://highlyscalable.wordpress.com/2013/08/20/in-stream-big-data-processing/

Page 23: Flink history, roadmap and vision

Building blocks for streaming

Pipelining

• "Keep the data moving"

Replay

• Tolerate machine failures

Operator state

• For anything more interesting than filters

State backup/restore

• App does not worry about duplicates


Page 24: Flink history, roadmap and vision

Pipelining

Flink has always had pipelining

Pipelined shuffles, inspired by databases (e.g., Impala), used for batch

Later, the DataStream API used the same mechanism
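A minimal sketch of what pipelining means, using Python generators rather than Flink internals: each record flows through the whole operator chain as soon as it is produced ("keep the data moving"), instead of intermediate results being materialized stage by stage. Operator names here are hypothetical.

```python
# Toy illustration (not Flink internals): pipelined execution chains operators
# as generators, so records move through the dataflow one at a time rather
# than being fully materialized between stages.

trace = []

def source():
    for i in range(3):
        trace.append(f"emit {i}")
        yield i

def double(records):
    for r in records:
        trace.append(f"double {r}")
        yield r * 2

# Pipelined: each record travels end-to-end before the next is emitted.
results = list(double(source()))

assert results == [0, 2, 4]
# The interleaved trace shows data "kept moving" rather than batched up:
assert trace == ["emit 0", "double 0", "emit 1", "double 1", "emit 2", "double 2"]
```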


Page 25: Flink history, roadmap and vision

Pipelining


Page 26: Flink history, roadmap and vision

Replay

Storm acknowledges individual events (records)

Flink acknowledges batches of records
• Less overhead in the failure-free case
• Works only with fault-tolerant data sources (e.g., Kafka)
• Coming: retaining batches of input data in Flink sources for replay
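A toy sketch of batch-wise acknowledgment against a replayable source (illustrative Python; all names are hypothetical, this is neither the Flink nor the Kafka API): on failure the offset is simply not advanced, so the whole batch is read again.

```python
# Toy illustration: acknowledging batches of records instead of individual
# records means less bookkeeping in the failure-free case, but requires a
# source that can replay a batch, e.g. by Kafka-style offsets.

class ReplayableSource:
    """Kafka-like source: records are addressable by offset."""
    def __init__(self, records):
        self.records = records

    def read(self, start, size):
        return self.records[start:start + size]

def process(source, batch_size, fail_once_at=None):
    out, offset, failed = [], 0, False
    while True:
        batch = source.read(offset, batch_size)
        if not batch:
            return out
        try:
            if fail_once_at == offset and not failed:
                failed = True
                raise RuntimeError("worker crashed mid-batch")
            out.extend(r * 10 for r in batch)
            offset += batch_size          # acknowledge the whole batch at once
        except RuntimeError:
            pass                          # offset unchanged -> batch is replayed

records = list(range(6))
# Despite a simulated crash, replaying the failed batch yields the full result.
assert process(ReplayableSource(records), batch_size=2, fail_once_at=2) \
    == [r * 10 for r in records]
```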


Page 27: Flink history, roadmap and vision

Operator state

Flink operators can keep state
• in the form of arbitrary user-defined objects (e.g., a HashMap)
• in the form of windows (e.g., keep the last 100 elements)

Windows currently need to fit in memory

Work in progress:
• Move window state out of core
• Back up window state externally
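A toy sketch of window state, with the count window shrunk from "the last 100 elements" to 3 (illustrative Python, not the Flink API; the operator name is hypothetical):

```python
# Toy illustration: an operator whose state is a count window, i.e. it keeps
# the last `size` elements in memory and emits an aggregate over them.
from collections import deque

class SlidingAverage:
    """Stateful operator: average over the last `size` elements."""
    def __init__(self, size):
        self.window = deque(maxlen=size)   # window state, held in memory

    def on_element(self, value):
        self.window.append(value)          # oldest element drops out at maxlen
        return sum(self.window) / len(self.window)

op = SlidingAverage(size=3)
outputs = [op.on_element(v) for v in [3, 6, 9, 30]]
assert outputs == [3.0, 4.5, 6.0, 15.0]    # last value averages (6, 9, 30)
```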


Page 28: Flink history, roadmap and vision

State backup

Chandy-Lamport algorithm for consistent, asynchronous distributed snapshots

Pushes checkpoint barriers through the data flow: everything before a barrier is part of the snapshot, everything after it is not

[Diagram: a barrier travels with the data stream; when it reaches an operator, the operator's checkpoint starts, state is backed up until the next snapshot, and "checkpoint done" is reported]
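The barrier mechanism can be simulated in a few lines of toy Python (not Flink internals; the operator is hypothetical): records seen before a barrier are reflected in that barrier's snapshot, records after it are not.

```python
# Toy simulation of checkpoint barriers flowing through a stream: when an
# operator sees a barrier, it backs up its state as of all records before it.

BARRIER = object()

class CountingOperator:
    def __init__(self):
        self.count = 0            # operator state
        self.snapshots = []       # "backed up" copies of the state

    def on_element(self, element):
        if element is BARRIER:
            # Records before the barrier are in the snapshot; later ones are not.
            self.snapshots.append(self.count)
        else:
            self.count += 1

op = CountingOperator()
for e in [1, 2, 3, BARRIER, 4, 5, BARRIER, 6]:
    op.on_element(e)

assert op.snapshots == [3, 5]     # consistent points in the stream
assert op.count == 6              # live state keeps moving past the barriers
```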

Page 29: Flink history, roadmap and vision

Flink Streaming APIs

Current DataStream API has support for flexible windows

Apache SAMOA on Flink for Machine Learning on streams

Google Dataflow (stream functionality upcoming)

Table API (window defs upcoming)


Page 30: Flink history, roadmap and vision

Integrating batch with streaming

Page 31: Flink history, roadmap and vision

Batch + Streaming

Making the switch from batch to streaming easy will be key to boosting streaming adoption

Applications will need to combine streaming and static data sources

Flink supports this through a new hybrid runtime architecture

Page 32: Flink history, roadmap and vision

Two ways to think about computation

Operator-centric: runtime built around operators (e.g., Tez, Flink*, Dryad)

Intermediate data-centric: runtime built around intermediate datasets (e.g., Spark)

* previous versions of Flink

Page 33: Flink history, roadmap and vision

Hybrid runtime architecture

Separating
• control (program, scheduling) from
• data flow (data exchange)

Intermediate results are a handle to the data produced by an operator; they coordinate the "handshake" between data producer and data consumer:
• pipelined or batched
• ephemeral or checkpointed
• with or without back-pressure

Operators execute program code and heavy operations (sorting / hashing), and build state and windows.
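A toy sketch of the intermediate-result handle (illustrative Python, not Flink internals; the class name is hypothetical): the handle decides whether the exchange is pipelined and ephemeral (consumed as produced, then gone) or blocking and materialized (fully produced first, then re-consumable, e.g. for checkpointed replay).

```python
# Toy illustration: an intermediate result as a handle that coordinates the
# producer/consumer handshake, either pipelined or blocking.

def producer():
    for i in range(4):
        yield i * i

class IntermediateResult:
    def __init__(self, gen, pipelined):
        self.pipelined = pipelined
        self._gen = gen
        # Blocking mode materializes the producer's output up front.
        self._data = None if pipelined else list(gen)

    def consume(self):
        # Pipelined: hand records through as produced; blocking: replayable.
        return self._gen if self.pipelined else iter(self._data)

blocking = IntermediateResult(producer(), pipelined=False)
assert list(blocking.consume()) == [0, 1, 4, 9]
assert list(blocking.consume()) == [0, 1, 4, 9]   # materialized -> re-consumable

pipelined = IntermediateResult(producer(), pipelined=True)
assert list(pipelined.consume()) == [0, 1, 4, 9]
assert list(pipelined.consume()) == []            # ephemeral -> gone once consumed
```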

Page 34: Flink history, roadmap and vision

flink.apache.org

@ApacheFlink