Apache Flink Fast and Reliable Large-Scale Data Processing Fabian Hueske @fhueske 1

Apache Flink - Linux Foundation Eventsevents17.linuxfoundation.org/.../slides/flink-apachecon2.pdf · 2015-04-09 · Apache HBase Apache Kafka Apache Flume RabbitMQ Hadoop IO... Data

Jan 29, 2020


Apache Flink: Fast and Reliable Large-Scale Data Processing

Fabian Hueske, @fhueske


What is Apache Flink?

Distributed Data Flow Processing System

• Focused on large-scale data analytics

• Real-time stream and batch processing

• Easy and powerful APIs (Java / Scala)

• Robust execution backend


What is Flink good at?

It's a general-purpose data analytics system

• Real-time stream processing with flexible windows

• Complex and heavy ETL jobs

• Analyzing huge graphs

• Machine-learning on large data sets

• ...


Flink in the Hadoop Ecosystem

Libraries

• Table API, Gelly Library, ML Library, Apache SAMOA, Apache MRQL, Dataflow

Flink Core

• DataSet API (Java/Scala), DataStream API (Java/Scala), Optimizer, Stream Builder, Runtime

Environments

• Local, Cluster, Yarn, Apache Tez, Embedded

Data Sources

• HDFS, S3, JDBC, HCatalog, Apache HBase, Apache Kafka, Apache Flume, RabbitMQ, Hadoop IO, ...


Flink in the ASF

• Flink entered the ASF about one year ago

– 04/2014: Incubation

– 12/2014: Graduation

• Strongly growing community

[Chart: #unique git committers (w/o manual de-dup), plotted from Nov.10 to Dec.14 on a scale of 0 to 120]


Where is Flink moving?

A "use-case complete" framework to unify batch & stream processing

Flink

Data Streams

• Kafka

• RabbitMQ

• ...

“Historic” data

• HDFS

• JDBC

• ...

Analytical Workloads

• ETL

• Relational processing

• Graph analysis

• Machine learning

• Streaming data analysis

Goal: Treat batch as a finite stream

HOW TO USE FLINK?

Programming Model & APIs


Unified Java & Scala APIs

• Fluent and mirrored APIs in Java and Scala

• Table API for relational expressions

• Batch and streaming APIs are almost identical, with slightly different semantics in some cases


DataSets and Transformations

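ExecutionEnvironment env =
    ExecutionEnvironment.getExecutionEnvironment();

// inputPath is a placeholder for the path of the text file to read
DataSet<String> input = env.readTextFile(inputPath);

// keep only the lines that mention Apache Flink
DataSet<String> first = input
    .filter(str -> str.contains("Apache Flink"));

// convert the remaining lines to lower case
DataSet<String> second = first
    .map(str -> str.toLowerCase());

second.print();
env.execute();

Input → filter → First → map → Second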


Expressive Transformations

• Element-wise

– map, flatMap, filter, project

• Group-wise

– groupBy, reduce, reduceGroup, combineGroup, mapPartition, aggregate, distinct

• Binary

– join, coGroup, union, cross

• Iterations

– iterate, iterateDelta

• Physical re-organization

– rebalance, partitionByHash, sortPartition

• Streaming

– Window, windowMap, coMap, ...


Rich Type System

• Use any Java/Scala classes as a data type

– Tuples, POJOs, and case classes

– Not restricted to key-value pairs

• Define (composite) keys directly on data types

– Expression

– Tuple position

– Selector function


Counting Words in Batch and Stream

case class Word (word: String, frequency: Int)

DataStream API (streaming):

val lines: DataStream[String] = env.fromSocketStream(...)
lines.flatMap {line => line.split(" ")
    .map(word => Word(word,1))}
  .window(Count.of(1000)).every(Count.of(100))
  .groupBy("word").sum("frequency")
  .print()

DataSet API (batch):

val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap {line => line.split(" ")
    .map(word => Word(word,1))}
  .groupBy("word").sum("frequency")
  .print()
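To make the window clause of the streaming word count concrete, here is a minimal sketch in plain Java with no Flink dependency. It assumes that `window(Count.of(1000)).every(Count.of(100))` means "keep the most recent 1000 elements and emit the per-word counts after every 100 arrivals". That reading, the class name `SlidingCountWindowSketch`, and the method `windowedCounts` are assumptions made for illustration, not Flink API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SlidingCountWindowSketch {
    /** Counts words over the last `size` elements and emits a snapshot every `slide` elements. */
    static List<Map<String, Integer>> windowedCounts(List<String> words, int size, int slide) {
        Deque<String> window = new ArrayDeque<>();      // elements currently inside the window
        Map<String, Integer> counts = new HashMap<>();  // running counts for the window
        List<Map<String, Integer>> snapshots = new ArrayList<>();
        int seen = 0;
        for (String word : words) {
            window.addLast(word);
            counts.merge(word, 1, Integer::sum);
            if (window.size() > size) {
                // evict the oldest element and drop its entry when the count reaches zero
                String evicted = window.removeFirst();
                counts.computeIfPresent(evicted, (k, v) -> v == 1 ? null : v - 1);
            }
            seen++;
            if (seen % slide == 0) {
                snapshots.add(new TreeMap<>(counts));   // sorted copy of the current counts
            }
        }
        return snapshots;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a", "c", "a", "b");
        // window of 4 elements, evaluated every 2 elements
        System.out.println(windowedCounts(words, 4, 2));
        // prints [{a=1, b=1}, {a=2, b=1, c=1}, {a=2, b=1, c=1}]
    }
}
```

The batch variant corresponds to the degenerate case of a single snapshot taken after the last element with a window that never evicts.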

Page 13: Apache Flink - Linux Foundation Eventsevents17.linuxfoundation.org/.../slides/flink-apachecon2.pdf · 2015-04-09 · Apache HBase Apache Kafka Apache Flume RabbitMQ Hadoop IO... Data

Table API

• Execute SQL-like expressions on table data

– Tight integration with Java and Scala APIs

– Available for batch and streaming programs

val orders = env.readCsvFile(…)
  .as('oId, 'oDate, 'shipPrio)
  .filter('shipPrio === 5)

val items = orders
  .join(lineitems).where('oId === 'id)
  .select('oId, 'oDate, 'shipPrio,
    'extdPrice * (Literal(1.0f) - 'discnt) as 'revenue)

val result = items
  .groupBy('oId, 'oDate, 'shipPrio)
  .select('oId, 'revenue.sum, 'oDate, 'shipPrio)


Libraries are emerging

• As part of the Apache Flink project

– Gelly: Graph processing and analysis

– Flink ML: Machine-learning pipelines and algorithms

– Libraries are built on APIs and can be mixed with them

• Outside of Apache Flink

– Apache SAMOA (incubating)

– Apache MRQL (incubating)

– Google DataFlow translator


WHAT IS HAPPENING INSIDE?

Processing Engine


System Architecture

[Diagram: a Flink program passes from the client (pre-flight) to the master and on to the workers. Labeled components: cost-based optimizer, type extraction stack, memory manager, out-of-core algos, task scheduling, recovery metadata, data serialization stack, coordination, pipelined or blocking data transfer.]


Cool technology inside Flink

• Batch and Streaming in one system

• Memory-safe execution

• Built-in data flow iterations

• Cost-based data flow optimizer

• Flexible windows on data streams

• Type extraction and serialization utilities

• Static code analysis on user functions

• and much more...


STREAM AND BATCH IN ONE SYSTEM

Pipelined Data Transfer


Stream and Batch in one System

• Most systems are either stream or batch systems

• In the past, Flink focused on batch processing

– Flink's runtime has always done stream processing

– Operators pipeline data forward as soon as it is processed

– Some operators are blocking (such as sort)

• Stream API and operators are recent contributions

– Evolving very quickly under heavy development


Pipelined Data Transfer

• Pipelined data transfer has many benefits

– True stream and batch processing in one stack

– Avoids materialization of large intermediate results

– Better performance for many batch workloads

• Flink supports blocking data transfer as well


Pipelined Data Transfer

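[Diagram: Program: the large input is mapped into an intermediate DataSet, which is joined with the small input to produce the result. Pipelined execution: Pipeline 1 reads the small input and builds the hash table (HT) of the join; Pipeline 2 streams the large input through map and probes the hash table to produce the result.]

No intermediate materialization!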
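The two pipelines can be imitated in plain Java. The sketch below is not Flink code: it uses a `java.util.stream.Stream` as a stand-in for the pipelined large input, and the names `PipelinedJoinSketch` and `mapAndJoin` as well as the doubling map function are assumptions chosen for illustration. It shows a hash join whose build side is consumed first, while the probe side flows record by record through map and probe without being collected in between.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class PipelinedJoinSketch {
    /** Joins (key, value) records of the large input with the small input on the key. */
    static List<String> mapAndJoin(Map<Integer, String> smallInput, Stream<int[]> largeInput) {
        // Pipeline 1: consume the small input completely and build the hash table.
        Map<Integer, String> hashTable = new HashMap<>(smallInput);

        // Pipeline 2: each large-input record passes through map and probe one at a time;
        // the mapped records are never collected into an intermediate data set.
        return largeInput
                .map(r -> new int[] {r[0], r[1] * 2})         // the "map" step (doubling is arbitrary)
                .filter(r -> hashTable.containsKey(r[0]))     // probe: drop records without a match
                .map(r -> hashTable.get(r[0]) + ":" + r[1])   // emit the joined record
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<Integer, String> small = new HashMap<>();
        small.put(1, "x");
        small.put(3, "z");
        Stream<int[]> large = Stream.of(new int[] {1, 5}, new int[] {2, 7}, new int[] {3, 9});
        System.out.println(mapAndJoin(small, large)); // prints [x:10, z:18]
    }
}
```

Only the hash table of the small input is held in memory; the large input is touched once, which is the effect the slide attributes to pipelined execution.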


MEMORY SAFE EXECUTION

Memory Management and Out-of-Core Algorithms


Memory-safe Execution

• Challenge of JVM-based data processing systems

– OutOfMemoryErrors due to data objects on the heap

• Flink runs complex data flows without memory tuning

– C++-style memory management

– Robust out-of-core algorithms


Managed Memory

• Active memory management

– Workers allocate 70% of JVM memory as byte arrays

– Algorithms serialize data objects into byte arrays

– In-memory processing as long as data is small enough

– Otherwise partial destaging to disk

• Benefits

– Safe memory bounds (no OutOfMemoryError)

– Scales to very large JVMs

– Reduced GC pressure
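The idea of serializing records into a fixed memory budget and destaging when it is full can be sketched in a few lines of plain Java. This is an illustration under simplifying assumptions, not Flink's memory manager: the class `ManagedMemorySketch` is hypothetical, a single heap `ByteBuffer` plays the role of the pre-allocated byte arrays, and an in-memory list stands in for the disk.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ManagedMemorySketch {
    private final ByteBuffer segment;                        // fixed budget, allocated once
    private final List<byte[]> spilled = new ArrayList<>();  // stand-in for files on disk

    ManagedMemorySketch(int budgetBytes) {
        this.segment = ByteBuffer.allocate(budgetBytes);
    }

    /** Serializes a record as (length, bytes) into the segment, spilling first if it does not fit. */
    void add(String record) {
        byte[] data = record.getBytes(StandardCharsets.UTF_8);
        int needed = Integer.BYTES + data.length;
        if (needed > segment.capacity()) {
            throw new IllegalArgumentException("record larger than the memory budget");
        }
        if (needed > segment.remaining()) {
            spill();
        }
        segment.putInt(data.length).put(data);
    }

    /** Moves the current segment content out of memory and resets the segment. */
    private void spill() {
        byte[] chunk = new byte[segment.position()];
        segment.flip();
        segment.get(chunk);
        spilled.add(chunk);
        segment.clear();
    }

    int inMemoryBytes() { return segment.position(); }

    int spilledChunks() { return spilled.size(); }

    public static void main(String[] args) {
        ManagedMemorySketch memory = new ManagedMemorySketch(16);
        memory.add("abcd");   // 8 bytes, fits
        memory.add("efgh");   // 8 bytes, segment is now full
        memory.add("ij");     // does not fit: spill 16 bytes, then write 6 bytes
        System.out.println(memory.spilledChunks() + " chunk spilled, "
                + memory.inMemoryBytes() + " bytes in memory");
        // prints 1 chunk spilled, 6 bytes in memory
    }
}
```

Because the budget is allocated up front and never grows, running out of space leads to a spill instead of an `OutOfMemoryError`, and the garbage collector sees one long-lived buffer rather than many small record objects.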


Going out-of-core

Single-core join of 1KB Java objects beyond memory (4 GB)

Blue bars are in-memory, orange bars (partially) out-of-core


GRAPH ANALYSIS

Native Data Flow Iterations


Native Data Flow Iterations

• Many graph and ML algorithms require iterations

• Flink features native data flow iterations

– Loops are not unrolled

– But executed as cyclic data flows

• Two types of iterations

– Bulk iterations

– Delta iterations

• Performance competitive with specialized systems


Iterative Data Flows

• Flink runs iterations "natively" as cyclic data flows

– Operators are scheduled once

– Data is fed back through backflow channel

– Loop-invariant data is cached

• Operator state is preserved across iterations!

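[Diagram: iterative data flow. An initial result enters the loop; join (with other datasets) and reduce produce an intermediate result, which replaces the input of the next pass until the final result is emitted.]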
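The shape of such a bulk iteration can be sketched in plain Java with PageRank as the step function. This is an ordinary loop, so it does not reproduce Flink's cyclic data flow or its scheduling; it only illustrates that the link structure is loop-invariant and that each intermediate result replaces the previous one. The class name `BulkIterationSketch`, the method `pageRank`, and the assumption that every vertex has at least one outgoing link are all choices made for this example.

```java
import java.util.Arrays;

class BulkIterationSketch {
    /**
     * Runs a fixed number of PageRank steps.
     * outLinks[v] lists the targets of vertex v; every vertex is assumed to have at least one.
     */
    static double[] pageRank(int[][] outLinks, int iterations, double damping) {
        int n = outLinks.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);                       // initial result
        for (int i = 0; i < iterations; i++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n);
            for (int v = 0; v < n; v++) {
                for (int target : outLinks[v]) {          // outLinks is loop-invariant data
                    next[target] += damping * rank[v] / outLinks[v].length;
                }
            }
            rank = next;                                  // intermediate result replaces the previous one
        }
        return rank;
    }

    public static void main(String[] args) {
        int[][] links = { {1}, {2}, {0, 1} };             // 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1
        System.out.println(Arrays.toString(pageRank(links, 30, 0.85)));
    }
}
```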


Delta Iterations

• Delta iteration computes

– Delta update of solution set

– Work set for next iteration

• Work set drives computations of next iteration

– Workload of later iterations significantly reduced

– Fast convergence

• Applicable to certain problem domains

– Graph processing

[Chart: # of elements updated (0 to 45,000,000) per iteration (1 to 61)]
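The interplay of solution set and work set can be sketched in plain Java using connected components by minimum-label propagation, a common example for this pattern. The code is an illustration, not Flink's `iterateDelta`: the class `DeltaIterationSketch` and the method `components` are hypothetical, the graph is assumed to be undirected (each edge listed in both directions), and every neighbor is assumed to appear as a key of the adjacency map.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class DeltaIterationSketch {
    /** Assigns every vertex the smallest vertex id of its connected component. */
    static Map<Integer, Integer> components(Map<Integer, List<Integer>> adjacency) {
        Map<Integer, Integer> solution = new HashMap<>();        // solution set: vertex -> component id
        for (Integer vertex : adjacency.keySet()) {
            solution.put(vertex, vertex);
        }
        Set<Integer> workset = new HashSet<>(adjacency.keySet()); // initially every vertex is active
        while (!workset.isEmpty()) {
            Set<Integer> next = new HashSet<>();
            for (Integer vertex : workset) {
                int label = solution.get(vertex);
                for (Integer neighbor : adjacency.get(vertex)) {
                    if (label < solution.get(neighbor)) {
                        solution.put(neighbor, label);            // delta update of the solution set
                        next.add(neighbor);                       // only changed vertices stay active
                    }
                }
            }
            workset = next;                                       // work set for the next iteration
        }
        return solution;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> graph = new HashMap<>();
        graph.put(1, Arrays.asList(2));
        graph.put(2, Arrays.asList(1, 3));
        graph.put(3, Arrays.asList(2));
        graph.put(4, Arrays.asList(5));
        graph.put(5, Arrays.asList(4));
        System.out.println(new TreeMap<>(components(graph)));
        // prints {1=1, 2=1, 3=1, 4=4, 5=4}
    }
}
```

Vertices whose label no longer changes drop out of the work set, so later passes touch fewer and fewer elements.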


Iteration Performance

PageRank on Twitter Follower Graph

30 Iterations

61 Iterations (Convergence)


WHAT IS COMING NEXT?

Roadmap


Flink’s Roadmap

Mission: Unified stream and batch processing

• Exactly-once streaming semantics with flexible state checkpointing

• Extending the ML library

• Extending the graph library

• Interactive programs

• Integration with Apache Zeppelin (incubating)

• SQL on top of expression language

• And much more…


tl;dr: What is worth remembering?

• Flink is a general-purpose analytics system

• Unifies streaming and batch processing

• Expressive high-level APIs

• Robust and fast execution engine


I ♥ Flink, do you? ;-)

If you find this exciting, get involved and start a discussion on Flink's mailing list, or stay tuned by subscribing to [email protected] or following @ApacheFlink on Twitter.


BACKUP


Data Flow Optimizer

• Database-style optimizations for parallel data flows

• Optimizes all batch programs

• Optimizations

– Task chaining

– Join algorithms

– Re-use partitioning and sorting for later operations

– Caching for iterations


Data Flow Optimizer

val orders = …
val lineitems = …

// keep orders shipped after the given date with a priority above 2
val filteredOrders = orders
  .filter(o => dataFormat.parse(o.shipDate).after(date))
  .filter(o => o.shipPrio > 2)

// join the remaining orders with their line items
val lineitemsOfOrders = filteredOrders
  .join(lineitems)
  .where("orderId").equalTo("orderId")
  .apply((o,l) => new SelectedItem(o.orderDate, l.extdPrice))

// sum the extended price per order date
val priceSums = lineitemsOfOrders
  .groupBy("orderDate")
  .sum("l.extdPrice");


Data Flow Optimizer

[Diagram: two candidate execution plans. Both read orders.tbl through a Filter and lineitem.tbl directly, combine them in a hybrid hash join (buildHT / probe), and finish with a Reduce using sort[0,1]. One plan ships the join inputs with broadcast and forward and adds a Combine before the Reduce. The other ships both join inputs with hash-part [0]. Further labels: hash-part [0,1], partial sort[0,1].]

Best plan depends on relative sizes of input files
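Why the relative input sizes matter can be seen with a deliberately simple cost comparison in plain Java. This is not Flink's cost model: the class `JoinStrategySketch`, the enum values, and the formulas (network bytes only, broadcast cost equal to the first input's size times the parallelism, repartition cost equal to the sum of both inputs) are assumptions made for illustration.

```java
class JoinStrategySketch {
    enum Strategy { BROADCAST_FIRST_INPUT, REPARTITION_BOTH_INPUTS }

    /** Picks the shipping strategy with the lower estimated network volume. */
    static Strategy choose(long firstInputBytes, long secondInputBytes, int parallelism) {
        // Broadcasting replicates the first input to every parallel worker; the second input stays local.
        long broadcastCost = firstInputBytes * parallelism;
        // Hash-partitioning ships each record of both inputs once.
        long repartitionCost = firstInputBytes + secondInputBytes;
        return broadcastCost < repartitionCost
                ? Strategy.BROADCAST_FIRST_INPUT
                : Strategy.REPARTITION_BOTH_INPUTS;
    }

    public static void main(String[] args) {
        // A tiny filtered input next to a large one: broadcasting is cheaper.
        System.out.println(choose(10L, 10_000L, 8));     // prints BROADCAST_FIRST_INPUT
        // Inputs of comparable size: partitioning both is cheaper.
        System.out.println(choose(5_000L, 10_000L, 8));  // prints REPARTITION_BOTH_INPUTS
    }
}
```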
