Apache Flink Berlin Meetup May 2016

Stephan Ewen (@stephanewen)

What's coming up in Apache Flink?

Quick teaser of some of the upcoming features

Disclaimer

This list of threads is incomplete

This is not an Apache Flink roadmap!

What's coming up?

Areas: APIs, Integration, Operations

• Stream SQL
• Queryable State
• Cassandra
• Deployment and Management (YARN, Mesos, Docker, …)
• Dynamically Scaling Streaming Programs
• Metrics
• File System Sources
• Side Inputs: Joining streams and static data
• BigTop Integration
• Kinesis
• State Scalability

Stream SQL

Two definitions of Stream SQL

1. Run a continuous SQL query that reads an infinite stream and continuously produces results. That's Flink's Stream SQL.

2. Continuously ingest streams into a warehouse. Query the real-time data in the warehouse. Good use case for Kafka + Flink + Druid.

An Example

val execEnv = StreamExecutionEnvironment.getExecutionEnvironment
val tableEnv = TableEnvironment.getTableEnvironment(execEnv)

// define a JSON encoded Kafka topic as external table
val sensorSource = new KafkaJsonSource[(String, Long, Double)](
  "sensorTopic", kafkaProps, ("location", "time", "tempF"))

// register external table
tableEnv.registerTableSource("sensorData", sensorSource)

// define query on external table
val roomSensors: Table = tableEnv.sql("""
  SELECT STREAM time, location AS room, (tempF - 32) * 0.556 AS tempC
  FROM sensorData
  WHERE location LIKE 'room%'
  """)

// write the table back to Kafka as JSON
roomSensors.toSink(new KafkaJsonSink(...))
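To make the per-record meaning of the query concrete, here is a minimal sketch in plain Scala with no Flink dependency. It is not how Flink executes the query. The names RoomSensorSketch, Reading, RoomReading and roomSensors are hypothetical, an Iterator stands in for the unbounded stream, and LIKE 'room%' is modeled as a simple prefix check.

```scala
object RoomSensorSketch {
  // Hypothetical record types mirroring the (location, time, tempF) fields of the example.
  final case class Reading(location: String, time: Long, tempF: Double)
  final case class RoomReading(time: Long, room: String, tempC: Double)

  // Each input record is filtered and projected independently, so results
  // can be emitted continuously as records arrive.
  def roomSensors(readings: Iterator[Reading]): Iterator[RoomReading] =
    readings
      .filter(_.location.startsWith("room"))      // WHERE location LIKE 'room%'
      .map { r =>
        // SELECT time, location AS room, (tempF - 32) * 0.556 AS tempC
        RoomReading(r.time, r.location, (r.tempF - 32) * 0.556)
      }
}
```

Because the Iterator is evaluated lazily, nothing here requires the input to be finite, which mirrors the "continuous query" definition above.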

The Implementation

[Figure: implementation in Flink 1.0 compared with Flink 1.1+]

Queryable State

Sharing State with Applications

• Access to the stream aggregates with a latency bound
• Write them to a key/value store (often the biggest bottleneck)

Queryable State

Optional, and only at the end of windows

Send queries to Flink's internal state

What does it bring?
• Fewer moving parts in the infrastructure
• Performance!

From an extension of Yahoo!'s streaming benchmark:
• With key/value store: 280,000 events/s
• Queryable state: 15,000,000 events/s

What's the secret?
• No synchronous distributed communication
• Persistence via Flink's checkpoint (async snapshots)

Dynamic Scaling

Adjust parallelism of Streaming Programs

[Figure: initial configuration, scale out (for load), scale in (save resources)]

Adjusting parallelism without (significantly) interrupting the program

Initial version:
• Savepoint -> stop -> restart-with-different-parallelism

Stateless operators: Trivial

Stateful operators: Repartition state
• State reorganized by key for key/value state and windows

Consistent Hashing

Redistribution via Key Groups

Flink 1.0: Hash keys into parallel partitions. The finest granularity is a partition.

Flink 1.1: Hash keys into KeyGroups. Assign KeyGroups to parallel partitions. A change of parallelism means a change of the assignment of KeyGroups to parallel partitions.
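The two-step mapping can be sketched in plain Scala. This is a minimal illustration under stated assumptions, not Flink's actual implementation: the object KeyGroupSketch, the fixed count of 128 key groups, the use of hashCode as the hash function, and the contiguous-range assignment are all choices made for this example.

```scala
object KeyGroupSketch {
  // Assumption: the number of key groups is fixed up front and never changes.
  val NumKeyGroups = 128

  // Step 1: hash a key into a key group. This does not depend on the parallelism,
  // so a key stays in the same key group across rescaling.
  def keyGroupFor(key: Any): Int =
    java.lang.Math.floorMod(key.hashCode, NumKeyGroups)

  // Step 2: assign a key group to a parallel partition as part of a contiguous range.
  // Only this assignment changes when the parallelism changes.
  def partitionFor(keyGroup: Int, parallelism: Int): Int =
    keyGroup * parallelism / NumKeyGroups

  // Combined routing of a key to a partition for a given parallelism.
  def partitionForKey(key: Any, parallelism: Int): Int =
    partitionFor(keyGroupFor(key), parallelism)

  // All key groups owned by one partition at a given parallelism.
  def keyGroupsOf(partition: Int, parallelism: Int): Seq[Int] =
    (0 until NumKeyGroups).filter(kg => partitionFor(kg, parallelism) == partition)
}
```

With 128 key groups and parallelism 4, partition 0 owns key groups 0 to 31. After rescaling to parallelism 3, it owns key groups 0 to 42. State can therefore be moved as whole key groups rather than being rehashed key by key.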

Flink Forward 2016, Berlin

Submission deadline: June 30, 2016
Early bird deadline: July 15, 2016

www.flink-forward.org

We are hiring!
data-artisans.com/careers
