Ufuk Celebi @iamuce The Stream Processor as a Database

The Stream Processor as a Database - Francisco... § Streaming applications are often not bound by the stream processor itself. Cross-system interaction is frequently the biggest bottleneck.

Jul 12, 2020

The Stream Processor as a Database

Ufuk Celebi (@iamuce)

The (Classic) Use Case

Realtime Counts and Aggregates

(Real-)Time Series Statistics

[Figure labels: Stream of Events; Real-time Statistics]

The Architecture

[Figure labels: collect; message queue; analyze; serve & store]

The Flink Job

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09

case class Impressions(id: String, impressions: Long)

// Read the raw events from Kafka
val events: DataStream[Event] = env.addSource(new FlinkKafkaConsumer09(…))

// Keep impression events only and project them to (id, impressions)
val impressions: DataStream[Impressions] = events.filter(evt => evt.isImpression).map(evt => Impressions(evt.id, evt.numImpressions))

// Sum the impressions per id over one-hour windows
val counts: DataStream[Impressions] = impressions.keyBy("id").timeWindow(Time.hours(1)).sum("impressions")
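To make the semantics of the job above concrete, here is a minimal plain-Scala sketch of what a keyed one-hour tumbling-window sum computes. It does not use Flink; the `timestampMs` field, the `HourlyCounts` object, and the alignment of windows to multiples of one hour are assumptions made for illustration only.

```scala
// Plain-Scala model of keyBy("id").timeWindow(Time.hours(1)).sum("impressions").
// No Flink dependency; the timestamp field and all names here are assumptions.
object HourlyCounts {
  final case class Impression(id: String, timestampMs: Long, impressions: Long)

  val HourMs: Long = 60L * 60L * 1000L

  // Start of the one-hour tumbling window that contains the timestamp
  // (non-negative timestamps assumed).
  def windowStart(timestampMs: Long): Long = timestampMs - (timestampMs % HourMs)

  // Sum the impressions per (id, window start).
  def hourlyCounts(events: Seq[Impression]): Map[(String, Long), Long] =
    events
      .groupBy(e => (e.id, windowStart(e.timestampMs)))
      .map { case (key, group) => key -> group.map(_.impressions).sum }
}
```

Unlike this batch-style model, the streaming job keeps the running sum per key and window as operator state and updates it as each event arrives.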

The Flink Job

[Figure: two parallel instances of the pipeline, each with the labels Kafka Source; filter(); map(); keyBy(); window()/sum() with State; Sink]

Putting it all together

Periodically (every second) flush new aggregates to Redis
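A rough sketch of the periodic flush described above, assuming an aggregator that tracks which keys changed since the last flush. `KeyValueStore` is a hypothetical stand-in for a Redis client, `InMemoryStore` exists only so the sketch runs on its own, and none of these names come from the talk.

```scala
import scala.collection.mutable

object PeriodicFlush {
  // Stand-in for an external key/value store such as Redis (hypothetical interface).
  trait KeyValueStore { def put(key: String, value: Long): Unit }

  // In-memory implementation that also counts how many writes it received.
  final class InMemoryStore extends KeyValueStore {
    val data = mutable.Map.empty[String, Long]
    var writes = 0
    def put(key: String, value: Long): Unit = { data(key) = value; writes += 1 }
  }

  final class Aggregator(store: KeyValueStore) {
    private val counts = mutable.Map.empty[String, Long]
    private val dirty = mutable.Set.empty[String]

    // Update the local aggregate and remember that the key changed.
    def add(id: String, impressions: Long): Unit = {
      counts(id) = counts.getOrElse(id, 0L) + impressions
      dirty += id
    }

    // Meant to be called by a timer (every second on the slide):
    // one external write per key that changed since the last flush.
    def flush(): Unit = {
      dirty.foreach(id => store.put(id, counts(id)))
      dirty.clear()
    }
  }
}
```

In this model the number of external writes per flush equals the number of keys that changed, so the write load grows with both the key cardinality and the flush frequency.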

The Bottleneck

Writes to the key/value store take too long

Queryable State

Writes to the key/value store: optional, and only at the end of windows

Queryable State: Application View

[Figure labels: Application; Query Service; Database; realtime results (current time windows); older results (past time windows)]
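The application view above can be sketched as a small routing layer: queries for the current time window go to the stream processor's state, queries for past windows go to the database. This is a sketch under assumptions; `Service`, `liveState`, and `database` are hypothetical names, and both lookups are plain functions rather than real clients.

```scala
object QueryService {
  // Both lookups are hypothetical stand-ins: `liveState` for the stream
  // processor's queryable state, `database` for the store of finished windows.
  final class Service(
      currentWindowStart: Long,
      liveState: String => Option[Long],
      database: (String, Long) => Option[Long]) {

    // Route by window: the current window is served from live state,
    // any older window from the database.
    def query(id: String, windowStart: Long): Option[Long] =
      if (windowStart >= currentWindowStart) liveState(id)
      else database(id, windowStart)
  }
}
```

For example, with a current window starting at 3600000, `query("a", 3600000L)` would be answered from live state and `query("a", 0L)` from the database.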

Queryable State Enablers

§ Flink has state as a first class citizen
§ State is fault tolerant (exactly once semantics)
§ State is partitioned (sharded) together with the operators that create/update it
§ State is continuous (not mini batched)
§ State is scalable
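As an illustration of state that is partitioned together with the operators that update it, the sketch below routes each key to one of several local maps. The hash-modulo routing is a simplifying assumption and not Flink's actual key assignment; `PartitionedState`, `Operator`, and `partitionFor` are hypothetical names.

```scala
import scala.collection.mutable

object PartitionedState {
  // Simplified routing: hash of the key modulo the operator parallelism.
  // Assumption for illustration; Flink's real key assignment differs.
  def partitionFor(key: String, parallelism: Int): Int =
    java.lang.Math.floorMod(key.hashCode, parallelism)

  // One local map per parallel operator instance; nothing is shared between shards.
  final class Operator(parallelism: Int) {
    private val shards = Array.fill(parallelism)(mutable.Map.empty[String, Long])

    // An update touches only the shard that owns the key.
    def add(key: String, value: Long): Unit = {
      val shard = shards(partitionFor(key, parallelism))
      shard(key) = shard.getOrElse(key, 0L) + value
    }

    // A read uses the same routing, so it finds the shard that was written.
    def get(key: String): Option[Long] =
      shards(partitionFor(key, parallelism)).get(key)

    def shardSizes: Seq[Int] = shards.map(_.size).toSeq
  }
}
```

Because reads and writes use the same routing function, a query for a key only ever needs to contact the one instance that owns it.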

State in Flink

[Figure labels: Source / filter() / map(); window()/sum(); State index (e.g., RocksDB)]

Events are persistent and ordered (per partition/key) in the message queue (e.g., Apache Kafka)

Events flow without replication or synchronous writes

State in Flink

[Figure sequence over the operators Source / filter() / map() and window()/sum()]

§ Trigger checkpoint: inject checkpoint barrier
§ Take state snapshot: trigger state copy-on-write
§ Persist state snapshots: durably persist snapshots asynchronously; processing pipeline continues
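The snapshot steps above can be illustrated with a small copy-on-write model: taking a snapshot only grabs a reference to an immutable map, and the durable write happens asynchronously while updates continue. This is a conceptual sketch, not how Flink's state backends are implemented; `CountingOperator`, `persistAsync`, and the `persist` callback are hypothetical.

```scala
import scala.concurrent.{ExecutionContext, Future}

object SnapshotSketch {
  final class CountingOperator {
    // Immutable map: an update builds a new map and leaves older references untouched.
    private var state: Map[String, Long] = Map.empty

    def add(key: String, value: Long): Unit =
      state = state.updated(key, state.getOrElse(key, 0L) + value)

    def current: Map[String, Long] = state

    // Synchronous part of a checkpoint: grab a reference, copy nothing.
    def snapshot(): Map[String, Long] = state
  }

  // Asynchronous part: write the frozen snapshot somewhere durable.
  // `persist` is a hypothetical callback standing in for a file system or object store.
  def persistAsync(snapshot: Map[String, Long])(persist: Map[String, Long] => Unit)(
      implicit ec: ExecutionContext): Future[Unit] =
    Future(persist(snapshot))
}
```

Updates applied after `snapshot()` returns are not visible in the snapshot, which is what lets processing continue while the snapshot is being persisted.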

Queryable State: Implementation

[Figure labels: Query Client; Job Manager; Execution Graph; State Location Server; Task Manager (shown twice); State Registry; window()/sum(); local state; deploy; status; register]

Query: /job/state-name/key

(1) Get location of "key-partition" of "job"
(2) Look up location
(3) Respond location
(4) Query state-name and key
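The four query steps above can be modelled in plain Scala as follows. This is a sketch under assumptions: the class names mirror the figure labels but are not Flink classes, the key-to-location mapping is a simple hash modulo, and network calls are replaced by direct method calls.

```scala
import scala.collection.mutable

object QueryableStateLookup {
  // Task manager side: a registry of locally held state, keyed by (job, state name).
  final class TaskManager(val address: String) {
    private val registry = mutable.Map.empty[(String, String), Map[String, Long]]

    def register(job: String, stateName: String, localState: Map[String, Long]): Unit =
      registry((job, stateName)) = localState

    def query(job: String, stateName: String, key: String): Option[Long] =
      registry.get((job, stateName)).flatMap(_.get(key))
  }

  // Job manager side: knows which task manager holds the partition of a key.
  // The hash-modulo mapping is an assumption made for illustration.
  final class StateLocationServer(taskManagers: Vector[TaskManager]) {
    def locate(job: String, stateName: String, key: String): TaskManager =
      taskManagers(java.lang.Math.floorMod(key.hashCode, taskManagers.size))
  }

  final class QueryClient(locations: StateLocationServer) {
    // Steps (1)-(3): resolve the location; step (4): query that task manager.
    def query(path: String): Option[Long] =
      path.split('/').filter(_.nonEmpty) match {
        case Array(job, stateName, key) =>
          locations.locate(job, stateName, key).query(job, stateName, key)
        case _ => None
      }
  }
}
```

A client would call, for example, `client.query("/job/counts/a")`, where `counts` is a hypothetical state name; malformed paths and unknown keys yield `None` in this model.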

Queryable State Performance

Conclusion

Takeaways

§ Streaming applications are often not bound by the stream processor itself. Cross-system interaction is frequently the biggest bottleneck
§ Queryable state mitigates a big bottleneck: communication with external key/value stores to publish realtime results
§ Apache Flink's sophisticated support for state makes this possible

Takeaways: Performance of Queryable State

§ Data persistence is fast with logs
  • Append only, and streaming replication
§ Computed state is fast with local data structures and no synchronous replication
§ Flink's checkpoint method makes computed state persistent with low overhead

Questions?

§ eMail: [email protected]
§ Twitter: @iamuce
§ Code/Demo: https://github.com/dataArtisans/flink-queryable_state_demo

Appendix

Flink Runtime + APIs

[Figure labels: Building Blocks: Streams, Time, State; Runtime: Distributed Streaming Data Flow; DataStream API; ProcessFunction API; Table API & Stream SQL]

Apache Flink Architecture Review