Scalable Stream Processing - Spark Streaming and Beam

Amir H. Payberah, [email protected]

26/09/2019

The Course Web Page

https://id2221kth.github.io

Where Are We?

Stream Processing Systems Design Issues

I Continuous vs. micro-batch processing

I Record-at-a-Time vs. declarative APIs

Spark Streaming

Contribution

I Design issues
• Continuous vs. micro-batch processing
• Record-at-a-Time vs. declarative APIs

Spark Streaming

I Run a streaming computation as a series of very small, deterministic batch jobs.

• Chops up the live stream into batches of X seconds.

• Treats each batch as RDDs and processes them using RDD operations.

• Discretized Stream Processing (DStream)
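
A conceptual sketch of this micro-batch model (pseudo-code, not Spark's actual scheduler; collectRecords and streamIsActive are hypothetical placeholders for the receiver and the job lifecycle):

// Every batch interval, the records received so far become one RDD,
// which is then processed with ordinary (batch) RDD operations.
while (streamIsActive) {
  val batch: RDD[String] = collectRecords(batchInterval) // hypothetical: records of the last X seconds
  val counts = batch.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
  counts.collect().foreach(println)                      // or any other output action
}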

DStream (1/2)

I DStream: sequence of RDDs representing a stream of data.

DStream (2/2)

I Any operation applied on a DStream translates to operations on the underlying RDDs.
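
For instance, flatMap on a DStream is applied to the RDD of every micro-batch; the transform operation makes this explicit by exposing each batch's RDD directly (a small sketch, assuming lines is a DStream[String]):

// The DStream-level operation ...
val words = lines.flatMap(_.split(" "))
// ... is equivalent to applying the same RDD operation to every micro-batch.
val wordsViaTransform = lines.transform(rdd => rdd.flatMap(_.split(" ")))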

StreamingContext

I StreamingContext is the main entry point of all Spark Streaming functionality.

val conf = new SparkConf().setAppName(appName).setMaster(master)

val ssc = new StreamingContext(conf, Seconds(1))

I The second parameter, Seconds(1), represents the time interval at which streaming data will be divided into batches.
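
A StreamingContext can also be created from an existing SparkContext instead of a SparkConf (a sketch; sc is assumed to be an already running SparkContext, and the 5-second interval is only an example):

import org.apache.spark.streaming._

val ssc = new StreamingContext(sc, Seconds(5)) // reuse an existing SparkContext, 5-second batches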

Input Operations

I Every input DStream is associated with a Receiver object.
• It receives the data from a source and stores it in Spark's memory for processing.

I Basic sources are directly available in the StreamingContext API, e.g., file systems and socket connections.

I Advanced sources, e.g., Kafka, Flume, Kinesis, Twitter.

Input Operations - Basic Sources

I Socket connection
• Creates a DStream from text data received over a TCP socket connection.

ssc.socketTextStream("localhost", 9999)

I File stream
• Reads data from files.

streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory)

streamingContext.textFileStream(dataDirectory)

Input Operations - Advanced Sources

I Connectors with external sources

I Twitter, Kafka, Flume, Kinesis, ...

TwitterUtils.createStream(ssc, None)

KafkaUtils.createStream(ssc, [ZK quorum], [consumer group id], [number of partitions])

Transformations (1/2)

I Transformations on DStreams are still lazy!

I DStreams support many of the transformations available on normal Spark RDDs.

I Computation is kicked off explicitly by a call to the start() method.

Transformations (2/2)

I map: a new DStream by passing each element of the source DStream through a given function.

I reduce: a new DStream of single-element RDDs by aggregating the elements in each RDD using a given function.

I reduceByKey: a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function (see the sketch below).
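
A short sketch of these transformations, assuming lines is a DStream[String] as in the word-count example that follows:

val words = lines.flatMap(_.split(" "))               // split each line into words
val pairs = words.map(word => (word, 1))              // map: DStream[(String, Int)]
val wordCounts = pairs.reduceByKey(_ + _)             // reduceByKey: per-word count within each batch
val wordsPerBatch = words.map(_ => 1L).reduce(_ + _)  // reduce: one total per batch (single-element RDDs)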

Example - Word Count (1/6)

I First we create a StreamingContext.

import org.apache.spark._

import org.apache.spark.streaming._

// Create a local StreamingContext with two working threads and batch interval of 1 second.

val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")

val ssc = new StreamingContext(conf, Seconds(1))

Example - Word Count (2/6)

I Create a DStream that represents streaming data from a TCP source.

I Specified as hostname (e.g., localhost) and port (e.g., 9999).

val lines = ssc.socketTextStream("localhost", 9999)

Example - Word Count (3/6)

I Use flatMap on the stream to split the text of each record into words.

I It creates a new DStream.

val words = lines.flatMap(_.split(" "))

Example - Word Count (4/6)

I Map the words DStream to a DStream of (word, 1).

I Get the frequency of words in each batch of data.

I Finally, print the result.

val pairs = words.map(word => (word, 1))

val wordCounts = pairs.reduceByKey(_ + _)

wordCounts.print()

Example - Word Count (5/6)

I Start the computation and wait for it to terminate.

// Start the computation

ssc.start()

// Wait for the computation to terminate

ssc.awaitTermination()

Example - Word Count (6/6)

val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")

val ssc = new StreamingContext(conf, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)

val words = lines.flatMap(_.split(" "))

val pairs = words.map(word => (word, 1))

val wordCounts = pairs.reduceByKey(_ + _)

wordCounts.print()

ssc.start()

ssc.awaitTermination()

Window Operations (1/2)

I Spark provides a set of transformations that apply over a sliding window of data.

I A window is defined by two parameters: window length and slide interval.

I A tumbling window effect can be achieved by making the slide interval equal to the window length (see the sketch below).
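
A small sketch of the two parameters, assuming pairs is a DStream of (word, 1) pairs as in the earlier word-count example:

val sliding = pairs.window(Seconds(30), Seconds(10))   // 30-second window, recomputed every 10 seconds
val tumbling = pairs.window(Seconds(30), Seconds(30))  // slide interval = window length: non-overlapping windows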

Window Operations (2/2)

I window(windowLength, slideInterval)
• Returns a new DStream which is computed based on windowed batches.

I reduceByWindow(func, windowLength, slideInterval)
• Returns a new single-element DStream, created by aggregating elements in the stream over a sliding interval using func.

I reduceByKeyAndWindow(func, windowLength, slideInterval)
• Called on a DStream of (K, V) pairs.
• Returns a new DStream of (K, V) pairs where the values for each key are aggregated using function func over batches in a sliding window.
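
As a small example of reduceByWindow (reduceByKeyAndWindow is used in the windowed word-count example that follows), the total number of words seen over the last 30 seconds, updated every 10 seconds, could be computed roughly like this, assuming lines is the input DStream[String]:

val words = lines.flatMap(_.split(" "))
val totalInWindow = words.map(_ => 1L).reduceByWindow(_ + _, Seconds(30), Seconds(10))
totalInWindow.print()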

Example - Word Count with Window

val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")

val ssc = new StreamingContext(conf, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)

val words = lines.flatMap(_.split(" "))

val pairs = words.map(word => (word, 1))

val windowedWordCounts = pairs.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

windowedWordCounts.print()

ssc.start()

ssc.awaitTermination()
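
A more efficient variant for large windows: reduceByKeyAndWindow also accepts an inverse reduce function, incrementally adding the batches that enter the window and subtracting those that leave it instead of recomputing the whole window; it requires checkpointing. A sketch:

ssc.checkpoint("path/to/checkpoint/dir") // required by the inverse-function variant
val incrementalCounts = pairs.reduceByKeyAndWindow(_ + _, _ - _, Seconds(30), Seconds(10))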

What about States?

I Accumulate and aggregate the results from the start of the streaming job.

I Need to check the previous state of the RDD in order to do something with the current RDD.

I Spark supports stateful streams.

Checkpointing

I It is mandatory that you provide a checkpointing directory for stateful streams.

val ssc = new StreamingContext(conf, Seconds(1))

ssc.checkpoint("path/to/persistent/storage")

Stateful Stream Operations

I mapWithState
• It is executed only on the set of keys that are available in the last micro batch.

def mapWithState[StateType, MappedType](
    spec: StateSpec[K, V, StateType, MappedType]): DStream[MappedType]

StateSpec.function(updateFunc)

val updateFunc = (batch: Time, key: String, value: Option[Int], state: State[Int]) => ...

I Define the update function (partial updates) in StateSpec.

Example - Stateful Word Count (1/4)

val ssc = new StreamingContext(conf, Seconds(1))

ssc.checkpoint(".")

val lines = ssc.socketTextStream(IP, Port)

val words = lines.flatMap(_.split(" "))

val pairs = words.map(word => (word, 1))

val updateFunc = (key: String, value: Option[Int], state: State[Int]) => {
  val newCount = value.getOrElse(0)            // count of this key in the current batch
  val oldCount = state.getOption.getOrElse(0)  // running count kept in the state
  val sum = newCount + oldCount
  state.update(sum)                            // store the new running count
  (key, sum)                                   // emit the mapped record
}

val stateWordCount = pairs.mapWithState(StateSpec.function(updateFunc))

Example - Stateful Word Count (2/4)

I The first micro batch contains a message a.

I updateFunc = (key: String, value: Option[Int], state: State[Int]) => (key, sum)

I Input: key = a, value = Some(1), state = 0

I Output: key = a, sum = 1

Example - Stateful Word Count (3/4)

I The second micro batch contains messages a and b.

I updateFunc = (key: String, value: Option[Int], state: State[Int]) => (key, sum)

I Input: key = a, value = Some(1), state = 1

I Input: key = b, value = Some(1), state = 0

I Output: key = a, sum = 2

I Output: key = b, sum = 1

Example - Stateful Word Count (4/4)

I The third micro batch contains a message b.

I updateFunc = (key: String, value: Option[Int], state: State[Int]) => (key, sum)

I Input: key = b, value = Some(1), state = 1

I Output: key = b, sum = 2

Google Dataflow and Beam

History

I Google’s Zeitgeist: tracking trends in web queries.

I Builds a historical model of each query.

I Google discontinued Zeitgeist, but most of its features can be found in Google Trends.

MillWheel Dataflow

I MillWheel is a framework for building low-latency data-processing applications.

I A dataflow graph of transformations (computations).

I Stream: unbounded data of (key, value, timestamp) records.
• Timestamp: event-time

Key Extraction Function and Computations

I Stream of (key, value, timestamp) records.

I Key extraction function: specified by the stream consumer to assign keys to records.

I Computation can only access state for the specific key.

I Multiple computations can extract different keys from the same stream.

Persistent State

I Keep the states of the computations

I Managed on a per-key basis

I Stored in Bigtable or Spanner

I Common use: aggregation, joins, ...

Delivery Guarantees

I Emitted records are checkpointed before delivery.
• The checkpoints allow fault-tolerance.

I When a delivery is ACKed, the checkpoints can be garbage collected.

I If an ACK is not received, the record can be re-sent.

I Exactly-once delivery: duplicates are discarded by MillWheel at the recipient.

What is Google Cloud Dataflow?

Google Cloud Dataflow (1/2)

I Google managed service for unified batch and stream data processing.

Google Cloud Dataflow (2/2)

I Open source Cloud Dataflow SDK

I Express your data processing pipeline using FlumeJava.

I If you run it in batch mode, it is executed on the MapReduce framework.

I If you run it in streaming mode, it is executed on the MillWheel framework.

Programming Model

I Pipeline, a directed graph of data processing transformations

I Optimized and executed as a unit

I May include multiple inputs and multiple outputs

I May encompass many logical MapReduce or MillWheel operations

Windowing and Triggering

I Windowing determines where in event time data are grouped together for processing.
• Fixed time windows (tumbling windows)
• Sliding time windows
• Session windows

I Triggering determines when in processing time the results of groupings are emitted as panes.
• Time-based triggers
• Data-driven triggers
• Composite triggers

Example (1/3)

I Batch processing

Example (2/3)

I Trigger at period (time-based triggers)

I Trigger at count (data-driven triggers)

Example (3/3)

I Fixed window, trigger at period (micro-batch)

I Fixed window, trigger at watermark (streaming)

Where is Apache Beam?

From Google Cloud Dataflow to Apache Beam

I In 2016, the Google Cloud Dataflow team announced its intention to donate the programming model and SDKs to the Apache Software Foundation.

I That resulted in the incubating project Apache Beam.

Programming Components

I Pipelines

I PCollections

I Transforms

I I/O sources and sinks

Pipelines (1/2)

I A pipeline represents a data processing job.

I Directed graph of transforms operating on data.

I A pipeline consists of two parts:
• Data (PCollection)
• Transforms applied to that data

Pipelines (2/2)

public static void main(String[] args) {

// Create a pipeline

PipelineOptions options = PipelineOptionsFactory.create();

Pipeline p = Pipeline.create(options);

p.apply(TextIO.Read.from("gs://...")) // Read input.

.apply(new CountWords()) // Do some processing.

.apply(TextIO.Write.to("gs://...")); // Write output.

// Run the pipeline.

p.run();

}

PCollections (1/2)

I A parallel collection of records

I Immutable

I Must specify bounded or unbounded

PCollections (2/2)

// Create a Java Collection, in this case a List of Strings.

static final List<String> LINES = Arrays.asList("line 1", "line 2", "line 3");

PipelineOptions options = PipelineOptionsFactory.create();

Pipeline p = Pipeline.create(options);

// Create the PCollection

PCollection<String> lines = p.apply(Create.of(LINES)).setCoder(StringUtf8Coder.of());

Transformations

I A processing operation that transforms data

I Each transform accepts one (or multiple) PCollections as input, performs an operation, and produces one (or multiple) new PCollections as output.

I Core transforms: ParDo, GroupByKey, Combine, Flatten

Transformations - ParDo

I Processes each element of a PCollection independently using a user-provided DoFn.

// The input PCollection of Strings.

PCollection<String> words = ...;

// The DoFn to perform on each element in the input PCollection.

static class ComputeWordLengthFn extends DoFn<String, Integer> { ... }

// Apply a ParDo to the PCollection "words" to compute lengths for each word.

PCollection<Integer> wordLengths = words.apply(ParDo.of(new ComputeWordLengthFn()));

Transformations - GroupByKey

I Takes a PCollection of key-value pairs and gathers up all values with the same key.

// A PCollection of key/value pairs: words and line numbers.

PCollection<KV<String, Integer>> wordsAndLines = ...;

// Apply a GroupByKey transform to the PCollection "wordsAndLines".

PCollection<KV<String, Iterable<Integer>>> groupedWords = wordsAndLines.apply(

GroupByKey.<String, Integer>create());

Transformations - Join and CoGroupByKey

I Groups together the values from multiple PCollections of key-value pairs.

// Each data set is represented by key-value pairs in separate PCollections.

// Both data sets share a common key type ("K").

PCollection<KV<K, V1>> pc1 = ...;

PCollection<KV<K, V2>> pc2 = ...;

// Create tuple tags for the value types in each collection.

final TupleTag<V1> tag1 = new TupleTag<V1>();

final TupleTag<V2> tag2 = new TupleTag<V2>();

// Merge collection values into a CoGbkResult collection.

PCollection<KV<K, CoGbkResult>> coGbkResultCollection =

KeyedPCollectionTuple.of(tag1, pc1)

.and(tag2, pc2)

.apply(CoGroupByKey.<K>create());

Example: HashTag Autocompletion (1/3)

Example: HashTag Autocompletion (2/3)

Example: HashTag Autocompletion (3/3)

Windowing (1/2)

I Fixed time windows

PCollection<String> items = ...;

PCollection<String> fixedWindowedItems = items.apply(

Window.<String>into(FixedWindows.of(Duration.standardSeconds(30))));

Windowing (2/2)

I Sliding time windows

PCollection<String> items = ...;

PCollection<String> slidingWindowedItems = items.apply(

Window.<String>into(SlidingWindows.of(Duration.standardSeconds(60))

.every(Duration.standardSeconds(30))));

Triggering

I E.g., emits results one minute after the first element in that window has been processed.

PCollection<String> items = ...;

items.apply(

Window.<String>into(FixedWindows

.of(1, TimeUnit.MINUTES))

.triggering(AfterProcessingTime.pastFirstElementInPane()

.plusDelayOf(Duration.standardMinutes(1)));

Summary

I Spark
• Mini-batch processing
• DStream: sequence of RDDs
• RDD and window operations
• Structured streaming

I Google Cloud Dataflow
• Pipeline
• PCollection: windows and triggers
• Transforms

References

I M. Zaharia et al., "Spark: The Definitive Guide", O'Reilly Media, 2018 - Chapters 20-23.

I M. Zaharia et al., "Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters", HotCloud'12.

I T. Akidau et al., "MillWheel: Fault-Tolerant Stream Processing at Internet Scale", VLDB 2013.

I T. Akidau et al., "The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing", VLDB 2015.

I The world beyond batch: Streaming 102, https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

Questions?
