Stream Processing In The Cloud Amir H. Payberah [email protected] Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) SEEP and DStream 1393/9/1 1 / 47
Page 1: Spark Stream and SEEP

Stream Processing In The Cloud

Amir H. Payberah

Amirkabir University of Technology (Tehran Polytechnic)

Amir H. Payberah (Tehran Polytechnic) SEEP and DStream 1393/9/1 1 / 47

Page 3: Spark Stream and SEEP

Motivation

I Users of big data applications expect fresh results.

I New stream processing systems are designed to scale to large numbers of cloud-hosted machines.

Page 4: Spark Stream and SEEP

Motivation

I Clouds provide virtually infinite pools of resources.

I Fast and cheap access to new machines (VMs) for operators.

I How do you decide on the optimal number of VMs?

• Over-provisioning the system is expensive.

• Too few nodes lead to poor performance.

Page 5: Spark Stream and SEEP

Challenges

I Elastic data-parallel processing

I Fault-tolerant processing

Page 6: Spark Stream and SEEP

Challenge: Elastic Data-Parallel Processing

I Typical stream processing workloads are bursty.

I High and bursty input rates → detect bottleneck + parallelize

Page 7: Spark Stream and SEEP

Challenge: Fault-Tolerant Processing

I Large-scale deployment → handle node failures.

Page 8: Spark Stream and SEEP

States in Stream Processing

I Many online applications, like machine learning algorithms, require state.

Page 9: Spark Stream and SEEP

What is State?

Page 10: Spark Stream and SEEP

State Complicates Things

I Dynamic scale out impacts state.

I Recovery from failures.

Page 12: Spark Stream and SEEP

Operators States

I Stateless operators, e.g., filter and map

I Stateful operators, e.g., join and aggregate

I Window operators use the concept of a finite window of tuples.
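
The three operator kinds above can be sketched on an in-memory sequence of tuples. This is a minimal illustration with plain Scala collections, not a streaming engine; the names `OperatorKinds` and `Tuple` and the chosen operations (doubling, per-key sum) are assumptions made for the example.

```scala
// Illustrative only: plain Scala collections stand in for a stream.
object OperatorKinds {
  // A tuple carries a key, a value, and an arrival timestamp (seconds).
  final case class Tuple(key: String, value: Int, ts: Long)

  // Stateless: each output depends only on the current tuple.
  def stateless(in: Seq[Tuple]): Seq[Tuple] =
    in.filter(_.value > 0).map(t => t.copy(value = t.value * 2))

  // Stateful: a running sum per key must be remembered across tuples.
  def stateful(in: Seq[Tuple]): Map[String, Int] =
    in.foldLeft(Map.empty[String, Int]) { (state, t) =>
      state.updated(t.key, state.getOrElse(t.key, 0) + t.value)
    }

  // Window: only tuples inside the finite window [from, until) contribute.
  def windowed(in: Seq[Tuple], from: Long, until: Long): Map[String, Int] =
    stateful(in.filter(t => t.ts >= from && t.ts < until))
}
```

The stateful and window variants are the ones whose contents must be moved or rebuilt when an operator is scaled out or recovered.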

Page 15: Spark Stream and SEEP

SEEP

Page 16: Spark Stream and SEEP

Contribution

I Build a stream processing system that scales out while remaining fault tolerant when queries contain stateful operators.

Page 17: Spark Stream and SEEP

Core Idea

I Make operator state an external entity that can be managed by the stream processing system.

I Operators have direct access to states.

I The system manages states.

Page 20: Spark Stream and SEEP

Operator State Management

I On scale out: partition operator state correctly, maintaining consistency

I On failure recovery: restore state of failed operator

I Define primitives for state management and build other mechanisms on top of them.

Page 23: Spark Stream and SEEP

State Management Primitives

I Checkpoint

• Makes state available to the system.

• Attaches the timestamp of the last processed tuple.

I Backup/Restore

• Moves a copy of state from one operator to another.

I Partition

• Splits state to scale out an operator.

Page 26: Spark Stream and SEEP

State Primitives: Checkpoint

I Checkpoint state = the processing state + the buffer state

I The routing state is not included in the state checkpoint.

• It changes only in case of scale out or recovery.

I The system executes checkpoints asynchronously and periodically.
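
A minimal sketch of what such a checkpoint could contain, assuming a toy operator that keeps a running sum per key. The classes `Checkpoint` and `SumOperator` are hypothetical and do not reflect SEEP's actual implementation; the asynchronous, periodic scheduling is left out.

```scala
// Hypothetical model of an operator checkpoint; not SEEP's actual classes.
object CheckpointSketch {
  final case class Tuple(key: String, value: Int, ts: Long)

  // A checkpoint bundles processing state and buffer state with the
  // timestamp of the last tuple reflected in the processing state.
  final case class Checkpoint(
      processing: Map[String, Int], // running sum per key
      buffer: Vector[Tuple],        // output tuples kept for downstream replay
      lastTs: Long)

  // A toy stateful operator: running sum per key.
  final class SumOperator {
    private var processing = Map.empty[String, Int]
    private var buffer = Vector.empty[Tuple]
    private var lastTs = -1L

    def process(t: Tuple): Unit = {
      val sum = processing.getOrElse(t.key, 0) + t.value
      processing = processing.updated(t.key, sum)
      buffer = buffer :+ Tuple(t.key, sum, t.ts)
      lastTs = t.ts
    }

    // Immutable collections make the snapshot safe to hand to the system
    // while processing continues; routing state is deliberately left out.
    def checkpoint(): Checkpoint = Checkpoint(processing, buffer, lastTs)
  }
}
```

Because the snapshot is immutable, tuples processed after `checkpoint()` returns do not alter it, which is the property a periodic background checkpoint relies on.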

Page 29: Spark Stream and SEEP

State Primitives: Backup and Restore (1/2)

I The operator state (i.e., the checkpoint output) is backed up to an upstream operator.

I After the operator state has been backed up, already processed tuples can be discarded from the output buffers of upstream operators.

• They are no longer required for failure recovery.

Page 31: Spark Stream and SEEP

State Primitives: Backup and Restore (2/2)

I Backed up operator state is restored to another operator to recover a failed operator or to redistribute state across partitioned operators.

I After restoring the state, the system replays unprocessed tuples in the output buffer from an upstream operator to bring the operator’s processing state up-to-date.
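
The restore-then-replay step can be sketched as follows, assuming the same toy per-key sum as processing state and assuming that "unprocessed" means "timestamp newer than the checkpoint timestamp". `RestoreSketch`, `Checkpoint`, and `restore` are illustrative names, not SEEP's API.

```scala
// Hypothetical sketch of restore + replay; names are illustrative.
object RestoreSketch {
  final case class Tuple(key: String, value: Int, ts: Long)
  final case class Checkpoint(processing: Map[String, Int], lastTs: Long)

  // Rebuild processing state: start from the backed-up checkpoint, then
  // replay only the buffered upstream tuples newer than the checkpoint.
  def restore(backup: Checkpoint, upstreamBuffer: Seq[Tuple]): Checkpoint =
    upstreamBuffer
      .filter(_.ts > backup.lastTs)
      .foldLeft(backup) { (st, t) =>
        val sum = st.processing.getOrElse(t.key, 0) + t.value
        Checkpoint(st.processing.updated(t.key, sum), t.ts)
      }
}
```

Tuples with a timestamp at or below `lastTs` are skipped during replay, which is why the upstream operator may trim them from its buffer once the backup has been stored.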

Page 33: Spark Stream and SEEP

State Primitives: Partition

I Split the state of a stateful operator across the new partitioned operators when it scales out.

I This is done by partitioning the key space of the tuples processed by the operator.

I The routing state of its upstream operators must also be updated to account for the new partitioned operators.

I The buffer state of the upstream operators is partitioned to ensure that unprocessed tuples are dispatched to the correct partition.
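
A minimal sketch of the partition primitive, assuming hash partitioning of string keys. The slides do not say how the key space is split, so the `route` function here is an assumption; any deterministic split would serve the same purpose.

```scala
// Hypothetical sketch of key-space partitioning; hash routing is assumed.
object PartitionSketch {
  final case class Tuple(key: String, value: Int, ts: Long)

  // Routing function: which partition owns a key.
  def route(key: String, partitions: Int): Int =
    Math.floorMod(key.hashCode, partitions)

  // Split processing state so each new partition gets exactly its keys.
  def splitState(state: Map[String, Int], partitions: Int): Vector[Map[String, Int]] =
    Vector.tabulate(partitions) { p =>
      state.filter { case (k, _) => route(k, partitions) == p }
    }

  // Split the upstream buffer with the same routing function so replayed
  // tuples reach the partition that holds their key's state.
  def splitBuffer(buffer: Seq[Tuple], partitions: Int): Vector[Seq[Tuple]] =
    Vector.tabulate(partitions) { p =>
      buffer.filter(t => route(t.key, partitions) == p)
    }
}
```

Using one routing function for state, buffer, and upstream routing is what keeps a key's tuples and its state on the same partition.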

Page 37: Spark Stream and SEEP

Scale Out

I To scale out queries at runtime, the system partitions operators on-demand in response to bottleneck operators.

I The load of the bottlenecked operator is shared among a set of new partitioned operators.

Page 38: Spark Stream and SEEP

Fault-Tolerance

I Overload and failure are handled in the same fashion.

I Operator recovery becomes a special case of scale out, in which a failed operator is scaled out.

Page 39: Spark Stream and SEEP

Fault-Tolerant Scale Out Algorithm

I Two versions of an operator’s state can be partitioned for scale out:

• The current state

• The most recent state checkpoint

I In SEEP, the system partitions the most recent state checkpoint.

I Its benefits:

• Avoids adding further load to the already overloaded operator, which would otherwise be asked to checkpoint or partition its own state.

• Makes the scale out process itself fault-tolerant.
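
The idea of scaling out from the backed-up checkpoint, rather than from the live operator, can be sketched as below. This is an illustration under assumptions (hash routing, per-key sums, timestamps deciding which tuples are replayed), not SEEP's algorithm as implemented; all names are hypothetical.

```scala
// Hypothetical sketch: scale out from the most recent backed-up checkpoint.
object ScaleOutSketch {
  final case class Tuple(key: String, value: Int, ts: Long)
  final case class Checkpoint(processing: Map[String, Int], lastTs: Long)

  def route(key: String, n: Int): Int = Math.floorMod(key.hashCode, n)

  // 1) Partition the checkpoint by key, 2) replay newer buffered tuples
  //    into the partition that owns their key. The overloaded (or failed)
  //    operator itself is never asked to do any work.
  def scaleOut(backup: Checkpoint, upstreamBuffer: Seq[Tuple], n: Int): Vector[Checkpoint] =
    Vector.tabulate(n) { p =>
      val owned = backup.processing.filter { case (k, _) => route(k, n) == p }
      upstreamBuffer
        .filter(t => t.ts > backup.lastTs && route(t.key, n) == p)
        .foldLeft(Checkpoint(owned, backup.lastTs)) { (st, t) =>
          val sum = st.processing.getOrElse(t.key, 0) + t.value
          Checkpoint(st.processing.updated(t.key, sum), t.ts)
        }
    }
}
```

With `n = 1` the same function simply rebuilds one replacement operator, which mirrors the earlier point that recovery is a special case of scale out.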

Page 42: Spark Stream and SEEP

Spark Stream

Page 43: Spark Stream and SEEP

Existing Streaming Systems (1/2)

I Record-at-a-time processing model:

• Each node has mutable state.

• For each record, a node updates its state and sends new records.

• State is lost if node dies.

Page 44: Spark Stream and SEEP

Existing Streaming Systems (2/2)

I Fault tolerance via replication or upstream backup.

• Replication: fast recovery, but 2x hardware cost.

• Upstream backup: only one standby is needed, but recovery is slow.

Page 46: Spark Stream and SEEP

Observation

I Batch processing models for clusters provide fault tolerance efficiently.

I Divide job into deterministic tasks.

I Rerun failed/slow tasks in parallel on other nodes.

Page 47: Spark Stream and SEEP

Core Idea

I Run a streaming computation as a series of very small and deterministic batch jobs.

Page 48: Spark Stream and SEEP

Challenges

I Latency (interval granularity)

• Traditional batch systems replicate state to on-disk storage, which is slow.

I Recovering quickly from faults and stragglers

Page 49: Spark Stream and SEEP

Proposed Solution

I Latency (interval granularity)

• Resilient Distributed Dataset (RDD)

• Keep data in memory

• No replication

I Recovering quickly from faults and stragglers

• Storing the lineage graph

• Using the determinism of D-Streams

• Parallel recovery of a lost node’s state

Page 50: Spark Stream and SEEP

Discretized Stream Processing (D-Stream)

I Run a streaming computation as a series of very small, deterministic batch jobs.

• Chop up the live stream into batches of X seconds.

• Spark treats each batch of data as an RDD and processes it using RDD operations.

• Finally, the processed results of the RDD operations are returned in batches.
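
The discretization step can be illustrated without Spark: group timestamped records by batch interval and run the same deterministic job on each group. This is a plain-Scala sketch; `DiscretizeSketch`, `Record`, and the word-count job are assumptions chosen for the example.

```scala
// Illustrative only: models "chop the stream into batches" with collections.
object DiscretizeSketch {
  final case class Record(ts: Long, word: String) // ts in milliseconds, >= 0

  // Assign every record to the batch interval it arrived in, then run the
  // same deterministic batch job (here: word count) on each batch.
  def wordCountsPerBatch(records: Seq[Record], batchMs: Long): Seq[(Long, Map[String, Int])] =
    records
      .groupBy(r => r.ts / batchMs) // batch index
      .toSeq
      .sortBy(_._1)
      .map { case (batch, rs) =>
        batch -> rs.groupBy(_.word).map { case (w, ws) => w -> ws.size }
      }
}
```

Since each batch result depends only on that batch's input, a lost result can be recomputed by rerunning the job on the same input.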

Page 54: Spark Stream and SEEP

D-Stream API (1/4)

I DStream: a sequence of RDDs representing a stream of data.

• Input sources: TCP sockets, Twitter, HDFS, Kafka, ...

I Initializing Spark streaming

val ssc = new StreamingContext(master, appName, batchDuration, [sparkHome], [jars])

Page 56: Spark Stream and SEEP

D-Stream API (2/4)

I Transformations: modify data from one DStream to create a new DStream.

• Standard RDD operations (stateless/stateful operations): map, join, ...

• Window operations: group all the records from a sliding window of the past time intervals into one RDD: window, reduceByKeyAndWindow, ...

Window length: the duration of the window.

Slide interval: the interval at which the operation is performed.
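
How window length and slide interval interact can be shown on a vector of batches, with both parameters measured in batch intervals. This is a plain-Scala sketch rather than the Spark API, and it emits only full windows, which is a simplification; `WindowSketch` and `windows` are hypothetical names.

```scala
// Illustrative only: sliding windows over already-discretized batches.
object WindowSketch {
  // batches(i) holds the records of batch interval i.
  // windowLen and slide are counted in batch intervals (both >= 1).
  def windows[A](batches: Vector[Seq[A]], windowLen: Int, slide: Int): Vector[Seq[A]] =
    (windowLen to batches.length by slide).toVector.map { end =>
      // One window = the last `windowLen` batches ending at `end`.
      batches.slice(end - windowLen, end).flatten
    }
}
```

With a slide smaller than the window length, consecutive windows overlap and share batches; with the two equal, the windows tile the stream without overlap.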

Page 58: Spark Stream and SEEP

D-Stream API (3/4)

I Output operations: send data to an external entity.

• saveAsHadoopFiles, foreach, print, ...

I Attaching input sources

ssc.textFileStream(directory)

ssc.socketTextStream(hostname, port)

Page 60: Spark Stream and SEEP

D-Stream API (4/4)

I Stream + Batch: the transform operation can be used to apply any RDD operation that is not exposed in the DStream API.

val spamInfoRDD = sparkContext.hadoopFile(...)

// join data stream with spam information to do data cleaning

val cleanedDStream = inputDStream.transform(_.join(spamInfoRDD).filter(...))

I Stream + Interactive: interactive queries on stream state from the Spark interpreter

freqs.slice("21:00", "21:05").topK(10)

I Starting/stopping the streaming computation

ssc.start()

ssc.stop()

ssc.awaitTermination()

Page 63: Spark Stream and SEEP

Fault Tolerance

I Spark remembers the sequence of operations that creates each RDD from the original fault-tolerant input data (the lineage graph).

I Batches of input data are replicated in the memory of multiple worker nodes.

I Data lost due to a worker failure can be recomputed from the input data.
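
Lineage-based recovery can be sketched as a dataset that records how it was derived instead of only holding its contents. This is a toy model in plain Scala, not Spark's RDD implementation; `Dataset`, `Input`, `Mapped`, and `Filtered` are hypothetical names.

```scala
// Toy model of a lineage graph; not Spark's RDD classes.
object LineageSketch {
  // A dataset is described by how to derive it, not only by its contents.
  sealed trait Dataset { def compute(): Seq[Int] }

  // The replicated input is the only data that must survive a failure.
  final case class Input(replicated: Seq[Int]) extends Dataset {
    def compute(): Seq[Int] = replicated
  }
  final case class Mapped(parent: Dataset, f: Int => Int) extends Dataset {
    def compute(): Seq[Int] = parent.compute().map(f)
  }
  final case class Filtered(parent: Dataset, p: Int => Boolean) extends Dataset {
    def compute(): Seq[Int] = parent.compute().filter(p)
  }
}
```

If a derived result is lost, calling `compute()` again walks the recorded operations back to the input and reproduces the same values, provided the operations are deterministic.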

Page 64: Spark Stream and SEEP

Example 1 (1/3)

I Get hash-tags from Twitter.

val ssc = new StreamingContext("local[2]", "test", Seconds(1))

val tweets = ssc.twitterStream(<username>, <password>)

DStream: a sequence of RDDs representing a stream of data

Page 65: Spark Stream and SEEP

Example 1 (2/3)

I Get hash-tags from Twitter.

val ssc = new StreamingContext("local[2]", "test", Seconds(1))

val tweets = ssc.twitterStream(<username>, <password>)

val hashTags = tweets.flatMap(status => getTags(status))

transformation: modify data in one DStream to create another DStream

Page 66: Spark Stream and SEEP

Example 1 (3/3)

I Get hash-tags from Twitter.

val ssc = new StreamingContext("local[2]", "test", Seconds(1))

val tweets = ssc.twitterStream(<username>, <password>)

val hashTags = tweets.flatMap(status => getTags(status))

hashTags.saveAsHadoopFiles("hdfs://...")

Page 67: Spark Stream and SEEP

Example 2

I Count frequency of words received every second.

val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1))

val lines = ssc.socketTextStream(args(1), args(2).toInt)

val words = lines.flatMap(_.split(" "))

val ones = words.map(x => (x, 1))

val freqs = ones.reduceByKey(_ + _)

Page 68: Spark Stream and SEEP

Example 3

I Count frequency of words received in last minute.

val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1))

val lines = ssc.socketTextStream(args(1), args(2).toInt)

val words = lines.flatMap(_.split(" "))

val ones = words.map(x => (x, 1))

val freqs = ones.reduceByKey(_ + _)

// window arguments: window length, window movement (slide interval)

val freqs_60s = freqs.window(Seconds(60), Seconds(1)).reduceByKey(_ + _)

Page 69: Spark Stream and SEEP

Example 3 - Simpler Model

I Count frequency of words received in last minute.

val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1))

val lines = ssc.socketTextStream(args(1), args(2).toInt)

val words = lines.flatMap(_.split(" "))

val ones = words.map(x => (x, 1))

val freqs_60s = ones.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(1))

Page 70: Spark Stream and SEEP

Example 3 - Incremental Window Operators

I Count frequency of words received in last minute.

// Associative only

freqs_60s = ones.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(1))

// Associative and invertible

freqs_60s = ones.reduceByKeyAndWindow(_ + _, _ - _, Seconds(60), Seconds(1))
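
The difference between the two variants can be illustrated on the per-batch counts of a single word. The sketch below uses plain Scala and hypothetical names (`IncrementalWindowSketch`, `recompute`, `incremental`); it shows the idea behind the invertible form, not how Spark implements it.

```scala
// Illustrative only: windowed sums over per-batch counts of one word.
object IncrementalWindowSketch {
  // counts(i) is the count in batch interval i; windowLen >= 1 batches.

  // Associative only: re-reduce every batch in each window.
  def recompute(counts: Vector[Int], windowLen: Int): Vector[Int] =
    counts.indices.toVector.map { i =>
      counts.slice(math.max(0, i - windowLen + 1), i + 1).reduce(_ + _)
    }

  // Associative and invertible: add the batch entering the window and
  // subtract the batch leaving it, reusing the previous window's result.
  def incremental(counts: Vector[Int], windowLen: Int): Vector[Int] =
    counts.indices.foldLeft(Vector.empty[Int]) { (acc, i) =>
      val prev = acc.lastOption.getOrElse(0)
      val leaving = if (i >= windowLen) counts(i - windowLen) else 0
      acc :+ (prev + counts(i) - leaving)
    }
}
```

Both functions return the same sums, but the incremental form does a constant amount of work per slide instead of work proportional to the window length.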

Page 71: Spark Stream and SEEP

Example 4 - Standalone Application (1/2)

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.storage.StorageLevel

object NetworkWordCount {
  def main(args: Array[String]) {
    ...
    // args(0): master, args(1): hostname, args(2): port
    val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1))
    val lines = ssc.socketTextStream(args(1), args(2).toInt)
    // Split each line into words and count each word per one-second batch.
    val words = lines.flatMap(_.split(" "))
    val ones = words.map(x => (x, 1))
    val freqs = ones.reduceByKey(_ + _)
    freqs.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

Page 72: Spark Stream and SEEP

Example 4 - Standalone Application (2/2)

I sics.sbt:

name := "Stream Word Count"

version := "1.0"

scalaVersion := "2.10.3"

libraryDependencies ++= Seq(

"org.apache.spark" %% "spark-core" % "0.9.0-incubating",

"org.apache.spark" %% "spark-streaming" % "0.9.0-incubating"

)

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"

Page 73: Spark Stream and SEEP

Summary

I SEEP

• Make operator state an external entity.

• Primitives for state management: checkpoint, backup/restore, partition.

I Spark Stream

• Run a streaming computation as a series of very small, deterministic batch jobs.

• DStream: a sequence of RDDs.

• Operators: transformations (stateless, stateful, and window) and output operations.

Page 74: Spark Stream and SEEP

Questions?

Acknowledgements

Some slides and pictures were derived from Matei Zaharia (MIT University) and Peter Pietzuch (Imperial College) slides.
