Page 1: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

Spark Streaming

March 2015

A Brief Introduction for Developers

@StratioBD

Page 2: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

Who am I?

SPARK STREAMING OVERVIEW

Big Data Developer at Stratio. Working on ingestion and streaming projects with Spark Streaming and Apache Flume. Currently researching on Spark SQL optimizations and other stuff.

Santiago Mola

@mola_io

Page 3: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

INDEX

1. SPARK

• What is Apache Spark?

• RDD

• RDD API

2. SPARK STREAMING

• What is Spark Streaming?

• Who uses it?

• Receivers

• Discretized Streams (DStream)

• Window functions

• Use case: Twitter text classification

Page 4: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

1. WHAT IS APACHE SPARK?

Page 5: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

1.1. What is Apache Spark?

Spark Streaming: A Brief Introduction for Developers @ Berlin, March 2015

Apache Spark™ is a fast and general engine for large-scale data processing.

“The Spark engine runs in a variety of environments, from cloud services to Hadoop or Mesos clusters. It is used to perform ETL, interactive queries (SQL), advanced analytics (e.g. machine learning) and streaming over large datasets in a wide range of data stores (e.g. HDFS, Cassandra, HBase, S3). Spark supports a variety of popular development languages including Java, Python and Scala.”

Databricks – What is Spark? https://databricks.com/spark/about

Page 6: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

1.1. What is Apache Spark?

Page 7: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

1.1. What does it look like?

Let’s count words…

val textFile = spark.textFile("hdfs://...")
val counts = textFile
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Page 8: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

1.2. Resilient Distributed Dataset (RDD)

An RDD is a collection of elements that is immutable, distributed and fault-tolerant.

Transformations can be applied to an RDD, resulting in a new RDD.

Actions can be applied to an RDD to obtain a value.

RDDs are lazy: transformations are only computed when an action needs their result.

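The laziness can be observed with a side-effecting transformation: nothing runs until an action asks for a result. A minimal sketch, assuming a SparkContext `sc` (e.g. created with master "local[*]"):

```scala
// Assumes a SparkContext `sc`.
val data = sc.parallelize(Seq(1, 2, 3, 4))

// A transformation is only recorded in the lineage; nothing executes yet.
val doubled = data.map { x => println(s"mapping $x"); x * 2 }

// The action triggers the actual computation; only now do the
// "mapping ..." lines print on the executors.
val total = doubled.reduce(_ + _)  // total == 20
```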

Page 9: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

1.2. Resilient Distributed Dataset (RDD)

RDD[String] (textFile):
“hello world”, “foo bar”, “foo foo bar”, “bye world”

RDD[String] (flatMap):
“hello”, “world”, “foo”, “bar”, “foo”, “foo”, “bar”, “bye”, “world”

RDD[(String,Int)] (map):
(“hello”, 1), (“world”, 1), (“foo”, 1), (“bar”, 1), (“foo”, 1), (“foo”, 1), (“bar”, 1), (“bye”, 1), (“world”, 1)

RDD[(String,Int)] (reduceByKey):
(“hello”, 1), (“foo”, 3), (“bar”, 2), (“bye”, 1), (“world”, 2)

Page 10: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

1.2. Resilient Distributed Dataset (RDD)

val textFile : RDD[String] = spark.textFile("hdfs://...")
val flatMapped : RDD[String] = textFile.flatMap(line => line.split(" "))
val mapped : RDD[(String,Int)] = flatMapped.map(word => (word, 1))
val counts : RDD[(String,Int)] = mapped.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Page 11: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

1.3. RDD API

map(func), filter(func), flatMap(func), mapPartitions(func), mapPartitionsWithIndex(func), sample(withReplacement, fraction, seed), union(otherDataset), intersection(otherDataset), distinct([numTasks]), groupByKey([numTasks]), reduceByKey(func, [numTasks]), aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]), sortByKey([ascending], [numTasks]), join(otherDataset, [numTasks]), cogroup(otherDataset, [numTasks]), cartesian(otherDataset), pipe(command, [envVars]), coalesce(numPartitions), repartition(numPartitions), repartitionAndSortWithinPartitions(partitioner)

Transformations

reduce(func), collect(), count(), first(), take(n), takeSample(withReplacement, num, [seed]), takeOrdered(n, [ordering]), saveAsTextFile(path), saveAsSequenceFile(path), saveAsObjectFile(path), countByKey(), foreach(func)

Actions

Full docs: https://spark.apache.org/docs/latest/programming-guide.html
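A few of the listed operations combined, as a minimal sketch assuming a SparkContext `sc`:

```scala
// Assumes a SparkContext `sc`.
val nums = sc.parallelize(1 to 10)

val evens = nums.filter(_ % 2 == 0)   // transformation: keeps 2, 4, 6, 8, 10
val howMany = evens.count()           // action: 5
// takeOrdered with a reversed ordering returns the largest elements first.
val top3 = evens.takeOrdered(3)(Ordering[Int].reverse)  // Array(10, 8, 6)
```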

Page 12: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

1. Recap…

• Apache Spark is an awesome, distributed, fault-tolerant, easy-to-use processing engine.

• The most important concept is the RDD, an immutable and distributed collection of elements.

• The RDD API provides many high-level transformations that make distributed processing easier.

• On top of the Spark core, we have MLlib (machine learning), Spark SQL (query engine), GraphX (graph algorithms) and… Spark Streaming (stream processing)!

Page 13: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

2. SPARK STREAMING

Page 14: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

2.1 What is Spark Streaming?

Source: http://spark.apache.org/docs/latest/streaming-programming-guide.html

Page 15: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

2.1 What is Spark Streaming?

Source: http://spark.apache.org/docs/latest/streaming-programming-guide.html

Page 16: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

2.1 Who uses it?

Source: http://es.slideshare.net/pacoid/databricks-meetup-los-angeles-apache-spark-user-group

Page 17: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

2.2. Receivers

• File Stream

• Sockets

• Actors (Akka)

• Queue RDDs (Testing)

Page 18: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

2.2. Receivers

Twitter

Flume

Kafka

Kinesis

Page 19: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

2.2. Discretized streams (DStream)

Spark Streaming does not operate on a continuous live stream, but on a discretized representation of it.

A DStream (discretized stream) represents a sequence of RDDs, each corresponding to one micro-batch.
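The StreamingContext fixes the micro-batch interval; every DStream it produces then yields one RDD per interval. A minimal sketch (app name and port are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]")
// One micro-batch (one RDD) every 2 seconds.
val ssc = new StreamingContext(conf, Seconds(2))

val lines = ssc.socketTextStream("localhost", 9000)

// Each micro-batch can be processed with the plain RDD API.
lines.foreachRDD { rdd =>
  println(s"batch with ${rdd.count()} lines")
}

ssc.start()
ssc.awaitTermination()
```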

Page 20: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

2.3. What does it look like?

Let’s count words… again…

val textStream = ssc.socketTextStream("localhost", 9000)
val counts = textStream
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.print()

Page 21: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

2.2. Discretized streams (DStream)

Page 22: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

2.2. Window operations

Page 23: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

2.3. What does it look like?

Let’s count words… printing, every 10 seconds, the counts over the last 60 seconds

val textStream = ssc.socketTextStream("localhost", 9000)
val counts = textStream
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))
counts.print()

Page 24: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

2.4. Twitter text classification

println("Initializing Streaming Spark Context...")
val conf = new SparkConf().setAppName(this.getClass.getSimpleName)
val ssc = new StreamingContext(conf, Seconds(5))

println("Initializing Twitter stream...")
val tweets = TwitterUtils.createStream(ssc, Utils.getAuth)
val statuses = tweets.map(_.getText)

println("Initializing the KMeans model...")
val model =
  new KMeansModel(ssc.sparkContext.objectFile[Vector](modelFile.toString).collect())

val filteredTweets = statuses
  .filter(t => model.predict(Utils.featurize(t)) == clusterNumber)
filteredTweets.print()

Source: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/predict.html
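`Utils.getAuth` and `Utils.featurize` come from the referenced Databricks example app, not from Spark itself. A hypothetical featurizer in the same spirit, using MLlib's `HashingTF` (the object name, method name and feature count here are assumptions):

```scala
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector

object Utils {
  // Hash each lowercased word into a fixed-size term-frequency vector.
  private val tf = new HashingTF(1000)

  def featurize(text: String): Vector =
    tf.transform(text.toLowerCase.split(" ").toSeq)
}
```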

Page 25: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

2.5. Recap…

• Spark Streaming uses a discretized representation of live streams, where each batch is an RDD.

• Data can be received from a wide variety of sources.

• The streaming API resembles the RDD API: learning it is trivial for Spark (batch) users.

• The streaming API offers a wide variety of high-level transformations (most RDD transformations, plus window transformations).

• It can be combined with the RDD API, which means integration with MLlib (machine learning), GraphX (graph algorithms), RDD persistence or any other Spark component.

Page 26: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

Thanks!

www.stratio.com

https://github.com/Stratio

@StratioBD

Page 28: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

SUPPORT SLIDES

Page 29: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

RDD, Stages…

Page 30: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

StreamingContext

Page 31: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

Checkpointing

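Stateful and windowed operations need a checkpoint directory so lineage and state can be recovered after a driver failure. A minimal sketch (the app name and HDFS path are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs://namenode:8020/checkpoints/app"  // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpoint-sketch")
  val ssc = new StreamingContext(conf, Seconds(5))
  // DStream metadata and generated RDDs are periodically saved here,
  // enabling recovery of stateful/windowed computations after a failure.
  ssc.checkpoint(checkpointDir)
  ssc
}

// Reuse a checkpointed context if one exists, otherwise build a fresh one.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
```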
Page 32: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

Transform

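`transform` applies an arbitrary RDD-to-RDD function to every micro-batch, which is how batch-only operations (such as joining against static data) reach the streaming API. A minimal sketch, assuming a StreamingContext `ssc` (the port and the word list are made up for illustration):

```scala
// Assumes a StreamingContext `ssc`.
val words = ssc.socketTextStream("localhost", 9000)
  .flatMap(_.split(" "))
  .map(w => (w, 1))

// A static pair RDD to join against (contents are illustrative).
val spamWords = ssc.sparkContext.parallelize(Seq(("viagra", 1), ("lottery", 1)))

// `transform` exposes each micro-batch as a plain RDD, so RDD-only
// operations like joins against static data become available.
val flagged = words.transform(rdd => rdd.join(spamWords))
flagged.print()
```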