Page 1: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

Spark Streaming

March 2015

A Brief Introduction for Developers

@StratioBD

Page 2: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

Who am I?

SPARK STREAMING OVERVIEW

Big Data Developer at Stratio. Working on ingestion and streaming projects with Spark Streaming and Apache Flume. Currently researching on Spark SQL optimizations and other stuff.

Santiago Mola

@mola_io

Page 3: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

INDEX

1. SPARK

• What is Apache Spark?

• RDD

• RDD API

2. SPARK STREAMING

• What is Spark Streaming?

• Who uses it?

• Receivers

• Discretized Streams (DStream)

• Window functions

• Use case: Twitter text classification

Page 4: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

1. WHAT IS APACHE SPARK?

Page 5: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

1.1. What is Apache Spark?

Spark Streaming: A Brief Introduction for Developers @ Berlin, March 2015

Apache Spark™ is a fast and general engine for large-scale data processing.

“The Spark engine runs in a variety of environments, from cloud services to Hadoop or Mesos clusters. It is used to perform ETL, interactive queries (SQL), advanced analytics (e.g. machine learning) and streaming over large datasets in a wide range of data stores (e.g. HDFS, Cassandra, HBase, S3). Spark supports a variety of popular development languages including Java, Python and Scala.”

Databricks – What is Spark? https://databricks.com/spark/about

Page 6: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

1.1. What is Apache Spark?

Page 7: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

1.1. What does it look like?

Let’s count words…

val textFile = spark.textFile("hdfs://...")
val counts = textFile
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Page 8: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

1.2. Resilient Distributed Dataset (RDD)

An RDD is a collection of elements that is immutable, distributed and fault-tolerant.

Transformations can be applied to an RDD, resulting in a new RDD.

Actions can be applied to an RDD to obtain a value.

RDDs are lazy: transformations are only computed when an action needs their result.

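The laziness can be observed with a side-effecting transformation: nothing runs until an action asks for a result. A minimal sketch, assuming a SparkContext `sc` (e.g. created with master "local[*]"):

```scala
// Assumes a SparkContext `sc`.
val data = sc.parallelize(Seq(1, 2, 3, 4))

// A transformation is only recorded in the lineage; nothing executes yet.
val doubled = data.map { x => println(s"mapping $x"); x * 2 }

// The action triggers the actual computation; only now do the
// "mapping ..." lines print on the executors.
val total = doubled.reduce(_ + _)  // total == 20
```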

Page 9: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

1.2. Resilient Distributed Dataset (RDD)

RDD[String] (textFile):
“hello world”, “foo bar”, “foo foo bar”, “bye world”

RDD[String] (flatMap):
“hello”, “world”, “foo”, “bar”, “foo”, “foo”, “bar”, “bye”, “world”

RDD[(String,Int)] (map):
(“hello”, 1), (“world”, 1), (“foo”, 1), (“bar”, 1), (“foo”, 1), (“foo”, 1), (“bar”, 1), (“bye”, 1), (“world”, 1)

RDD[(String,Int)] (reduceByKey):
(“hello”, 1), (“foo”, 3), (“bar”, 2), (“bye”, 1), (“world”, 2)

Page 10: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

1.2. Resilient Distributed Dataset (RDD)

val textFile : RDD[String] = spark.textFile("hdfs://...")
val flatMapped : RDD[String] = textFile.flatMap(line => line.split(" "))
val mapped : RDD[(String,Int)] = flatMapped.map(word => (word, 1))
val counts : RDD[(String,Int)] = mapped.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Page 11: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

1.3. RDD API

map(func), filter(func), flatMap(func), mapPartitions(func), mapPartitionsWithIndex(func), sample(withReplacement, fraction, seed), union(otherDataset), intersection(otherDataset), distinct([numTasks]), groupByKey([numTasks]), reduceByKey(func, [numTasks]), aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]), sortByKey([ascending], [numTasks]), join(otherDataset, [numTasks]), cogroup(otherDataset, [numTasks]), cartesian(otherDataset), pipe(command, [envVars]), coalesce(numPartitions), repartition(numPartitions), repartitionAndSortWithinPartitions(partitioner)

Transformations

reduce(func), collect(), count(), first(), take(n), takeSample(withReplacement, num, [seed]), takeOrdered(n, [ordering]), saveAsTextFile(path), saveAsSequenceFile(path), saveAsObjectFile(path), countByKey(), foreach(func)

Actions

Full docs: https://spark.apache.org/docs/latest/programming-guide.html
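A few of the listed operations combined, as a minimal sketch assuming a SparkContext `sc`:

```scala
// Assumes a SparkContext `sc`.
val nums = sc.parallelize(1 to 10)

val evens = nums.filter(_ % 2 == 0)   // transformation: keeps 2, 4, 6, 8, 10
val howMany = evens.count()           // action: 5
// takeOrdered with a reversed ordering returns the largest elements first.
val top3 = evens.takeOrdered(3)(Ordering[Int].reverse)  // Array(10, 8, 6)
```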

Page 12: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

1. Recap…

• Apache Spark is an awesome, distributed, fault-tolerant, easy-to-use processing engine.

• The most important concept is the RDD, an immutable and distributed collection of elements.

• The RDD API provides many high-level transformations that make distributed processing easier.

• On top of the Spark core, we have MLlib (machine learning), Spark SQL (query engine), GraphX (graph algorithms) and… Spark Streaming (stream processing)!

Page 13: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

2. SPARK STREAMING

Page 14: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

2.1 What is Spark Streaming?

Source: http://spark.apache.org/docs/latest/streaming-programming-guide.html

Page 15: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

2.1 What is Spark Streaming?

Source: http://spark.apache.org/docs/latest/streaming-programming-guide.html

Page 16: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

2.1 Who uses it?

Source: http://es.slideshare.net/pacoid/databricks-meetup-los-angeles-apache-spark-user-group

Page 17: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

2.2. Receivers

• File Stream

• Sockets

• Actors (Akka)

• Queue RDDs (Testing)

Page 18: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

2.2. Receivers

Twitter

Flume

Kafka

Kinesis

Page 19: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

2.2. Discretized streams (DStream)

Spark Streaming does not operate on a continuous live stream, but on a discretized representation of it.

A DStream (discretized stream) represents a sequence of RDDs, each corresponding to one micro-batch.
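The StreamingContext fixes the micro-batch interval; every DStream it produces then yields one RDD per interval. A minimal sketch (app name and port are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]")
// One micro-batch (one RDD) every 2 seconds.
val ssc = new StreamingContext(conf, Seconds(2))

val lines = ssc.socketTextStream("localhost", 9000)

// Each micro-batch can be processed with the plain RDD API.
lines.foreachRDD { rdd =>
  println(s"batch with ${rdd.count()} lines")
}

ssc.start()
ssc.awaitTermination()
```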

Page 20: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

2.3. What does it look like?

Let’s count words… again…

val textStream = ssc.socketTextStream("localhost", 9000)
val counts = textStream
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.print()

Page 21: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

2.2. Discretized streams (DStream)

Page 22: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

2.2. Window operations

Page 23: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

2.3. What does it look like?

Let’s count words… printing, every 10 seconds, the counts over the last 60 seconds

val textStream = ssc.socketTextStream("localhost", 9000)
val counts = textStream
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))
counts.print()

Page 24: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

2.4. Twitter text classification

println("Initializing Streaming Spark Context...")
val conf = new SparkConf().setAppName(this.getClass.getSimpleName)
val ssc = new StreamingContext(conf, Seconds(5))

println("Initializing Twitter stream...")
val tweets = TwitterUtils.createStream(ssc, Utils.getAuth)
val statuses = tweets.map(_.getText)

println("Initializing the KMeans model...")
val model =
  new KMeansModel(ssc.sparkContext.objectFile[Vector](modelFile.toString).collect())

val filteredTweets = statuses
  .filter(t => model.predict(Utils.featurize(t)) == clusterNumber)
filteredTweets.print()

Source: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/predict.html
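`Utils.getAuth` and `Utils.featurize` come from the referenced Databricks example app, not from Spark itself. A hypothetical featurizer in the same spirit, using MLlib's `HashingTF` (the object name, method name and feature count here are assumptions):

```scala
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector

object Utils {
  // Hash each lowercased word into a fixed-size term-frequency vector.
  private val tf = new HashingTF(1000)

  def featurize(text: String): Vector =
    tf.transform(text.toLowerCase.split(" ").toSeq)
}
```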

Page 25: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

2.5. Recap…

• Spark Streaming uses a discretized representation of live streams, where each batch is an RDD.

• Data can be received from a wide variety of sources.

• The streaming API resembles the RDD API: learning it is trivial for Spark (batch) users.

• The streaming API offers a wide variety of high-level transformations (most RDD transformations, plus window transformations).

• It can be combined with the RDD API, which means integration with MLlib (machine learning), GraphX (graph algorithms), RDD persistence or any other Spark component.

Page 26: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

Thanks!

www.stratio.com

https://github.com/Stratio

@StratioBD

Page 28: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

SUPPORT SLIDES

Page 29: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

RDD, Stages…

Page 30: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

StreamingContext

Page 31: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

Checkpointing

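Stateful and windowed operations need a checkpoint directory so lineage and state can be recovered after a driver failure. A minimal sketch (the app name and HDFS path are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs://namenode:8020/checkpoints/app"  // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpoint-sketch")
  val ssc = new StreamingContext(conf, Seconds(5))
  // DStream metadata and generated RDDs are periodically saved here,
  // enabling recovery of stateful/windowed computations after a failure.
  ssc.checkpoint(checkpointDir)
  ssc
}

// Reuse a checkpointed context if one exists, otherwise build a fresh one.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
```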
Page 32: Spark Streaming @ Berlin Apache Spark Meetup, March 2015

Transform

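`transform` applies an arbitrary RDD-to-RDD function to every micro-batch, which is how batch-only operations (such as joining against static data) reach the streaming API. A minimal sketch, assuming a StreamingContext `ssc` (the port and the word list are made up for illustration):

```scala
// Assumes a StreamingContext `ssc`.
val words = ssc.socketTextStream("localhost", 9000)
  .flatMap(_.split(" "))
  .map(w => (w, 1))

// A static pair RDD to join against (contents are illustrative).
val spamWords = ssc.sparkContext.parallelize(Seq(("viagra", 1), ("lottery", 1)))

// `transform` exposes each micro-batch as a plain RDD, so RDD-only
// operations like joins against static data become available.
val flagged = words.transform(rdd => rdd.join(spamWords))
flagged.print()
```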