Spark: Taming Big Data

Leonardo Gamas

Apr 16, 2017

Transcript
Page 1: Spark: Taming Big Data

Taming Big Data

Leonardo Gamas

Page 2: Spark: Taming Big Data

Leonardo Gamas

Software Engineer @ JusBrasil

@leogamas

Page 3: Spark: Taming Big Data

What is Spark?

"Apache Spark™ is a fast and general engine for large-

scale data processing."

Page 4: Spark: Taming Big Data

One engine to rule them all?

Page 5: Spark: Taming Big Data

Spark is Fast

Page 6: Spark: Taming Big Data

Spark is Integrated

Page 7: Spark: Taming Big Data

Spark is Simple

file = spark.textFile("hdfs://...")
counts = (file.flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))

Page 8: Spark: Taming Big Data

Language support

● Scala
● Java
● Python

Page 9: Spark: Taming Big Data

Community

Page 10: Spark: Taming Big Data

Community

Page 11: Spark: Taming Big Data

Who is using Spark?

Page 12: Spark: Taming Big Data

RDD

Resilient Distributed Dataset

"Fault-tolerant collection of elements that can be operated on in parallel"

Page 13: Spark: Taming Big Data

● Transformations
○ RDD => RDD
○ Lazy

● Actions
○ RDD => Stuff
○ Not lazy
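
A minimal sketch of that split, with made-up data: the transformations only record a plan, and the final action triggers the job:

val nums    = sc.parallelize(1 to 1000000)  // source RDD
val evens   = nums.filter(_ % 2 == 0)       // transformation: lazy, returns a new RDD
val doubled = evens.map(_ * 2)              // still lazy, nothing has executed yet
val total   = doubled.count()               // action: runs the distributed computation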

Page 14: Spark: Taming Big Data

RDD Transformations

Page 15: Spark: Taming Big Data

RDD Transformations

● map(func)
● filter(func)
● flatMap(func)
● mapPartitions(func)
● mapPartitionsWithIndex(func)
● sample(withReplacement, fraction, seed)
● union(otherDataset)
● intersection(otherDataset)
● distinct([numTasks])
● groupByKey([numTasks])
● reduceByKey(func, [numTasks])
● aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
● sortByKey([ascending], [numTasks])
● join(otherDataset, [numTasks])
● cogroup(otherDataset, [numTasks])
● cartesian(otherDataset)
● pipe(command, [envVars])
● coalesce(numPartitions)
● repartition(numPartitions)
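
A short sketch chaining a few of these (the pet counts are made-up sample data):

val pets   = sc.parallelize(Seq(("cat", 1), ("dog", 1), ("cat", 2)))
val owners = sc.parallelize(Seq(("cat", "Ana"), ("dog", "Bob")))

val counts = pets.reduceByKey(_ + _)  // ("cat", 3), ("dog", 1)
val joined = counts.join(owners)      // ("cat", (3, "Ana")), ("dog", (1, "Bob"))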

Page 16: Spark: Taming Big Data

RDD Actions

Page 17: Spark: Taming Big Data

RDD Actions

● reduce(func)
● collect()
● count()
● first()
● take(n)
● takeSample(withReplacement, num, [seed])
● takeOrdered(n, [ordering])
● saveAsTextFile(path)
● saveAsSequenceFile(path)
● saveAsObjectFile(path)
● countByKey()
● foreach(func)
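
And a sketch of a few actions; the results in comments assume this tiny dataset:

val nums = sc.parallelize(1 to 100)

nums.count()        // 100
nums.first()        // 1
nums.take(3)        // Array(1, 2, 3)
nums.reduce(_ + _)  // 5050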

Page 18: Spark: Taming Big Data

Distributed

Page 19: Spark: Taming Big Data

Distributed

Page 20: Spark: Taming Big Data

Resilient

Page 21: Spark: Taming Big Data

Resilient

"RDDs track lineage information that can be used to efficiently recompute lost data."

Page 22: Spark: Taming Big Data

Resilient

Page 23: Spark: Taming Big Data

Resilient

Page 24: Spark: Taming Big Data

Resilient

Page 25: Spark: Taming Big Data

Resilient

Page 26: Spark: Taming Big Data

Resilient

Page 27: Spark: Taming Big Data

RDDs are cacheable

[Diagram: without caching, the job accesses disk twice]
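
A minimal caching sketch: the first action computes the RDD and keeps it in memory, so later actions skip the disk entirely:

val data = sc.textFile("hdfs://...")
data.cache()   // mark the RDD for in-memory caching

data.count()   // 1st action: reads from disk, computes, and caches
data.count()   // 2nd action: served from memory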

Page 28: Spark: Taming Big Data

RDDs are immutable

Page 29: Spark: Taming Big Data

RDD Internals

● Partitions (splits)
● Dependencies
● Compute function (how to derive the data)
● Location hints (preferred locations)
● Partitioner
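
These five pieces map roughly onto Spark's internal RDD contract; a simplified sketch (method names follow the internal API, signatures abridged):

import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

abstract class MyRDD[T] {
  def getPartitions: Array[Partition]                            // partitions (splits)
  def getDependencies: Seq[Dependency[_]]                        // parent RDDs
  def compute(split: Partition, ctx: TaskContext): Iterator[T]   // how to derive the data
  def getPreferredLocations(split: Partition): Seq[String]       // location hints
  val partitioner: Option[Partitioner]                           // key partitioning scheme
}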

Page 30: Spark: Taming Big Data

Broadcast Variables
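
A minimal sketch: a broadcast variable ships a read-only value to every executor once, instead of copying it with every task (the lookup table is made up):

val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

val resolved = sc.parallelize(Seq("a", "b", "a", "c"))
  .map(key => lookup.value.getOrElse(key, 0))  // read the broadcast value on executors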

Page 31: Spark: Taming Big Data

Deployment

● Mesos
● YARN
● Standalone

Page 32: Spark: Taming Big Data

Spark Projects

Page 33: Spark: Taming Big Data

Spark Projects

Page 34: Spark: Taming Big Data

Spark Projects

Spark Core

Page 35: Spark: Taming Big Data

Spark Projects

Spark Core
Spark SQL

Page 36: Spark: Taming Big Data

Spark Projects

Spark Core
Spark SQL

Spark Streaming

Page 37: Spark: Taming Big Data

Spark Projects

Spark Core
Spark SQL

Spark Streaming
Spark MLlib

Page 38: Spark: Taming Big Data

Spark Projects

Spark Core
Spark SQL

Spark Streaming
Spark MLlib

Spark GraphX

Page 39: Spark: Taming Big Data

Spark SQL

// Case class defining the schema
case class Person(name: String, age: Int)

// Map an RDD of text lines to Person objects
val people = sc.textFile("...")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

// Register as a table
people.registerTempTable("people")

// Query
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
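
The result is itself an RDD of rows, so normal RDD operations apply, e.g.:

teenagers.map(t => "Name: " + t(0)).collect().foreach(println)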

Page 40: Spark: Taming Big Data

Spark SQL - Hive

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Create table and load data
sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH '...' INTO TABLE src")

// Queries are expressed in HiveQL; collect the rows to print them on the driver
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)

Page 41: Spark: Taming Big Data

Spark Streaming

Page 42: Spark: Taming Big Data

Spark Streaming

Page 43: Spark: Taming Big Data

Spark Streaming

val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)

val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

wordCounts.print()  // at least one output operation is required

ssc.start()
ssc.awaitTermination()
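
To try it locally, feed the socket from another terminal with netcat (nc -lk 9999) and type words; the counts are printed once per one-second batch.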

Page 44: Spark: Taming Big Data

Spark Streaming

Page 45: Spark: Taming Big Data

Spark Streaming

def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  // Add the new values to the previous running count
  val newCount = newValues.sum + runningCount.getOrElse(0)
  Some(newCount)
}

val runningCounts = pairs.updateStateByKey[Int](updateFunction _)
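
Note: stateful operators like updateStateByKey require a checkpoint directory, e.g. ssc.checkpoint("..."), so Spark can periodically cut the lineage of the growing state.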

Page 46: Spark: Taming Big Data

Spark Streaming

Page 47: Spark: Taming Big Data

Spark MLlib

MLlib is Apache Spark's scalable machine learning library.

Page 48: Spark: Taming Big Data

MLlib - Algorithms

● Linear Algebra
● Basic Statistics
● Classification and Regression
○ Linear models (SVM, logistic regression, linear regression)
○ Decision trees
○ Naive Bayes

Page 49: Spark: Taming Big Data

MLlib - Algorithms

● Collaborative filtering (ALS)
● Clustering (K-Means)
● Dimensionality reduction (SVD and PCA)
● Feature extraction and transformation
● Optimization (SGD, L-BFGS)

Page 50: Spark: Taming Big Data

MLlib - K-Means

points = spark.textFile("hdfs://...").map(parsePoint)

model = KMeans.train(points, k=10)

cluster = model.predict(testPoint)

Page 51: Spark: Taming Big Data

MLlib - ALS Recommendation

val data = sc.textFile("...")
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) =>
    Rating(user.toInt, item.toInt, rate.toDouble)
})

// Build the recommendation model using ALS
val rank = 10
val numIterations = 20
val model = ALS.train(ratings, rank, numIterations, 0.01)

val recommendations = model.recommendProducts(userId, 10)

Page 52: Spark: Taming Big Data

MLlib - Naive Bayes

val data = sc.textFile("data/mllib/sample_naive_bayes_data.txt")
val training = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}
val model = NaiveBayes.train(training, lambda = 1.0)
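
Once trained, the model classifies new points with predict; a sketch assuming a test vector with the training data's dimensionality:

val prediction = model.predict(Vectors.dense(0.0, 1.0, 0.0))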

Page 53: Spark: Taming Big Data

GraphX is Apache Spark's API for graphs and graph-parallel computation.
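
A minimal construction sketch: a property graph is built from a vertex RDD and an edge RDD (the people and "follows" edges are made-up sample data):

import org.apache.spark.graphx._

val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

val graph = Graph(vertices, edges)
graph.numVertices  // 3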

Page 54: Spark: Taming Big Data

GraphX

Page 55: Spark: Taming Big Data

GraphX

Page 56: Spark: Taming Big Data

GraphX

Page 57: Spark: Taming Big Data

GraphX

Page 58: Spark: Taming Big Data

GraphX - Algorithms

● Connected Components
● Triangle Count
● Strongly Connected Components
● PageRank

Page 59: Spark: Taming Big Data

GraphX - PageRank

// Run PageRank
val ranks = graph.pageRank(0.0001).vertices

// Join the ranks with the usernames
val users = sc.textFile("...").map { ... => (id, username) }
val ranksByUsername = users.join(ranks).map {
  case (id, (username, rank)) => (username, rank)
}

// Print the result
println(ranksByUsername.collect().mkString("\n"))

Page 60: Spark: Taming Big Data

Questions?