
Intro to Spark - for Denver Big Data Meetup

Sep 08, 2014

Transcript
Page 1: Intro to Spark - for Denver Big Data Meetup

1

Introduction to Spark
Gwen Shapira, Solutions Architect

Page 2: Intro to Spark - for Denver Big Data Meetup

2

Spark is next-generation MapReduce

Page 3: Intro to Spark - for Denver Big Data Meetup

3

MapReduce has been around for a while. It made distributed compute easier.

But, can we do better?

Page 4: Intro to Spark - for Denver Big Data Meetup

4

MapReduce Issues

• Launching mappers and reducers takes time
• One MR job can rarely do a full computation
• Writing to disk (in triplicate!) between each job
• Going back to the queue between jobs
• No in-memory caching
• No iterations
• Very high latency
• Not the greatest APIs either

Page 5: Intro to Spark - for Denver Big Data Meetup

5

Spark: Easy to Develop, Fast to Run

Page 6: Intro to Spark - for Denver Big Data Meetup

6

Spark Features

• In-memory cache
• General execution graphs
• APIs in Scala, Java and Python
• Integrates with, but does not depend on, Hadoop

Page 7: Intro to Spark - for Denver Big Data Meetup

7

Why is it better?

• (Much) faster than MR
• Iterative programming – must have for ML
• Interactive – allows rapid exploratory analytics
• Flexible execution graph:
  • Map, map, reduce, reduce, reduce, map
• High productivity compared to MapReduce

Page 8: Intro to Spark - for Denver Big Data Meetup

8

Word Count

val file = spark.textFile("hdfs://...")

file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)

Remember MapReduce WordCount?
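As written, this chain only defines transformations; nothing executes until an action runs. A minimal sketch of actually producing output (the output path is just a placeholder):

val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://.../wordcounts")   // action: triggers the computation and writes results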

Page 9: Intro to Spark - for Denver Big Data Meetup

9

Agenda

• Concepts
• Examples
• Streaming
• Summary

Page 10: Intro to Spark - for Denver Big Data Meetup

10

Concepts

Page 11: Intro to Spark - for Denver Big Data Meetup

11

AMPLab BDAS (Berkeley Data Analytics Stack)

Page 12: Intro to Spark - for Denver Big Data Meetup

12

CDH5 (simplified)

[Diagram: simplified CDH5 stack – HDFS + in-memory cache at the bottom, YARN above it, and Spark, MR and Impala on top, with Spark Streaming and MLlib as Spark libraries]

Page 13: Intro to Spark - for Denver Big Data Meetup

13

How Spark runs on a Cluster

[Diagram: a Driver sends Tasks to several Workers; each Worker holds a partition of the Data in RAM and returns Results to the Driver]

Page 14: Intro to Spark - for Denver Big Data Meetup

14

Workflow

• SparkContext in driver connects to Master
• Master allocates resources for app on cluster
• SC acquires executors on worker nodes
• SC sends the app code (JAR) to executors
• SC sends tasks to executors
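A minimal sketch of this setup in Scala (Spark 1.x style; the app name and master URL below are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// Driver-side setup: the SparkContext connects to the cluster master,
// which allocates executors on worker nodes for this application.
val conf = new SparkConf()
  .setAppName("MyApp")                    // placeholder app name
  .setMaster("spark://master-host:7077")  // placeholder master URL
val sc = new SparkContext(conf)

// From here on, every job submitted through sc is split into tasks
// that the SparkContext ships to the executors.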

Page 15: Intro to Spark - for Denver Big Data Meetup

15

RDD – Resilient Distributed Dataset

• Collection of elements
• Read-only
• Partitioned
• Fault-tolerant
• Supports parallel operations

Page 16: Intro to Spark - for Denver Big Data Meetup

16

RDD Types

• Parallelized collection
  • parallelize(Seq)

• HDFS files
  • Text, Sequence or any InputFormat

• Both support the same operations
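A minimal sketch of creating both kinds of RDDs, assuming a SparkContext named sc (the file path is a placeholder):

// Parallelized collection: distribute a local Seq across the cluster
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// HDFS file: one record per line of text
val lines = sc.textFile("hdfs://namenode/path/to/file.txt")   // placeholder path

// Both expose the same operations
val doubled   = numbers.map(_ * 2).collect()
val lineCount = lines.count()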

Page 17: Intro to Spark - for Denver Big Data Meetup

17

Operations

Transformations
• Map
• Filter
• Sample
• Join
• ReduceByKey
• GroupByKey
• Distinct

Actions
• Reduce
• Collect
• Count
• First, Take
• SaveAs
• CountByKey
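A small sketch contrasting the two kinds of operations, assuming an existing RDD of words named words:

// Transformations build new RDDs lazily; nothing runs yet
val pairs      = words.map(word => (word, 1))     // map
val counts     = pairs.reduceByKey(_ + _)         // reduceByKey
val nontrivial = counts.filter(_._2 > 1)          // filter

// Actions trigger the actual computation and return results to the driver
val total = nontrivial.count()                    // count
val top   = nontrivial.take(5)                    // take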

Page 18: Intro to Spark - for Denver Big Data Meetup

18

Transformations are lazy

Page 19: Intro to Spark - for Denver Big Data Meetup

19

Lazy transformation

• Find all lines that mention "MySQL"
• Keep only the timestamp portion of the line
• Set the date and hour as key, 1 as value
• Now reduce by key and sum the values
• Return the result as an Array so I can print it

(The driver just notes "Find lines, get timestamp…" for each step; only at the last step, the action, does Spark go "Aha! Finally something to do!")
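A rough Scala sketch of that chain; the log path and the way the timestamp and date-hour key are extracted are purely illustrative:

val lines = sc.textFile("hdfs://.../mysql.log")              // placeholder path

// All of these are lazy transformations: Spark only records the plan
val mysqlLines = lines.filter(_.contains("MySQL"))
val timestamps = mysqlLines.map(_.split(" ")(0))             // assume timestamp is the first field
val keyed      = timestamps.map(ts => (ts.take(13), 1))      // date + hour as key (illustrative)
val counts     = keyed.reduceByKey(_ + _)

// collect() is an action: only now does Spark read the file and run the chain
counts.collect().foreach(println)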

Page 20: Intro to Spark - for Denver Big Data Meetup

20

Persistence / Caching

• Store RDD in memory for later use
• Each node persists a partition
• persist() marks an RDD for caching
• It will be cached the first time an action is performed

• Use for iterative algorithms
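A minimal sketch of caching for an iterative job (the dataset, parsing and loop body are placeholders):

// Without cache(), every iteration would re-read and re-parse the input
val points = sc.textFile("hdfs://.../points.txt")            // placeholder path
               .map(_.split(",").map(_.toDouble))
               .cache()                                       // materialized on the first action

for (i <- 1 to 10) {
  // each pass reuses the in-memory partitions instead of going back to HDFS
  val sum = points.map(_.sum).reduce(_ + _)
  println(s"iteration $i: $sum")
}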

Page 21: Intro to Spark - for Denver Big Data Meetup

21

Caching – Storage Levels

• MEMORY_ONLY
• MEMORY_AND_DISK
• MEMORY_ONLY_SER
• MEMORY_AND_DISK_SER
• DISK_ONLY
• MEMORY_ONLY_2, MEMORY_AND_DISK_2, …
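Choosing a level is a one-line change; a sketch assuming an existing RDD named sessions:

import org.apache.spark.storage.StorageLevel

// Keep deserialized objects in memory, spill partitions to disk if they don't fit.
// (cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).)
sessions.persist(StorageLevel.MEMORY_AND_DISK)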

Page 22: Intro to Spark - for Denver Big Data Meetup

22

Fault Tolerance

• Lost partitions can be re-computed from source data
• Because we remember all transformations

msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])

[Lineage: HDFS File → filter(func = startswith(…)) → Filtered RDD → map(func = split(…)) → Mapped RDD]

Page 23: Intro to Spark - for Denver Big Data Meetup

23

Examples

Page 24: Intro to Spark - for Denver Big Data Meetup

24

Word Count

val file = spark.textFile("hdfs://...")

file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)

Remember MapReduce WordCount?

Page 25: Intro to Spark - for Denver Big Data Meetup

25

Log Mining

• Load error messages from a log into memory
• Interactively search for patterns

Page 26: Intro to Spark - for Denver Big Data Meetup

26

Log Mining

val lines = spark.textFile("hdfs://...")                 // Base RDD
val errors = lines.filter(_.startsWith("ERROR"))         // Transformed RDD
val messages = errors.map(_.split("\t")(2))

val cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count               // Action
cachedMsgs.filter(_.contains("bar")).count
...

Page 27: Intro to Spark - for Denver Big Data Meetup

27

Logistic Regression

• Read two sets of points
• Look for a plane W that separates them
• Perform gradient descent:
  • Start with a random W
  • On each iteration, sum a function of W over the data
  • Move W in a direction that improves it

Page 28: Intro to Spark - for Denver Big Data Meetup

28

Intuition

Page 29: Intro to Spark - for Denver Big Data Meetup

29

Logistic Regression

val points = spark.textFile(...).map(parsePoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final separating plane: " + w)

Page 30: Intro to Spark - for Denver Big Data Meetup

30

Conviva Use-Case

• Monitor online video consumption
• Analyze trends

Need to run tens of queries like this a day:

SELECT videoName, COUNT(1)
FROM summaries
WHERE date = '2011_12_12' AND customer = 'XYZ'
GROUP BY videoName;

Page 31: Intro to Spark - for Denver Big Data Meetup

31

Conviva With Spark

val sessions = sparkContext.sequenceFile[SessionSummary,NullWritable](pathToSessionSummaryOnHdfs)

val cachedSessions = sessions.filter(whereConditionToFilterSessions).cache

val mapFn: SessionSummary => (String, Long) = { s => (s.videoName, 1) }
val reduceFn: (Long, Long) => Long = { (a, b) => a + b }

val results = cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap

Page 32: Intro to Spark - for Denver Big Data Meetup

32

Streaming

Page 33: Intro to Spark - for Denver Big Data Meetup

33

What is it?

• Extension of the Spark API
• For high-throughput, fault-tolerant processing of live data streams

Page 34: Intro to Spark - for Denver Big Data Meetup

34

Sources & Outputs

Sources:
• Kafka
• Flume
• Twitter
• JMS Queues
• TCP sockets

Outputs:
• HDFS
• Databases
• Dashboards
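A minimal sketch of wiring one of these sources in, using the Spark Streaming 1.x Kafka receiver (requires the spark-streaming-kafka artifact; the ZooKeeper quorum, consumer group and topic below are placeholders):

import org.apache.spark.streaming.kafka.KafkaUtils

// Receive (key, message) pairs from Kafka; "logs" -> 1 means one receiver thread for that topic
val kafkaStream = KafkaUtils.createStream(
  ssc,                      // an existing StreamingContext
  "zk-host:2181",           // placeholder ZooKeeper quorum
  "my-consumer-group",      // placeholder consumer group id
  Map("logs" -> 1))         // placeholder topic -> #threads

// Output side: e.g. save each batch of messages to HDFS
kafkaStream.map(_._2).saveAsTextFiles("hdfs://.../stream-out")   // placeholder path prefix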

Page 35: Intro to Spark - for Denver Big Data Meetup

35

Architecture

[Diagram: Input → StreamingContext → SparkContext]

Page 36: Intro to Spark - for Denver Big Data Meetup

36

DStreams

• Stream is broken down into micro-batches
• Each micro-batch is an RDD
• This means any Spark function or library can apply to a stream
  • Including MLlib, graph processing, etc.
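A short sketch of treating each micro-batch as an ordinary RDD, assuming an existing DStream of log lines named lines (pre-1.3 Spark may also need import org.apache.spark.SparkContext._ for the pair functions):

// transform() applies an arbitrary RDD-to-RDD function to every micro-batch
val errorCounts = lines.transform { rdd =>
  rdd.filter(_.contains("ERROR"))
     .map(line => (line.split(" ")(0), 1))
     .reduceByKey(_ + _)
}

// foreachRDD gives direct access to each batch RDD, so any Spark code or library can run on it
errorCounts.foreachRDD { rdd =>
  rdd.take(10).foreach(println)
}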

Page 37: Intro to Spark - for Denver Big Data Meetup

37

Processing DStreams

Page 38: Intro to Spark - for Denver Big Data Meetup

38

Processing DStreams - Stateless

Page 39: Intro to Spark - for Denver Big Data Meetup

39

Processing DStreams - Stateful

Page 40: Intro to Spark - for Denver Big Data Meetup

40

DStream Operators

• Transformation: produce a DStream from one or more parent streams
  • Stateless (independent per interval): map, reduce
  • Stateful (share data across intervals): window, incremental aggregation, time-skewed join

• Output: write data to an external system (e.g. save RDD to HDFS): save, foreach
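A small sketch of a stateful window operator, assuming a DStream of (word, 1) pairs named pairs and a batch interval that divides the durations below:

import org.apache.spark.streaming.Seconds

// Count words over the last 30 seconds of data, recomputed every 10 seconds
val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // how to combine counts within the window
  Seconds(30),                 // window length
  Seconds(10))                 // sliding interval

windowedCounts.print()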

Page 41: Intro to Spark - for Denver Big Data Meetup

41

Fault Recovery

• Input from TCP, Flume or Kafka is stored on 2 nodes
• In case of failure: missing RDDs will be re-computed from the surviving nodes
• RDDs are deterministic
  • So any computation will lead to the same result
• Transformations can guarantee exactly-once semantics
  • Even through failure

Page 42: Intro to Spark - for Denver Big Data Meetup

42

Key Question: How fast can the system recover?

Page 43: Intro to Spark - for Denver Big Data Meetup

43

Example – Streaming WordCount

import org.apache.spark.streaming.{Seconds, StreamingContext}
import StreamingContext._
...

// Create the context and set up a network input stream
val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1))
val lines = ssc.socketTextStream(args(1), args(2).toInt)

// Split the lines into words, count them,
// and print some of the counts on the master
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()

// Start the computation
ssc.start()
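In practice the driver also needs to block so the streaming job keeps running; a one-line addition (not on the original slide, but a standard StreamingContext call):

ssc.awaitTermination()   // keep the driver alive until the stream is stopped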

Page 44: Intro to Spark - for Denver Big Data Meetup

44

Shark

Page 45: Intro to Spark - for Denver Big Data Meetup

45

Shark Architecture

• Identical to Hive
  • Same CLI, JDBC, SQL parser, Metastore
• Replaced the optimizer, plan generator and the execution engine
• Added a Cache Manager
• Generates Spark code instead of MapReduce

Page 46: Intro to Spark - for Denver Big Data Meetup

46

Hive Compatibility

• MetaStore
• HQL
• UDF / UDAF
• SerDes
• Scripts

Page 47: Intro to Spark - for Denver Big Data Meetup

47

Dynamic Query Plans

• Hive metadata often lacks statistics
• Join types often require hinting

• Shark gathers statistics per partition
  • While materializing map output
  • Partition sizes, record count, skew, histograms
• Alters the plan accordingly

Page 48: Intro to Spark - for Denver Big Data Meetup

48

Columnar Memory Store

• Better compression
• CPU efficiency
• Cache locality

Page 49: Intro to Spark - for Denver Big Data Meetup

49

Spark + Shark Integration

val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid = u.uid")

val features = users.mapRows { row =>
  new Vector(extractFeature1(row.getInt("age")),
             extractFeature2(row.getStr("country")), ...)
}

val trainedVector = logRegress(features.cache())

Page 50: Intro to Spark - for Denver Big Data Meetup

50

Summary

Page 51: Intro to Spark - for Denver Big Data Meetup

51

Why Spark?

• Flexible
• High performance
• Machine learning, iterative algorithms
• Interactive data explorations
• Developer productivity

Page 52: Intro to Spark - for Denver Big Data Meetup

52

Why not Spark?

• Still immature
• Uses *lots* of memory
• Equivalent functionality in Impala, Storm, etc.

Page 53: Intro to Spark - for Denver Big Data Meetup

53

How Spark Works

• RDDs – Resilient Distributed Datasets
• Lazy transformations
• Fault-tolerant caching
• Streams – micro-batches of RDDs

Page 54: Intro to Spark - for Denver Big Data Meetup

54