Spark: Next Generation Cloud Computing Engine
Wisely Chen
OSDC.tw 2014, Taiwan (Aug 27, 2014)
Transcript
Page 1

Spark: Next Generation Cloud Computing Engine

Wisely Chen

Page 2

Agenda

• What is Spark?

• Next big thing

• How to use Spark?

• Demo

• Q&A

Page 3

Who am I?

• Wisely Chen ( [email protected] )

• Sr. Engineer on the Yahoo! Taiwan data team

• Loves to promote open source tech

• Hadoop Summit 2013 San Jose

• Jenkins Conf 2013 Palo Alto

• COSCUP 2006, 2012, 2013, OSDC 2007, WebConf 2013, PHPConf 2012, RubyConf 2012

Page 4

Taiwan Data Team

[Diagram of the team's functions:]

• Data Highway
• BI Report
• Serving API
• Data Mart
• ETL / Forecast
• Machine Learning

Page 5

Machine Learning

Distributed Computing

Big Data

Page 6

Recommendation

Forecast

Page 7

HADOOP

Page 8

Faster ML

Distributed Computing

Bigger Big Data

Page 9

Opinion from Cloudera

• The leading candidate for “successor to MapReduce” today is Apache Spark

• No vendor — no new project — is likely to catch up. Chasing Spark would be a waste of time, and would delay availability of real-time analytic and processing services for no good reason.

• From http://0rz.tw/y3OfM

Page 10

What is Spark

• From UC Berkeley AMP Lab

• The most active big data open source project since Hadoop

Page 11

Where is Spark?

Page 12

[Diagram: the Hadoop 2.0 stack. HDFS at the bottom, YARN above it, and MapReduce, Storm, HBase, and other engines running on top.]

Page 13

Hadoop Architecture

[Diagram mapping each layer to its role:]

• HDFS: Storage
• YARN: Resource Management
• MapReduce: Computing Engine
• Hive: SQL

Page 14

Hadoop vs Spark

[Diagram: both stacks share HDFS and YARN; MapReduce vs Spark as the computing engine, and Hive vs Shark as the SQL layer.]

Page 15

Spark vs Hadoop

• Spark runs on YARN, Mesos, or in standalone mode

• Spark’s main concept is based on MapReduce

• Spark can read from

• HDFS: data locality

• HBase

• Cassandra
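Reading from HDFS is a one-liner, and Spark schedules tasks on the nodes that hold each block. A sketch, assuming the spark-shell's predefined sc and a hypothetical path:

val data = sc.textFile("hdfs://namenode/path/to/data")   // one partition per HDFS block
data.take(5).foreach(println)                            // preview a few records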

Page 16

More than MapReduce

[Diagram: the Spark stack and its counterparts in the Hadoop world:]

• Spark Core: MapReduce
• Shark: Hive
• GraphX: Pregel
• MLlib: Mahout
• Spark Streaming: Storm

All running on a resource management system (YARN, Mesos) over HDFS.

Page 17

Why Spark?

Page 18

"In all martial arts under heaven, nothing is unbreakable; only speed cannot be broken." (天下武功,無堅不破,惟快不破)

Page 19

3x~25x faster than the MapReduce framework

From Matei's paper: http://0rz.tw/VVqgP

[Bar charts of running time in seconds, MapReduce (MR) vs Spark:]

• Logistic regression: MR 76s, Spark 3s
• KMeans: MR 106s, Spark 33s
• PageRank: MR 171s, Spark 23s

Page 20

What is Spark

• Apache Spark™ is a very fast and general engine for large-scale data processing

Page 21

Why is Spark so fast?

Page 22

HDFS

• ~100x slower than memory

• Stores data across network + disk

• Network speed is ~100x slower than memory

• Implements fault tolerance

Page 23

MapReduce PageRank

// ...read input from HDFS...
for (int runs = 0; runs < iter_runnumber; runs++) {
    // ...
    isCompleted = runRankCalculation(inPath, lastResultPath);
    // ...
}
// ...write output to HDFS...
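For contrast, a minimal sketch of the same iterative loop in Spark: the intermediate ranks stay in memory as RDDs between iterations instead of going back to HDFS. This is not the slides' code; the names, the input format ("url neighbor" per line), and the fixed iteration count are illustrative, and sc is the spark-shell's predefined SparkContext.

val links = sc.textFile("hdfs://...")
              .map { line => val p = line.split("\\s+"); (p(0), p(1)) }
              .groupByKey()
              .cache()                              // the link table is reused every iteration
var ranks = links.mapValues(_ => 1.0)               // initial rank for each url

for (i <- 1 to 10) {                                // fixed iteration count, for simplicity
  val contribs = links.join(ranks).values.flatMap { case (neighbors, rank) =>
    neighbors.map(n => (n, rank / neighbors.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}

ranks.saveAsTextFile("hdfs://...")                  // only the final result touches HDFS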

Page 24

Workflow

MapReduce: Input (HDFS) → Iter 1 RunRank → Tmp (HDFS) → Iter 2 RunRank → Tmp (HDFS) → ... → Iter N RunRank

Spark: Input (HDFS) → Iter 1 RunRank → Tmp (Mem) → Iter 2 RunRank → Tmp (Mem) → ... → Iter N RunRank

Page 25

PageRank on 1 billion URL records:

• 1st iteration: 200 sec
• 2nd iteration: 20 sec
• 3rd iteration: 20 sec

Page 26

RDD

• Resilient Distributed Dataset

• Collections of objects spread across a cluster, stored in RAM or on Disk

• Built through parallel transformations
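A minimal sketch of "built through parallel transformations" (assuming the spark-shell's predefined sc; the names are illustrative):

val nums  = sc.parallelize(1 to 1000000)   // an RDD, partitioned across the cluster
val evens = nums.filter(_ % 2 == 0)        // a new RDD, derived by a transformation
evens.persist()                            // keep the computed partitions in RAM for reuse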

Page 27

Fault Tolerance

"In all martial arts under heaven, nothing is unbreakable; only speed cannot be broken." (天下武功,無堅不破,惟快不破)
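The slide does not spell out the mechanism, but RDD fault tolerance comes from lineage: each RDD remembers the chain of transformations that produced it, so a lost partition is recomputed from its parents rather than restored from replicas. The API exposes this; a sketch with a hypothetical path:

val err = sc.textFile("hdfs://.../logs").filter(_.contains("ERROR"))
println(err.toDebugString)   // prints the lineage Spark would replay after a failure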

Page 28

RDD

[Diagram: RDD a → transformation → RDD b → action → value c]

val a = sc.textFile("hdfs://...")                    // RDD a
val b = a.filter( line => line.contains("Spark") )   // transformation: RDD b
val c = b.count()                                    // action: value c

Page 29

Log mining

val a = sc.textFile("hdfs://aaa.com/a.txt")
val err = a.filter( t => t.contains("ERROR") )
           .filter( t => t.contains("2014") )

err.cache()
err.count()

val m = err.filter( t => t.contains("MYSQL") ).count()
val a = err.filter( t => t.contains("APACHE") ).count()   // re-declaring a is fine in the spark-shell

[Diagram: a driver dispatching tasks to three workers.]

Page 30

[Same code. Each worker loads one HDFS block (Block1, Block2, Block3) as its partition of RDD a.]

Page 31

[Same code. The two filters derive each worker's partition of RDD err; the blocks stay on disk.]

Page 32

[Same code. err.count() runs as tasks over the workers' partitions of RDD err.]

Page 33

[Same code. After err.cache(), each worker keeps its partition of RDD err in memory (Cache1, Cache2, Cache3).]

Page 34

[Same code. RDD m (the MYSQL filter) is computed directly from the cached partitions.]

Page 35

[Same code. RDD a (the APACHE filter) is likewise computed from the cached partitions.]

Page 36

RDD Cache

• 1st iteration (no cache): takes the same time
• With cache: takes 7 sec

Page 37

RDD Cache

• Data locality
• Cache

Self join on 5 billion records: the big shuffle takes 20 min; after cache, it takes only 265 ms.
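A hedged sketch of the pattern behind those numbers: pay the expensive shuffle once, cache the result, and let later actions read from memory. The dataset, key choice, and path are made up for illustration.

val pairs  = sc.textFile("hdfs://.../records")
               .map(line => (line.split(",")(0), line))   // key each record by its first column
val joined = pairs.join(pairs)                            // self join: the big shuffle happens here
joined.cache()
joined.count()                                            // the first action pays for the shuffle
joined.count()                                            // later actions hit the in-memory cache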

Page 38

Easy to use

• Interactive Shell

• Multi Language API

• JVM: Scala, Java

• PySpark: Python
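For example, bin/spark-shell starts a Scala REPL with a SparkContext already bound to sc, so exploration is a one-liner (the path below is hypothetical):

scala> val logs = sc.textFile("hdfs://.../logs")   // sc comes predefined in the shell
scala> logs.filter(_.contains("ERROR")).count()    // runs on the cluster, returns a Long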

Page 39

Scala Word Count

• val file = spark.textFile("hdfs://...")

• val counts = file.flatMap(line => line.split(" "))

• .map(word => (word, 1))

• .reduceByKey(_ + _)

• counts.saveAsTextFile("hdfs://...")

Page 40

Step by Step

• file.flatMap(line => line.split(" ")) => (aaa,bb,cc)

• .map(word => (word, 1)) => ((aaa,1),(bb,1)..)

• .reduceByKey(_ + _) => ((aaa,123),(bb,23)…)

Page 41

Java WordCount

JavaRDD<String> file = spark.textFile("hdfs://...");

JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});

JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
  public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});

JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});

counts.saveAsTextFile("hdfs://...");

Page 42

Java vs Scala

• Scala: file.flatMap(line => line.split(" "))

• Java version:

  JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
    public Iterable<String> call(String s) {
      return Arrays.asList(s.split(" "));
    }
  });

Page 43

Python

• file = spark.textFile("hdfs://...")

• counts = file.flatMap(lambda line: line.split(" ")) \

• .map(lambda word: (word, 1)) \

• .reduceByKey(lambda a, b: a + b)

• counts.saveAsTextFile("hdfs://...")

Page 44

Highly Recommend

• Scala: latest API features, stable

• Python

• very familiar language

• Native Lib: NumPy, SciPy

Page 45

FYI

• Combiner: reduceByKey(_ + _)

• Typical WordCount:

  groupByKey().mapValues { arr =>
    var r = 0; arr.foreach { i => r += i }; r
  }
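Side by side as runnable lines (a sketch assuming the spark-shell's sc and a hypothetical input path), the difference is:

val pairs = sc.textFile("hdfs://...").flatMap(_.split(" ")).map((_, 1))

val withCombiner = pairs.reduceByKey(_ + _)            // partial sums are computed map-side first
val withoutOne   = pairs.groupByKey().mapValues(_.sum) // every (word, 1) pair crosses the network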

Page 46

WordCount

• reduceByKey: combines a lot on the map side
• Hadoop-style shuffle (groupByKey): sends a lot of data over the network

Page 47

DEMO

Page 48

• Check in on FB with the Yahoo! recruiting message to get a Yahoo! bath duck

• Check in on FB saying "The Yahoo! APP is awesome!!" with a screenshot of the Yahoo! Super Mall or News app, and redeem the check-in for a duck wrist rest or a shopping bag

Page 49

Just memory?

• From Matei's paper: http://0rz.tw/VVqgP

• HBM: stores data in an in-memory HDFS instance.

• SP : Spark

• HBM’1, SP’1 : first run

• Storage: HDFS with 256 MB blocks

• Node information

• m1.xlarge EC2 nodes

• 4 cores

• 15 GB of RAM

Page 50

100 GB of data on a 100-node cluster

[Bar charts of running time in seconds; HBM = in-memory HDFS, SP = Spark, '1 = first run:]

• Logistic regression: HBM'1 139s, HBM 62s, SP'1 46s, SP 3s
• KMeans: HBM'1 182s, HBM 87s, SP'1 82s, SP 33s

Page 51

There is more

• General DAG scheduler

• Control over partition shuffling (see the sketch below)

• Fast driver RPC to launch tasks

• For more info, check http://0rz.tw/jwYwI
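"Control over partition shuffling" means choosing how a pair RDD is partitioned; a hedged sketch (hypothetical path, arbitrary partition count):

import org.apache.spark.HashPartitioner

val pairs  = sc.textFile("hdfs://...").map(line => (line, 1))
val parted = pairs.partitionBy(new HashPartitioner(16)).cache()   // fix the data layout once
// Later joins against parted that use the same partitioner avoid a full reshuffle.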
