Spark: Next Generation Cloud Computing Engine
Wisely Chen
OSDC.tw 2014, Taiwan (Aug 27, 2014)
Transcript
Page 1

Spark: Next Generation Cloud Computing Engine

Wisely Chen

Page 2

Agenda

• What is Spark?

• Next big thing

• How to use Spark?

• Demo

• Q&A

Page 3

Who am I?

• Wisely Chen ( [email protected] )

• Sr. Engineer on the Yahoo! Taiwan data team

• Loves to promote open source tech

• Hadoop Summit 2013 San Jose

• Jenkins Conf 2013 Palo Alto

• COSCUP 2006, 2012, 2013, OSDC 2007, WebConf 2013, PHPConf 2012, RubyConf 2012

Page 4

Taiwan Data Team

[Diagram of the team's functions:]

• Data Highway
• BI Report
• Serving API
• Data Mart
• ETL / Forecast
• Machine Learning

Page 5

Machine Learning

Distributed Computing

Big Data

Page 6

Recommendation

Forecast

Page 7

HADOOP

Page 8

Faster ML

Distributed Computing

Bigger Big Data

Page 9

Opinion from Cloudera

• The leading candidate for “successor to MapReduce” today is Apache Spark

• No vendor — no new project — is likely to catch up. Chasing Spark would be a waste of time, and would delay availability of real-time analytic and processing services for no good reason.

• From http://0rz.tw/y3OfM

Page 10

What is Spark

• From UC Berkeley AMP Lab

• The most active big data open source project since Hadoop

Page 11

Where is Spark?

Page 12

[Diagram: the Hadoop 2.0 stack. HDFS at the bottom, YARN above it, and MapReduce, Storm, HBase, and other engines running on top.]

Page 13

Hadoop Architecture

[Diagram mapping each layer to its role:]

• HDFS: Storage
• YARN: Resource Management
• MapReduce: Computing Engine
• Hive: SQL

Page 14

Hadoop vs Spark

[Diagram: both stacks share HDFS and YARN; MapReduce vs Spark as the computing engine, and Hive vs Shark as the SQL layer.]

Page 15

Spark vs Hadoop

• Spark runs on YARN, Mesos, or in standalone mode

• Spark’s main concept is based on MapReduce

• Spark can read from

• HDFS: data locality

• HBase

• Cassandra
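Reading from HDFS is a one-liner, and Spark schedules tasks on the nodes that hold each block. A sketch, assuming the spark-shell's predefined sc and a hypothetical path:

val data = sc.textFile("hdfs://namenode/path/to/data")   // one partition per HDFS block
data.take(5).foreach(println)                            // preview a few records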

Page 16

More than MapReduce

[Diagram: the Spark stack and its counterparts in the Hadoop world:]

• Spark Core: MapReduce
• Shark: Hive
• GraphX: Pregel
• MLlib: Mahout
• Spark Streaming: Storm

All running on a resource management system (YARN, Mesos) over HDFS.

Page 17

Why Spark?

Page 18

"In all martial arts under heaven, nothing is unbreakable; only speed cannot be broken." (天下武功,無堅不破,惟快不破)

Page 19

3x~25x faster than the MapReduce framework

From Matei's paper: http://0rz.tw/VVqgP

[Bar charts of running time in seconds, MapReduce (MR) vs Spark:]

• Logistic regression: MR 76s, Spark 3s
• KMeans: MR 106s, Spark 33s
• PageRank: MR 171s, Spark 23s

Page 20

What is Spark

• Apache Spark™ is a very fast and general engine for large-scale data processing

Page 21

Why is Spark so fast?

Page 22

HDFS

• ~100x slower than memory

• Stores data across network + disk

• Network speed is ~100x slower than memory

• Implements fault tolerance

Page 23

MapReduce PageRank

// ...read input from HDFS...
for (int runs = 0; runs < iter_runnumber; runs++) {
    // ...
    isCompleted = runRankCalculation(inPath, lastResultPath);
    // ...
}
// ...write output to HDFS...
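For contrast, a minimal sketch of the same iterative loop in Spark: the intermediate ranks stay in memory as RDDs between iterations instead of going back to HDFS. This is not the slides' code; the names, the input format ("url neighbor" per line), and the fixed iteration count are illustrative, and sc is the spark-shell's predefined SparkContext.

val links = sc.textFile("hdfs://...")
              .map { line => val p = line.split("\\s+"); (p(0), p(1)) }
              .groupByKey()
              .cache()                              // the link table is reused every iteration
var ranks = links.mapValues(_ => 1.0)               // initial rank for each url

for (i <- 1 to 10) {                                // fixed iteration count, for simplicity
  val contribs = links.join(ranks).values.flatMap { case (neighbors, rank) =>
    neighbors.map(n => (n, rank / neighbors.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}

ranks.saveAsTextFile("hdfs://...")                  // only the final result touches HDFS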

Page 24

Workflow

MapReduce: Input (HDFS) → Iter 1 RunRank → Tmp (HDFS) → Iter 2 RunRank → Tmp (HDFS) → ... → Iter N RunRank

Spark: Input (HDFS) → Iter 1 RunRank → Tmp (Mem) → Iter 2 RunRank → Tmp (Mem) → ... → Iter N RunRank

Page 25

PageRank on 1 billion URL records:

• 1st iteration: 200 sec
• 2nd iteration: 20 sec
• 3rd iteration: 20 sec

Page 26

RDD

• Resilient Distributed Dataset

• Collections of objects spread across a cluster, stored in RAM or on Disk

• Built through parallel transformations
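A minimal sketch of "built through parallel transformations" (assuming the spark-shell's predefined sc; the names are illustrative):

val nums  = sc.parallelize(1 to 1000000)   // an RDD, partitioned across the cluster
val evens = nums.filter(_ % 2 == 0)        // a new RDD, derived by a transformation
evens.persist()                            // keep the computed partitions in RAM for reuse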

Page 27

Fault Tolerance

"In all martial arts under heaven, nothing is unbreakable; only speed cannot be broken." (天下武功,無堅不破,惟快不破)
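The slide does not spell out the mechanism, but RDD fault tolerance comes from lineage: each RDD remembers the chain of transformations that produced it, so a lost partition is recomputed from its parents rather than restored from replicas. The API exposes this; a sketch with a hypothetical path:

val err = sc.textFile("hdfs://.../logs").filter(_.contains("ERROR"))
println(err.toDebugString)   // prints the lineage Spark would replay after a failure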

Page 28

RDD

[Diagram: RDD a → transformation → RDD b → action → value c]

val a = sc.textFile("hdfs://...")                    // RDD a
val b = a.filter( line => line.contains("Spark") )   // transformation: RDD b
val c = b.count()                                    // action: value c

Page 29

Log mining

val a = sc.textFile("hdfs://aaa.com/a.txt")
val err = a.filter( t => t.contains("ERROR") )
           .filter( t => t.contains("2014") )

err.cache()
err.count()

val m = err.filter( t => t.contains("MYSQL") ).count()
val a = err.filter( t => t.contains("APACHE") ).count()   // re-declaring a is fine in the spark-shell

[Diagram: a driver dispatching tasks to three workers.]

Page 30

[Same code. Each worker loads one HDFS block (Block1, Block2, Block3) as its partition of RDD a.]

Page 31

[Same code. The two filters derive each worker's partition of RDD err; the blocks stay on disk.]

Page 32

[Same code. err.count() runs as tasks over the workers' partitions of RDD err.]

Page 33

[Same code. After err.cache(), each worker keeps its partition of RDD err in memory (Cache1, Cache2, Cache3).]

Page 34

[Same code. RDD m (the MYSQL filter) is computed directly from the cached partitions.]

Page 35

[Same code. RDD a (the APACHE filter) is likewise computed from the cached partitions.]

Page 36

RDD Cache

• 1st iteration (no cache): takes the same time
• With cache: takes 7 sec

Page 37

RDD Cache

• Data locality
• Cache

Self join on 5 billion records: the big shuffle takes 20 min; after cache, it takes only 265 ms.
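A hedged sketch of the pattern behind those numbers: pay the expensive shuffle once, cache the result, and let later actions read from memory. The dataset, key choice, and path are made up for illustration.

val pairs  = sc.textFile("hdfs://.../records")
               .map(line => (line.split(",")(0), line))   // key each record by its first column
val joined = pairs.join(pairs)                            // self join: the big shuffle happens here
joined.cache()
joined.count()                                            // the first action pays for the shuffle
joined.count()                                            // later actions hit the in-memory cache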

Page 38

Easy to use

• Interactive Shell

• Multi Language API

• JVM: Scala, Java

• PySpark: Python
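For example, bin/spark-shell starts a Scala REPL with a SparkContext already bound to sc, so exploration is a one-liner (the path below is hypothetical):

scala> val logs = sc.textFile("hdfs://.../logs")   // sc comes predefined in the shell
scala> logs.filter(_.contains("ERROR")).count()    // runs on the cluster, returns a Long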

Page 39

Scala Word Count

• val file = spark.textFile("hdfs://...")

• val counts = file.flatMap(line => line.split(" "))

• .map(word => (word, 1))

• .reduceByKey(_ + _)

• counts.saveAsTextFile("hdfs://...")

Page 40

Step by Step

• file.flatMap(line => line.split(" ")) => (aaa,bb,cc)

• .map(word => (word, 1)) => ((aaa,1),(bb,1)..)

• .reduceByKey(_ + _) => ((aaa,123),(bb,23)…)

Page 41

Java WordCount

JavaRDD<String> file = spark.textFile("hdfs://...");

JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});

JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
  public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});

JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});

counts.saveAsTextFile("hdfs://...");

Page 42

Java vs Scala

• Scala: file.flatMap(line => line.split(" "))

• Java version:

  JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
    public Iterable<String> call(String s) {
      return Arrays.asList(s.split(" "));
    }
  });

Page 43

Python

• file = spark.textFile("hdfs://...")

• counts = file.flatMap(lambda line: line.split(" ")) \

• .map(lambda word: (word, 1)) \

• .reduceByKey(lambda a, b: a + b)

• counts.saveAsTextFile("hdfs://...")

Page 44

Highly Recommend

• Scala: latest API features, stable

• Python

• very familiar language

• Native Lib: NumPy, SciPy

Page 45

FYI

• Combiner: reduceByKey(_ + _)

• Typical WordCount:

  groupByKey().mapValues { arr =>
    var r = 0; arr.foreach { i => r += i }; r
  }
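Side by side as runnable lines (a sketch assuming the spark-shell's sc and a hypothetical input path), the difference is:

val pairs = sc.textFile("hdfs://...").flatMap(_.split(" ")).map((_, 1))

val withCombiner = pairs.reduceByKey(_ + _)            // partial sums are computed map-side first
val withoutOne   = pairs.groupByKey().mapValues(_.sum) // every (word, 1) pair crosses the network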

Page 46

WordCount

• reduceByKey: combines a lot on the map side
• Hadoop-style shuffle (groupByKey): sends a lot of data over the network

Page 47

DEMO

Page 48

• Check in on FB with the Yahoo! recruiting message to get a Yahoo! bath duck

• Check in on FB saying "The Yahoo! APP is awesome!!" with a screenshot of the Yahoo! Super Mall or News app, and redeem the check-in for a duck wrist rest or a shopping bag

Page 49

Just memory?

• From Matei's paper: http://0rz.tw/VVqgP

• HBM: stores data in an in-memory HDFS instance.

• SP : Spark

• HBM’1, SP’1 : first run

• Storage: HDFS with 256 MB blocks

• Node information

• m1.xlarge EC2 nodes

• 4 cores

• 15 GB of RAM

Page 50

100 GB of data on a 100-node cluster

[Bar charts of running time in seconds; HBM = in-memory HDFS, SP = Spark, '1 = first run:]

• Logistic regression: HBM'1 139s, HBM 62s, SP'1 46s, SP 3s
• KMeans: HBM'1 182s, HBM 87s, SP'1 82s, SP 33s

Page 51

There is more

• General DAG scheduler

• Control over partition shuffling (see the sketch below)

• Fast driver RPC to launch tasks

• For more info, check http://0rz.tw/jwYwI
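"Control over partition shuffling" means choosing how a pair RDD is partitioned; a hedged sketch (hypothetical path, arbitrary partition count):

import org.apache.spark.HashPartitioner

val pairs  = sc.textFile("hdfs://...").map(line => (line, 1))
val parted = pairs.partitionBy(new HashPartitioner(16)).cache()   // fix the data layout once
// Later joins against parted that use the same partitioner avoid a full reshuffle.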
