event.cwi.nl/lsde Large-Scale Data Engineering Spark and MLLIB

Aug 19, 2018


buidat

event.cwi.nl/lsde

Large-Scale Data Engineering

Spark and MLLIB


OVERVIEW OF SPARK


What is Spark?

• Fast and expressive cluster computing system interoperable with Apache Hadoop

• Improves efficiency through:

– In-memory computing primitives

– General computation graphs

• Improves usability through:

– Rich APIs in Scala, Java, Python

– Interactive shell

Up to 100× faster (2-10× on disk)

Often 5× less code


The Spark Stack

• Spark is the basis of a wide set of projects in the Berkeley Data Analytics Stack (BDAS):

– Spark Streaming (real-time)
– GraphX (graph)
– Spark SQL
– MLlib (machine learning)

More details: amplab.berkeley.edu


Why a New Programming Model?

• MapReduce greatly simplified big data analysis

• But as soon as it got popular, users wanted more:

– More complex, multi-pass analytics (e.g. ML, graph)

– More interactive ad-hoc queries

– More real-time stream processing

• All 3 need faster data sharing across parallel jobs


Data Sharing in MapReduce

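[Figure: iterative jobs (iter. 1, iter. 2, . . .) read their input from HDFS and write their output back to HDFS at every step; interactive queries (query 1, query 2, query 3, . . .) each read the input from HDFS again to produce result 1, result 2, result 3.]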

Slow due to replication, serialization, and disk IO


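Data Sharing in Spark

[Figure: iterations (iter. 1, iter. 2, . . .) pass data through distributed memory instead of HDFS; the input is processed one time into distributed memory, which then serves query 1, query 2, query 3, . . .]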

~10× faster than network and disk


Spark Programming Model

• Key idea: resilient distributed datasets (RDDs)

– Distributed collections of objects that can be cached in memory across the cluster

– Manipulated through parallel operators

– Automatically recomputed on failure

• Programming interface

– Functional APIs in Scala, Java, Python

– Interactive use from Scala shell


Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")                    # base RDD
errors = lines.filter(lambda x: x.startswith("ERROR"))  # transformed RDD
messages = errors.map(lambda x: x.split('\t')[2])
messages.cache()

[Figure: a driver connected to three workers]


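Lambda Functions

Lambda functions: functional programming!

A lambda is an implicit (anonymous) function definition, written inline where it is used:

errors = lines.filter(lambda x: x.startswith("ERROR"))
messages = errors.map(lambda x: x.split('\t')[2])

The first lambda does the same job as this named function (C-style pseudocode):

bool detect_error(string x) {
    return x.startswith("ERROR");
}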
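The equivalence between a lambda and a named function can be checked without a cluster. The following is a minimal sketch in plain Python, using the built-in filter and map on a list rather than Spark RDDs; the sample log lines are hypothetical.

```python
# Hypothetical sample log lines (tab-separated: level, timestamp, message).
lines = [
    "ERROR\t2018-08-19\tdisk full",
    "INFO\t2018-08-19\tjob started",
    "ERROR\t2018-08-19\ttimeout",
]

def detect_error(x):
    """Named equivalent of `lambda x: x.startswith("ERROR")`."""
    return x.startswith("ERROR")

# Both forms select the same lines; the lambda only skips giving the function a name.
errors_named = list(filter(detect_error, lines))
errors_lambda = list(filter(lambda x: x.startswith("ERROR"), lines))

# Extract the third tab-separated field, as on the slide.
messages = list(map(lambda x: x.split("\t")[2], errors_lambda))
print(messages)
```

The same two callables could be passed to an RDD's filter and map; only the collection type changes.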


Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")                    # base RDD
errors = lines.filter(lambda x: x.startswith("ERROR"))  # transformed RDD
messages = errors.map(lambda x: x.split('\t')[2])
messages.cache()

messages.filter(lambda x: "foo" in x).count()           # action
messages.filter(lambda x: "bar" in x).count()
. . .

[Figure: the driver sends tasks to three workers; each worker reads one block of the file (Block 1, Block 2, Block 3), keeps its part of messages in a cache (Cache 1, Cache 2, Cache 3), and sends results back to the driver]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
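Why the second query is cheaper than the first can be illustrated with a toy model. The sketch below is a single-machine stand-in written for this note, not the Spark API: MiniRDD, read_log, and the sample lines are all hypothetical. It shows that transformations are lazy and that, once cache() is set, the source is read only once across several actions.

```python
class MiniRDD:
    """Toy, single-machine stand-in for an RDD (not the Spark API)."""

    def __init__(self, compute):
        self._compute = compute   # function that produces the elements on demand
        self._cached = None       # filled by the first action after cache()
        self._use_cache = False

    def filter(self, pred):
        # Lazy: nothing runs until an action calls collect().
        return MiniRDD(lambda: [x for x in self.collect() if pred(x)])

    def map(self, fn):
        return MiniRDD(lambda: [fn(x) for x in self.collect()])

    def cache(self):
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        data = self._compute()
        if self._use_cache:
            self._cached = data
        return data

    def count(self):
        return len(self.collect())


reads = {"n": 0}  # counts how often the "file" is actually read

def read_log():
    reads["n"] += 1
    return ["ERROR\tt1\tfoo failed", "INFO\tt2\tok",
            "ERROR\tt3\tbar failed", "ERROR\tt4\tfoo again"]

lines = MiniRDD(read_log)
errors = lines.filter(lambda x: x.startswith("ERROR"))
messages = errors.map(lambda x: x.split("\t")[2]).cache()

foo_count = messages.filter(lambda x: "foo" in x).count()
bar_count = messages.filter(lambda x: "bar" in x).count()
print(foo_count, bar_count, reads["n"])
```

Without the cache() call, each count() would walk back to read_log and read the source again, which is the on-disk behaviour the slide compares against.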


Fault Tolerance

RDDs track lineage info to rebuild lost data

(file.map(lambda rec: (rec.type, 1))
     .reduceByKey(lambda x, y: x + y)
     .filter(lambda kv: kv[1] > 10))   # kv = (type, count)

[Figure: lineage graph from the input file through map, reduce, and filter]
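The idea of rebuilding lost data from lineage can be sketched in plain Python. This is an illustration under simplifying assumptions, not Spark's mechanism: the lineage is a list of per-partition functions, only map and filter steps are modelled (the shuffle behind reduceByKey is left out), and the records are hypothetical.

```python
# Hypothetical input, split into two partitions of (type, payload) records.
input_partitions = [
    [("click", "a"), ("view", "b"), ("click", "c")],
    [("click", "d"), ("view", "e")],
]

# Lineage: the ordered per-partition steps that produce the derived data.
lineage = [
    lambda part: [(rec_type, 1) for rec_type, _ in part],     # map
    lambda part: [(t, c) for t, c in part if t == "click"],   # filter
]

def run_lineage(partition, steps):
    """Apply each recorded step, in order, to one input partition."""
    for step in steps:
        partition = step(partition)
    return partition

# Materialize every derived partition once.
derived = [run_lineage(p, lineage) for p in input_partitions]
original = list(derived[1])

# Simulate losing partition 1 (for example, its worker failed) ...
derived[1] = None
# ... and rebuild only that partition from the input plus the lineage.
derived[1] = run_lineage(input_partitions[1], lineage)
print(derived)
```

Only the lost partition is recomputed; partition 0 is untouched, and no copy of the derived data had to be replicated in advance.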



Example: Logistic Regression

[Figure: running time (s), 0 to 4000, against number of iterations (1, 5, 10, 20, 30), for Hadoop and Spark]

Hadoop: 110 s / iteration

Spark: first iteration 80 s, further iterations 1 s
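The per-iteration figures translate into total running times with simple arithmetic. The sketch below assumes the quoted costs hold exactly for every iteration, which is a simplification of the measured curve; the function names are made up for this example.

```python
HADOOP_PER_ITER_S = 110   # from the slide
SPARK_FIRST_ITER_S = 80   # from the slide (loads the data into memory)
SPARK_LATER_ITER_S = 1    # from the slide (works on cached data)

def hadoop_total(n):
    """Total seconds for n iterations when every iteration re-reads from disk."""
    return HADOOP_PER_ITER_S * n

def spark_total(n):
    """Total seconds for n iterations when only the first one pays the load cost."""
    return SPARK_FIRST_ITER_S + SPARK_LATER_ITER_S * (n - 1)

for n in (1, 5, 10, 20, 30):
    print(n, hadoop_total(n), spark_total(n))
```

Under this model, 30 iterations cost 3300 s on Hadoop against 109 s on Spark, so the gap grows with the number of passes over the data.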


Spark in Scala and Java

// Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

// Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) { return s.contains("ERROR"); }
}).count();


Supported Operators

• map
• filter
• groupBy
• sort
• union
• join
• leftOuterJoin
• rightOuterJoin
• reduce
• count
• fold
• reduceByKey
• groupByKey
• cogroup
• cross
• zip
• sample
• take
• first
• partitionBy
• mapWith
• pipe
• save
• ...


Software Components

• The Spark client is a library in the user program (1 instance per app)

• Runs tasks locally or on a cluster

– Mesos, YARN, standalone mode

• Accesses storage systems via the Hadoop InputFormat API

– Can use HBase, HDFS, S3, …

[Figure: your application holds a SparkContext, which runs tasks on local threads or, through a cluster manager, on workers that each run a Spark executor; the executors access HDFS or other storage]


Task Scheduler

• General task graphs

• Automatically pipelines functions

• Data locality aware

• Partitioning aware to avoid shuffles

[Figure: task graph over RDDs A to F built from map, join, filter, and groupBy, divided into Stage 1, Stage 2, and Stage 3; the legend marks RDDs and cached partitions]


Spark SQL

• Columnar SQL analytics engine for Spark

– Supports both SQL and complex analytics

– Up to 100× faster than Apache Hive

• Compatible with Apache Hive

– HiveQL, UDF/UDAF, SerDes, Scripts

– Runs on existing Hive warehouses

• In use at Yahoo! for fast in-memory OLAP


Hive Architecture

[Figure: clients (CLI, JDBC) talk to a driver made up of a SQL parser, a query optimizer, and physical plan execution; the driver consults the Hive catalog and executes on MapReduce over HDFS]


Spark SQL Architecture

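[Figure: the same layout as Hive, with clients (CLI, JDBC), a driver (SQL parser, query optimizer, physical plan execution) plus a cache manager, and the Hive catalog; execution runs on Spark instead of MapReduce, over HDFS]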

[Engle et al, SIGMOD 2012]


What Makes it Faster?

• Lower-latency engine (Spark OK with 0.5s jobs)

• Support for general DAGs

• Column-oriented storage and compression

• New optimizations (e.g. map pruning)


Other Spark Stack Projects

• Spark Streaming: stateful, fault-tolerant stream processing (out since Spark 0.7)

sc.twitterStream(...)
  .flatMap(_.getText.split(" "))
  .map(word => (word, 1))
  .reduceByWindow("5s", _ + _)

• MLlib: Library of high-quality machine learning algorithms (out since 0.8)


Performance

[Figure, SQL: response time (s), 0 to 25, for Impala (disk), Impala (mem), Redshift, Spark SQL (disk), and Spark SQL (mem)]

[Figure, Streaming: throughput (MB/s/node), 0 to 35, for Storm and Spark]

[Figure, Graph: response time (min), 0 to 30, for Hadoop, Giraph, and GraphX]


What it Means for Users

• Separate frameworks: … HDFS read → ETL → HDFS write → HDFS read → train → HDFS write → HDFS read → query → HDFS write

• Spark: HDFS read → ETL → train → query


Conclusion

• Big data analytics is evolving to include:

– More complex analytics (e.g. machine learning)

– More interactive ad-hoc queries

– More real-time stream processing

• Spark is a fast platform that unifies these apps

• More info: spark-project.org


SPARK MLLIB


What is MLLIB?

MLlib is a Spark subproject providing machine learning primitives:

• initial contribution from AMPLab, UC Berkeley

• shipped with Spark since version 0.8


What is MLLIB?

Algorithms:

• classification: logistic regression, linear support vector machine (SVM), naive Bayes

• regression: generalized linear regression (GLM)

• collaborative filtering: alternating least squares (ALS)

• clustering: k-means

• decomposition: singular value decomposition (SVD), principal component analysis (PCA)


Collaborative Filtering


Alternating Least Squares (ALS)


Collaborative Filtering in Spark MLLIB

from pyspark.mllib.recommendation import ALS, Rating

# load training set: tab-separated (user, product, rating)
trainset = (sc.textFile("s3n://bads-music-dataset/train_*.gz")
    .map(lambda l: l.split('\t'))
    .map(lambda l: Rating(int(l[0]), int(l[1]), int(l[2]))))

model = ALS.train(trainset, rank=10, iterations=10) # train

# load testing set
testset = (sc.textFile("s3n://bads-music-dataset/test_*.gz")
    .map(lambda l: l.split('\t'))
    .map(lambda l: Rating(int(l[0]), int(l[1]), int(l[2]))))

# apply model to testing set (only first two cols) to predict
predictions = (model.predictAll(testset.map(lambda p: (p[0], p[1])))
    .map(lambda r: ((r[0], r[1]), r[2])))
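Keying the predictions by (user, product) prepares them to be matched against the actual ratings, for example to compute an error measure. The slide does not show that step, so the following is only a sketch of one plausible follow-up, written with plain dictionaries so it runs without a cluster; the ratings are hypothetical, and in PySpark the matching would typically be an RDD join on the same key.

```python
# Hypothetical ratings keyed by (user, product), mirroring the shape of
# `predictions` above.
actual = {(1, 10): 4.0, (1, 11): 2.0, (2, 10): 5.0}
predicted = {(1, 10): 3.5, (1, 11): 2.5, (2, 10): 4.0, (2, 12): 1.0}

# "Join" on the key: keep only the pairs present in both sets.
joined = [(actual[k], predicted[k]) for k in actual if k in predicted]

# Mean squared error and its square root over the matched pairs.
mse = sum((a - p) ** 2 for a, p in joined) / len(joined)
rmse = mse ** 0.5
print(mse, rmse)
```

The extra prediction for (2, 12) has no matching actual rating and is dropped by the join, so it does not affect the error.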


Spark MLLIB – ALS Performance

System      Wall-clock time (seconds)
Matlab      15443
Mahout      4206
GraphLab    291
MLlib       481

• Dataset: Netflix data

• Cluster: 9 machines.

• MLlib is an order of magnitude faster than Mahout.

• MLlib is within a factor of 2 of GraphLab.
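The two comparisons can be checked directly against the table. A small sketch using only the wall-clock times listed above (the variable names are made up for this example):

```python
# Wall-clock times in seconds, copied from the table.
wall_clock_s = {"Matlab": 15443, "Mahout": 4206, "GraphLab": 291, "MLlib": 481}

# How many times longer Mahout takes than MLlib.
mahout_vs_mllib = wall_clock_s["Mahout"] / wall_clock_s["MLlib"]

# How many times longer MLlib takes than GraphLab.
mllib_vs_graphlab = wall_clock_s["MLlib"] / wall_clock_s["GraphLab"]

print(round(mahout_vs_mllib, 2), round(mllib_vs_graphlab, 2))
```

Mahout takes close to nine times as long as MLlib, which is what "an order of magnitude" refers to here, and MLlib takes less than twice as long as GraphLab.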


Spark Implementation of ALS

• Workers load data

• Models are instantiated at workers.

• At each iteration, models are shared via join between workers.

• Good scalability.

• Works on large datasets

[Figure: a master connected to several workers]
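The alternating structure behind these bullets can be shown on one machine. The sketch below is not MLlib's implementation: it assumes rank-1 factors, a small hypothetical rating set, and an assumed regularization value, and it has no workers or joins. It only illustrates the core loop of fixing one side's factors and solving the other side in closed form.

```python
# Hypothetical (user, item, rating) triples.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 1.0)]
LAMBDA = 0.1  # regularization strength (assumed value)

def solve(fixed, pairs):
    """Closed-form regularized least squares for one rank-1 factor.

    `pairs` holds (index on the fixed side, rating) for one user or item.
    """
    num = sum(r * fixed[j] for j, r in pairs)
    den = LAMBDA + sum(fixed[j] ** 2 for j, _ in pairs)
    return num / den

def loss(users, items):
    """Squared error on known ratings plus L2 penalty on all factors."""
    err = sum((r - users[u] * items[i]) ** 2 for u, i, r in ratings)
    reg = LAMBDA * (sum(x * x for x in users.values())
                    + sum(y * y for y in items.values()))
    return err + reg

users = {u: 1.0 for u, _, _ in ratings}
items = {i: 1.0 for _, i, _ in ratings}

history = [loss(users, items)]
for _ in range(10):
    # Fix the item factors and solve each user independently.
    for u in users:
        users[u] = solve(items, [(i, r) for uu, i, r in ratings if uu == u])
    # Fix the user factors and solve each item independently.
    for i in items:
        items[i] = solve(users, [(u, r) for u, ii, r in ratings if ii == i])
    history.append(loss(users, items))

print(history[0], history[-1])
```

Because each user (or item) is solved independently once the other side is fixed, the per-user and per-item solves can run in parallel, which is what makes the method suitable for distribution across workers.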


Spark SQL + MLLIB


MLLIB Pointers

• Website: http://spark.apache.org

• Tutorials: http://ampcamp.berkeley.edu

• Spark Summit: http://spark-summit.org

• Github: https://github.com/apache/spark

• Mailing lists: [email protected] [email protected]