
Apache Spark - Fast and Expressive Big Data …Fast and Expressive Big Data Analytics with Python UC BERKELEY UC Berkeley / MIT spark-project.org What is Spark? Fast and expressive

May 20, 2020

Page 1: Apache Spark - Fast and Expressive Big Data …Fast and Expressive Big Data Analytics with Python UC BERKELEY UC Berkeley / MIT spark-project.org What is Spark? Fast and expressive

Matei Zaharia

Fast and Expressive Big Data Analytics with Python

UC BERKELEY

spark-project.org UC Berkeley / MIT

Page 2:

What is Spark?

Fast and expressive cluster computing system interoperable with Apache Hadoop

Improves efficiency through:
» In-memory computing primitives
» General computation graphs

Improves usability through:
» Rich APIs in Scala, Java, Python
» Interactive shell

Up to 100× faster (2-10× on disk)

Often 5× less code

Page 3:

Project History

Started in 2009, open sourced 2010

17 companies now contributing code
» Yahoo!, Intel, Adobe, Quantifind, Conviva, Bizo, …

Entered Apache incubator in June

Python API added in February

Page 4:

An Expanding Stack

Spark is the basis for a wide set of projects in the Berkeley Data Analytics Stack (BDAS)

Spark
» Spark Streaming (real-time)
» GraphX (graph)
» Shark (SQL)
» MLbase (machine learning)

More details: amplab.berkeley.edu

Page 5:

This Talk

Spark programming model

Examples

Demo

Implementation

Trying it out

Page 6:

Why a New Programming Model?

MapReduce simplified big data processing, but users quickly found two problems:

Programmability: tangle of map/reduce functions

Speed: MapReduce inefficient for apps that share data across multiple steps
» Iterative algorithms, interactive queries

Page 7:

Data Sharing in MapReduce

[Figure: an iterative job reads its input from HDFS and writes back to HDFS after each iteration (iter. 1, iter. 2, . . .); interactive queries (query 1, query 2, query 3, . . .) each read the input from HDFS again to produce result 1, result 2, result 3.]

Slow due to data replication and disk I/O

Page 8:

What We’d Like

[Figure: the input goes through one-time processing into distributed memory; iterations (iter. 1, iter. 2, . . .) and queries (query 1, query 2, query 3, . . .) then share the data in memory.]

10-100× faster than network and disk

Page 9:

Spark Model

Write programs in terms of transformations on distributed datasets

Resilient Distributed Datasets (RDDs)
» Collections of objects that can be stored in memory or disk across a cluster
» Built via parallel transformations (map, filter, …)
» Automatically rebuilt on failure
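The RDD idea can be mimicked on a single machine. The following is a toy sketch, not Spark's implementation: `ToyRDD` is a hypothetical class invented for illustration. It holds a list of partitions plus a recorded chain of transformations, and only computes when an action (`collect`, `count`) is called.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: partitions plus a lazy recipe."""

    def __init__(self, partitions, steps=()):
        self.partitions = partitions      # list of lists, one per "worker"
        self.steps = tuple(steps)         # transformations recorded so far

    def map(self, f):
        # Transformation: returns a new dataset and computes nothing yet.
        return ToyRDD(self.partitions, self.steps + (("map", f),))

    def filter(self, f):
        # Transformation: also only recorded.
        return ToyRDD(self.partitions, self.steps + (("filter", f),))

    def _compute(self, part):
        # Replay the recorded steps on one partition.
        for kind, f in self.steps:
            if kind == "map":
                part = [f(x) for x in part]
            else:
                part = [x for x in part if f(x)]
        return part

    def collect(self):
        # Action: runs the recorded steps on every partition.
        return [x for part in self.partitions for x in self._compute(part)]

    def count(self):
        # Action: number of elements after all steps.
        return len(self.collect())


nums = ToyRDD([[1, 2], [3, 4], [5]])
evens_squared = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(evens_squared.collect())   # [4, 16]
```

Because each `ToyRDD` keeps only the source partitions and the recipe, any partition can be recomputed independently, which is the property the "rebuilt on failure" bullet relies on.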

Page 10:

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")                      # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))    # transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "foo" in s).count()             # action
messages.filter(lambda s: "bar" in s).count()
. . .

[Figure: the driver sends tasks to three workers and receives results; each worker reads one block of the file (Block 1, Block 2, Block 3) and holds one cache (Cache 1, Cache 2, Cache 3).]

Result: full-text search of Wikipedia in 2 sec (vs 30 s for on-disk data)

Result: scaled to 1 TB data in 7 sec (vs 180 sec for on-disk data)
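To see what each step of the log-mining pipeline computes, it can be traced on a few in-memory lines with plain Python. This is only a local sketch: the sample lines are made up, and the tab-separated layout (level, timestamp, message) is an assumption chosen so that field 2 is the message.

```python
# Made-up sample lines; layout assumed to be level, timestamp, message.
lines = [
    "ERROR\t10:01\tdisk foo failure",
    "INFO\t10:02\tall good",
    "ERROR\t10:03\tbar timeout",
    "ERROR\t10:04\tfoo and bar both down",
]

# Same two steps as the slide: keep error lines, then extract field 2.
errors = [s for s in lines if s.startswith("ERROR")]
messages = [s.split("\t")[2] for s in errors]   # held in memory, like cache()

# Repeated searches reuse the in-memory list instead of re-reading the log.
foo_count = sum(1 for s in messages if "foo" in s)
bar_count = sum(1 for s in messages if "bar" in s)
print(foo_count, bar_count)   # 2 2
```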

Page 11:

Fault Tolerance

RDDs track the transformations used to build them (their lineage) to recompute lost data

messages = textFile(...).filter(lambda s: "ERROR" in s) \
                        .map(lambda s: s.split("\t")[2])

Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = lambda s: …) → MappedRDD (func = lambda s: …)
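Lineage-based recovery can be illustrated with a small single-machine model. This is a sketch under stated assumptions, not how Spark implements it: `LineageDataset` is a hypothetical class, the source partitions stand in for durable input such as HDFS, and "failure" is simulated by deleting a cached partition.

```python
class LineageDataset:
    """Hypothetical toy: remembers how each partition is derived."""

    def __init__(self, source_partitions, funcs):
        self.source = source_partitions   # stands in for durable input
        self.funcs = funcs                # lineage: ordered per-partition functions
        self.cache = {}                   # partition index -> computed data
        self.recomputed = 0               # how many partition computations ran

    def partition(self, i):
        if i not in self.cache:
            data = self.source[i]
            for f in self.funcs:          # replay the lineage for this partition only
                data = f(data)
            self.cache[i] = data
            self.recomputed += 1
        return self.cache[i]


source = [["ERROR\ta\tx1", "INFO\tb\tx2"], ["ERROR\tc\tx3"]]
lineage = [
    lambda part: [s for s in part if "ERROR" in s],      # filter step
    lambda part: [s.split("\t")[2] for s in part],       # map step
]
messages = LineageDataset(source, lineage)

first = [messages.partition(i) for i in range(2)]   # computes both partitions
del messages.cache[1]                                # simulate losing one cached partition
again = messages.partition(1)                        # rebuilt from lineage alone
print(again, messages.recomputed)                    # ['x3'] 3
```

Only the lost partition is recomputed; partition 0 stays cached and untouched.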

Page 12:

Example: Logistic Regression

Goal: find line separating two sets of points

[Figure: scatter plot of + and – points, showing a random initial line and the target line that separates the two sets.]

Page 13:

Example: Logistic Regression

import numpy
from math import exp

data = spark.textFile(...).map(readPoint).cache()   # points with fields x and y

w = numpy.random.rand(D)

for i in range(iterations):
    # Sum of per-point gradients of the logistic loss; the "- 1" term
    # makes "w -= gradient" move w toward separating the points.
    gradient = data.map(lambda p:
        (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
    ).reduce(lambda x, y: x + y)
    w -= gradient

print("Final w: %s" % w)
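The same map-then-reduce gradient loop can be run locally without a cluster. The sketch below is a pure-Python stand-in, assuming labels of +1/-1 and the standard logistic-loss gradient; the four sample points, the helper names (`dot`, `point_gradient`, `train`) and the iteration count are all made up for illustration.

```python
import math
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def point_gradient(w, x, y):
    # Gradient of log(1 + exp(-y * w.x)) with respect to w, for one point.
    scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
    return [scale * xi for xi in x]


def train(points, iterations=50, seed=0):
    rng = random.Random(seed)
    w = [rng.random() for _ in range(len(points[0][0]))]
    for _ in range(iterations):
        # "map": one gradient per point; "reduce": element-wise sum.
        grads = [point_gradient(w, x, y) for x, y in points]
        total = [sum(col) for col in zip(*grads)]
        w = [wi - gi for wi, gi in zip(w, total)]
    return w


# Made-up, linearly separable points: (features, label).
points = [([2.0, 1.0], 1), ([1.5, 2.0], 1), ([-1.0, -1.5], -1), ([-2.0, -0.5], -1)]
w = train(points)
predictions = [1 if dot(w, x) > 0 else -1 for x, _ in points]
print(predictions)   # [1, 1, -1, -1]
```

In the cluster version, the list comprehension over `points` corresponds to `data.map(...)` and the element-wise sum corresponds to `reduce`; because `data` is cached, only the first pass pays the cost of loading it.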

Page 14:

Logistic Regression Performance

[Figure: running time (s) vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and PySpark.]

Hadoop: 110 s / iteration
PySpark: first iteration 80 s, further iterations 5 s
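As a back-of-envelope check of these figures, total running time can be estimated from the per-iteration numbers on the slide. This assumes the per-iteration costs stay constant, which the slide does not state; the function names are hypothetical.

```python
def hadoop_total(iterations, per_iteration=110):
    # Every iteration re-reads the data, so cost is the same each time.
    return per_iteration * iterations


def pyspark_total(iterations, first=80, later=5):
    # First iteration loads and caches the data; later ones reuse the cache.
    return first + later * (iterations - 1)


for n in (1, 5, 10, 20, 30):
    print(n, hadoop_total(n), pyspark_total(n))
# At 30 iterations: 3300 s vs. 225 s under these assumptions.
```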

Page 15:

Demo

Page 16:

Supported Operators

map
filter
groupBy
union
join
leftOuterJoin
rightOuterJoin
reduce
count
fold
reduceByKey
groupByKey
cogroup
flatMap
take
first
partitionBy
pipe
distinct
save
...
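For readers new to these operator names, the meaning of a few of them can be shown on ordinary Python lists. The helpers below (`flat_map`, `reduce_by_key`, `join`) are hypothetical local equivalents written for illustration; they are not PySpark functions and ignore distribution entirely.

```python
from collections import defaultdict
from functools import reduce


def flat_map(f, items):
    # Like flatMap: f returns a sequence per item; the results are concatenated.
    return [out for item in items for out in f(item)]


def reduce_by_key(f, pairs):
    # Like reduceByKey: combine all values that share a key using f.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return {k: reduce(f, vs) for k, vs in groups.items()}


def join(left, right):
    # Like join: inner join of two lists of (key, value) pairs.
    return [(k, (v, w)) for k, v in left for k2, w in right if k == k2]


words = flat_map(str.split, ["a b", "b c b"])
counts = reduce_by_key(lambda x, y: x + y, [(w, 1) for w in words])
pairs = join([("a", 1), ("b", 2)], [("b", "x"), ("c", "y")])

print(words)    # ['a', 'b', 'b', 'c', 'b']
print(counts)   # {'a': 1, 'b': 3, 'c': 1}
print(pairs)    # [('b', (2, 'x'))]
```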

Page 17:

Other Engine Features

General operator graphs (not just map-reduce)

Hash-based reduces (faster than Hadoop’s sort)

Controlled data partitioning to save communication

PageRank Performance (iteration time, s)
Hadoop: 171
Basic Spark: 72
Spark + Controlled Partitioning: 23
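Why controlled partitioning saves communication can be sketched locally: if two keyed datasets are hash-partitioned with the same function, matching keys land in the same partition index, so a join can run partition by partition without moving records. The code below is an illustration only; `partition_by` and `local_join` are hypothetical helpers, and the `links` and `ranks` data are made up in the spirit of PageRank.

```python
def partition_by(pairs, num_partitions):
    """Hash-partition (key, value) pairs; the same key always gets the same index."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[hash(key) % num_partitions].append((key, value))
    return parts


def local_join(left_part, right_part):
    """Join two co-located partitions without looking at any other partition."""
    right = {}
    for key, value in right_part:
        right.setdefault(key, []).append(value)
    return [(k, (v, w)) for k, v in left_part for w in right.get(k, [])]


links = [(1, [2, 3]), (2, [3]), (3, [1])]   # page id -> outgoing links
ranks = [(1, 1.0), (2, 1.0), (3, 1.0)]      # page id -> current rank

n = 2
link_parts = partition_by(links, n)
rank_parts = partition_by(ranks, n)

# Each partition pair is joined independently; no record crosses partitions.
joined = [pair for i in range(n) for pair in local_join(link_parts[i], rank_parts[i])]
print(sorted(joined))
```

In an iterative job such as PageRank, the large `links` dataset would be partitioned once and reused, so only the much smaller `ranks` data needs to move each iteration.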

Page 18:

Spark Community

1000+ meetup members

60+ contributors

17 companies contributing

Page 19:

This Talk

Spark programming model

Examples

Demo

Implementation

Trying it out

Page 20:

Overview

Spark core is written in Scala

PySpark calls existing scheduler, cache and networking layer (2K-line wrapper)

No changes to Python

[Figure: your app runs on PySpark, which sits on top of the Spark client; each Spark worker has Python child processes.]

Page 21:

Overview

Main PySpark author: Josh Rosen (cs.berkeley.edu/~joshrosen)

Page 22:

Object Marshaling

Uses pickle library for both communication and cached data
» Much cheaper than Python objects in RAM

Lambda marshaling library by PiCloud
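A short standard-library experiment shows both halves of this slide: `pickle` round-trips plain data into a compact byte string, but it cannot serialize a lambda on its own, which is presumably why a separate lambda marshaling library is needed. The `can_pickle` helper and the sample record are made up for illustration.

```python
import pickle


def can_pickle(obj):
    # True if the standard pickle module can serialize obj.
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


record = ("ERROR", 42, [1.5, 2.5])
payload = pickle.dumps(record)      # bytes suitable for sending or caching
restored = pickle.loads(payload)

print(restored == record)               # True
print(can_pickle(lambda x: 2 * x))      # False
```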

Page 23:

Job Scheduler

Supports general operator graphs

Automatically pipelines functions

Aware of data locality and partitioning

[Figure: operator graph over datasets A to G using groupBy, map, union and join, divided into Stage 1, Stage 2 and Stage 3; the legend marks cached data partitions.]
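Pipelining means that consecutive per-element operations are applied in a single pass over a partition instead of materializing an intermediate dataset after each one. A minimal local sketch of that idea follows; `run_stage` is a hypothetical helper, not the Spark scheduler, and the `trace` list exists only to make the order of evaluation visible.

```python
def run_stage(partition, ops, trace):
    """Apply a chain of per-element ops in one pass over one partition."""
    out = []
    for x in partition:
        keep = True
        for kind, f in ops:
            trace.append((kind, x))      # record which op saw which value
            if kind == "map":
                x = f(x)
            elif not f(x):               # "filter": drop the element
                keep = False
                break
        if keep:
            out.append(x)
    return out


trace = []
ops = [("map", lambda x: x * 10), ("filter", lambda x: x > 15)]
result = run_stage([1, 2, 3], ops, trace)
print(result)   # [20, 30]
print(trace)    # each element passes through map then filter before the next one starts
```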

Page 24:

Interoperability

Runs in standard CPython, on Linux / Mac
» Works fine with extensions, e.g. NumPy

Input from local file system, NFS, HDFS, S3
» Only text files for now

Works in IPython, including notebook

Works in doctests – see our tests!

Page 25:

Getting Started

Visit spark-project.org for video tutorials, online exercises, docs

Easy to run in local mode (multicore), standalone clusters, or EC2

Training camp at Berkeley in August (free video): ampcamp.berkeley.edu

Page 26:

Getting Started

Easiest way to learn is the shell:

$ ./pyspark

>>> nums = sc.parallelize([1,2,3])   # make RDD from array

>>> nums.count()
3

>>> nums.map(lambda x: 2 * x).collect()
[2, 4, 6]

Page 27:

Writing Standalone Jobs

from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext("local", "WordCount")
    lines = sc.textFile("in.txt")

    counts = lines.flatMap(lambda s: s.split()) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda x, y: x + y)

    counts.saveAsTextFile("out.txt")

Page 28:

Conclusion

PySpark provides a fast and simple way to analyze big datasets from Python

Learn more or contribute at spark-project.org

Look for our training camp on August 29-30!

My email: [email protected]

Page 29:

Behavior with Not Enough RAM

Iteration time (s) by % of working set in memory:
Cache disabled: 68.8
25%: 58.1
50%: 40.7
75%: 29.7
Fully cached: 11.5

Page 30:

The Rest of the Stack

Spark is the foundation for a wide set of projects in the Berkeley Data Analytics Stack (BDAS)

Spark
» Spark Streaming (real-time)
» GraphX (graph)
» Shark (SQL)
» MLbase (machine learning)

More details: amplab.berkeley.edu

Page 31:

Performance Comparison

[Figure: three bar charts.]
SQL: response time (s) for Impala (disk), Impala (mem), Redshift, Shark (disk), Shark (mem)
Streaming: throughput (MB/s/node) for Storm and Spark
Graph: response time (min) for Hadoop, Giraph, GraphLab, GraphX