Introduction to Spark Matei Zaharia Databricks Intern Event, August 2015

Introduction to Spark (Intern Event Presentation)

Apr 15, 2017

Page 1: Introduction to Spark (Intern Event Presentation)

Introduction to Spark

Matei Zaharia Databricks Intern Event, August 2015

Page 2: Introduction to Spark (Intern Event Presentation)

What is Apache Spark?

Fast and general computing engine for clusters

Makes it easy and fast to process large datasets
• APIs in Java, Scala, Python, R
• Libraries for SQL, streaming, machine learning, …
• 100x faster than Hadoop MapReduce for some apps

Page 3: Introduction to Spark (Intern Event Presentation)

About Databricks

Founded by creators of Spark in 2013

Offers a hosted cloud service built on Spark
• Interactive workspace with notebooks, dashboards, jobs

Page 4: Introduction to Spark (Intern Event Presentation)

Community Growth

Most active open source project in big data

[Figure: Contributors / Month to Spark, 2010 to 2015; y-axis: contributors, 0 to 160]

Page 5: Introduction to Spark (Intern Event Presentation)

Spark Programming Model

Write programs in terms of transformations on distributed datasets

Resilient Distributed Datasets (RDDs)
• Collections of objects stored in memory or disk across a cluster
• Built via parallel transformations (map, filter, …)
• Automatically rebuilt on failure
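
As a rough illustration of the model above, the following pure-Python sketch mimics how a dataset can be described by its lineage (a source plus a chain of transformations), so that a lost in-memory copy can be recomputed. `ToyRDD` and all of its methods are hypothetical names invented for this sketch; it is a single-process toy, not Spark's implementation or API.

```python
from typing import Callable, Iterable, Iterator, List, Optional


class ToyRDD:
    """Single-process stand-in for an RDD: it stores a recipe, not the data."""

    def __init__(self, compute: Callable[[], Iterable]):
        self._compute = compute                 # lineage: how to rebuild the data
        self._cached: Optional[List] = None     # in-memory copy, if any
        self._want_cache = False

    @staticmethod
    def from_list(items: List) -> "ToyRDD":
        return ToyRDD(lambda: iter(items))

    def map(self, f: Callable) -> "ToyRDD":
        # Lazy: nothing runs until collect() is called.
        return ToyRDD(lambda: (f(x) for x in self._iter()))

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(lambda: (x for x in self._iter() if pred(x)))

    def cache(self) -> "ToyRDD":
        self._want_cache = True
        return self

    def lose_cache(self) -> None:
        """Simulate a failure that drops the in-memory copy."""
        self._cached = None

    def _iter(self) -> Iterator:
        if self._cached is not None:
            return iter(self._cached)
        data = list(self._compute())            # recompute from lineage
        if self._want_cache:
            self._cached = data
        return iter(data)

    def collect(self) -> List:
        return list(self._iter())


numbers = ToyRDD.from_list([1, 2, 3, 4, 5, 6])
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()

first = evens_squared.collect()      # computed on demand, then cached
evens_squared.lose_cache()           # simulated failure
rebuilt = evens_squared.collect()    # rebuilt by replaying the transformations
print(first, rebuilt)
```

In this toy, "rebuilt on failure" just means replaying the stored recipe; a real cluster would do this per partition rather than for the whole dataset.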

Page 6: Introduction to Spark (Intern Event Presentation)

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")  # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))  # transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])  # transformed RDD
messages.cache()

messages.filter(lambda s: "MySQL" in s).count()  # action
messages.filter(lambda s: "Redis" in s).count()
. . .

[Diagram: the driver sends tasks to three workers and receives results; each worker reads one block of the file (Block 1 to 3) and holds its part of the cached data (Cache 1 to 3)]

Result: full-text search of Wikipedia in 0.5 sec (vs 20s for on-disk data)
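
To make the data flow concrete without a cluster, here is a minimal single-machine sketch of the same pipeline in plain Python. The sample log lines and their tab-separated layout (level, timestamp, message) are assumptions made up for illustration, and an ordinary list stands in for the cached dataset.

```python
# Hypothetical tab-separated log lines: level, timestamp, message.
sample_log = [
    "ERROR\t2015-08-01 10:00:01\tMySQL connection refused",
    "INFO\t2015-08-01 10:00:02\tRequest served",
    "ERROR\t2015-08-01 10:00:03\tRedis timeout",
    "ERROR\t2015-08-01 10:00:04\tMySQL deadlock detected",
    "WARN\t2015-08-01 10:00:05\tSlow response",
]

# Same steps as the slide: keep ERROR lines, then extract the third field.
errors = [s for s in sample_log if s.startswith("ERROR")]
messages = [s.split("\t")[2] for s in errors]  # materialized once, like cache()

# Each query now scans only the in-memory messages, not the raw log.
mysql_count = sum(1 for s in messages if "MySQL" in s)
redis_count = sum(1 for s in messages if "Redis" in s)
print(mysql_count, redis_count)
```

The point of the design is that the expensive steps (reading and parsing the log) happen once, while each interactive query touches only the small cached result.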

Page 7: Introduction to Spark (Intern Event Presentation)

Example: Logistic Regression

Iterative algorithm used in machine learning

[Figure: running time (s) vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark; y-axis 0 to 4000 s]

Hadoop: 110 s / iteration
Spark: first iteration 80 s, further iterations 1 s
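
The per-iteration figures on this slide imply the totals computed below. The sketch assumes running time is simply additive across iterations (110 s each for Hadoop; 80 s for the first Spark iteration and 1 s for each later one), which is a simplification of the plotted measurements; the function names are made up for illustration.

```python
def hadoop_total(iterations: int) -> int:
    """Total seconds if every iteration costs 110 s."""
    return 110 * iterations


def spark_total(iterations: int) -> int:
    """Total seconds if the first iteration costs 80 s and the rest 1 s each."""
    return 80 + 1 * (iterations - 1)


# Iteration counts taken from the chart's x-axis.
for n in (1, 5, 10, 20, 30):
    print(n, hadoop_total(n), spark_total(n))
```

Under these assumptions the gap widens with every iteration, because only the first Spark iteration pays the cost of loading the data.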

Page 8: Introduction to Spark (Intern Event Presentation)

On-Disk Performance

Time to sort 100 TB

2013 Record (Hadoop): 2100 machines, 72 minutes
2014 Record (Spark): 207 machines, 23 minutes

Source: Daytona GraySort benchmark, sortbenchmark.org
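
For a sense of scale, the two records can be compared with simple arithmetic. The sketch below uses only the machine counts and times from this slide; "machine-minutes" is a crude measure introduced here for illustration and ignores differences in hardware between the two runs.

```python
hadoop = {"machines": 2100, "minutes": 72}
spark = {"machines": 207, "minutes": 23}

# Ratio of wall-clock times and of cluster sizes.
speedup = hadoop["minutes"] / spark["minutes"]
machine_ratio = hadoop["machines"] / spark["machines"]

# Crude resource measure: machines multiplied by minutes.
hadoop_machine_minutes = hadoop["machines"] * hadoop["minutes"]
spark_machine_minutes = spark["machines"] * spark["minutes"]
resource_ratio = hadoop_machine_minutes / spark_machine_minutes

print(round(speedup, 2), round(machine_ratio, 2), round(resource_ratio, 1))
```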

Page 9: Introduction to Spark (Intern Event Presentation)

Higher-Level Libraries

[Diagram: four libraries built on top of the Spark core engine]
• Spark Streaming: real-time
• Spark SQL: structured data
• MLlib: machine learning
• GraphX: graph

Page 10: Introduction to Spark (Intern Event Presentation)

Higher-Level Libraries

# Load data using SQL
points = ctx.sql("select latitude, longitude from tweets")

# Train a machine learning model
model = KMeans.train(points, 10)

# Apply it to a stream
(sc.twitterStream(...)
    .map(lambda t: (model.predict(t.location), 1))
    .reduceByWindow("5s", lambda a, b: a + b))
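
The slide's snippet is schematic, so as a rough single-machine analogue, the sketch below assigns points to the nearest of a few fixed centers and counts assignments per 5-second window. The centers, the `(timestamp, latitude, longitude)` events, and the helper names are all hypothetical; it uses non-overlapping windows, calls no Spark APIs, and skips model training entirely.

```python
from collections import Counter
from typing import Dict, List, Tuple

Point = Tuple[float, float]

# Hypothetical cluster centers (a trained model would supply these).
centers: List[Point] = [(37.8, -122.4), (40.7, -74.0)]


def predict(p: Point) -> int:
    """Return the index of the nearest center (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2,
    )


# Hypothetical stream events: (timestamp in seconds, latitude, longitude).
events = [
    (0.5, 37.7, -122.5),
    (1.2, 40.8, -73.9),
    (4.9, 37.9, -122.3),
    (6.0, 40.6, -74.1),
]


def count_by_window(evts, window_s: float = 5.0) -> Dict[int, Counter]:
    """Group events into non-overlapping windows; count predictions per cluster."""
    windows: Dict[int, Counter] = {}
    for ts, lat, lon in evts:
        w = int(ts // window_s)  # window index for this timestamp
        windows.setdefault(w, Counter())[predict((lat, lon))] += 1
    return windows


result = count_by_window(events)
print(result)
```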

Page 11: Introduction to Spark (Intern Event Presentation)

Demo

Page 12: Introduction to Spark (Intern Event Presentation)

Spark Community

Over 1000 production users, clusters up to 8000 nodes

Many talks online at spark-summit.org

Page 13: Introduction to Spark (Intern Event Presentation)
Page 14: Introduction to Spark (Intern Event Presentation)

Ongoing Work

Speeding up Spark through code generation and binary processing (Project Tungsten)

R interface to Spark (SparkR)

Real-time machine learning library

Frontend and backend work in Databricks (visualization, collaboration, auto-scaling, …)

Page 15: Introduction to Spark (Intern Event Presentation)

Thank you. We’re hiring!