[Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Jul 03, 2015

Rakuten, Inc

Rakuten Technology Conference 2014
"Leveraging Spark for Cluster Computing"
Robin M.E. Swezey (Rakuten)
Transcript
Page 1: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Apache Spark, Spark, Apache, and the Spark logo are trademarks of The Apache Software Foundation.

All Spark images and code are from http://spark.apache.org/

Leveraging Spark for Cluster Computing

Robin M. E. Swezey

Rakuten Institute of Technology, Tokyo

Intelligence Domain Group

Page 2: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

What is Spark?

Page 3: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

In short, Spark is the future of open-source MapReduce

Page 4: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Why Spark?

Current Hadoop stack is heterogeneous

Spark = Fully integrated analytics suite and cluster computing framework

Berkeley AMPLab + Apache Software Foundation

Apache Hadoop, Hive, Mahout, Pig and their respective logos are trademarks of The Apache Software Foundation.

Page 5: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Platform

On the surface, very similar to Hadoop

• Relies on HDFS

• Runs on YARN, Mesos, or standalone

• MapReduce + General cluster computing

Page 6: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Platform

Key differences with usual stack

1. Resilient Distributed Dataset (RDD)

Central to Spark (R dataframe-ish)

[Diagram: RDD, RDD, RDD]

Page 7: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Platform

Key differences with usual stack

1. Resilient Distributed Dataset (RDD)

Central to Spark (R dataframe-ish)

[Diagram: RDD<String>, RDD<Tuple>, RDD<Tuple>]
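
The typed chain on this slide can be read as each dataset being derived from the previous one by a transformation. A minimal plain-Python sketch of that idea follows. It is not Spark code: lists stand in for RDDs, and the input format and the names `parse_line` and `to_pair` are hypothetical, chosen only for illustration.

```python
# Plain-Python stand-in for a chain RDD<String> -> RDD<Tuple> -> RDD<Tuple>.
# Lists play the role of RDDs; the "user,item,price" format is an assumption.

def parse_line(line):
    """Turn 'user,item,price' into a (user, item, price) tuple."""
    user, item, price = line.split(",")
    return (user, item, float(price))

def to_pair(record):
    """Keep only the fields needed downstream: (user, price)."""
    user, _item, price = record
    return (user, price)

lines = ["alice,book,12.5", "bob,pen,1.2", "alice,lamp,30.0"]  # "RDD<String>"
records = [parse_line(line) for line in lines]                 # "RDD<Tuple>"
pairs = [to_pair(record) for record in records]                # "RDD<Tuple>"
```

In PySpark the same shape would typically be written as a `textFile(...)` call on the SparkContext followed by two `map(...)` calls, with the work distributed across the cluster instead of running in one process.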

Page 8: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Platform

Key differences with usual stack

2. Better resource utilization

Disk is slow. Memory is fast. Several levels of persistence.

Read blocks from disk

Cache aggregates in memory
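
The "read blocks from disk, cache aggregates in memory" pattern can be sketched in plain Python as follows. This is an illustration of the idea only, not Spark's implementation: the `BlockStore` class and its read counter are hypothetical, and an in-memory list stands in for blocks on disk.

```python
# Plain-Python sketch: read each block from "disk" once, then serve the
# aggregate from memory. BlockStore is a hypothetical name for illustration.

class BlockStore:
    def __init__(self, blocks):
        self._blocks = blocks   # stands in for blocks stored on disk
        self.disk_reads = 0     # counts simulated (slow) disk accesses
        self._cache = {}        # in-memory cache of per-block aggregates

    def _read_block(self, index):
        self.disk_reads += 1    # every call here represents a disk read
        return self._blocks[index]

    def block_sum(self, index):
        """Return the aggregate for one block, computing it at most once."""
        if index not in self._cache:
            self._cache[index] = sum(self._read_block(index))
        return self._cache[index]

    def total(self):
        return sum(self.block_sum(i) for i in range(len(self._blocks)))

store = BlockStore([[1, 2, 3], [4, 5], [6]])
first = store.total()    # reads all three blocks from "disk"
second = store.total()   # served entirely from the in-memory cache
```

In Spark itself the corresponding controls are the RDD `cache()` and `persist()` methods, where `persist()` accepts a storage level such as memory only or memory and disk.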

Page 9: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Platform

Key differences with usual stack

2. Better resource utilization

More cores > more machines. Resource locality.

Each node × each core / each local block
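
One way to picture "each core works on a local block" is a pool of workers on a single node, with one task per block. The sketch below is plain Python under stated assumptions: `process_block` and `run_on_node` are hypothetical names, and threads are used only to keep the example short (in CPython they do not give true CPU parallelism).

```python
import os
from concurrent.futures import ThreadPoolExecutor

def process_block(block):
    """Work done by one task on one local block (here: a partial sum)."""
    return sum(block)

def run_on_node(blocks, cores=None):
    """Run one task per block, with at most `cores` tasks at a time."""
    cores = cores or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=cores) as pool:
        # pool.map() returns results in the same order as the input blocks
        return list(pool.map(process_block, blocks))

node_blocks = [[1, 2], [3, 4], [5, 6], [7, 8]]   # blocks local to this node
partials = run_on_node(node_blocks, cores=2)     # one partial result per block
total = sum(partials)                            # combine the partial results
```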

Page 10: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Platform

Key differences with usual stack

3. Easier development & operations

Scala, Java, Python API

8 lines (Logistic Regression)
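
The slide's own code is not part of this transcript. Logistic regression can be this short because each gradient step is a map over the data points followed by a reduce that sums the results. The plain-Python sketch below shows that structure only; it is not Spark code, and the function names, step size, iteration count, and toy data are all assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def point_gradient(w, x, y):
    """Gradient of the logistic loss for one point; labels y are -1 or +1."""
    scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
    return [scale * xi for xi in x]

def train(points, dims, iterations=100, step=0.5):
    w = [0.0] * dims
    for _ in range(iterations):
        # "map": one gradient per point; "reduce": element-wise sum
        grads = [point_gradient(w, x, y) for x, y in points]
        total = [sum(column) for column in zip(*grads)]
        w = [wi - step * gi for wi, gi in zip(w, total)]
    return w

def predict(w, x):
    return 1 if dot(w, x) >= 0 else -1

# Toy, linearly separable data: ([bias term, feature], label)
points = [([1.0, 2.0], 1), ([1.0, 3.0], 1),
          ([1.0, -2.0], -1), ([1.0, -3.0], -1)]
w = train(points, dims=2)
```

On a cluster, the list comprehension and the column-wise sum would be replaced by distributed `map()` and `reduce()` calls over a cached dataset, which is what keeps the program short.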

Page 11: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Platform

Key differences with usual stack

3. Easier Analytics

Interactive shells in Scala and Python

Easy to connect with SparkContext (e.g. IPython Notebook)

Page 12: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Platform

Key differences with usual stack

4. Integrated Solution

Easy MapReduce

DBMS-like Functionality

Streaming

Machine Learning

Page 13: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Applications

Easy MapReduce

Thanks to RDD Distributed Operators

map()

reduce()

reduceByKey()

groupBy()

sample()

pipe()

foreach()

fold()

histogram()
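
To make the semantics of a few of these operators concrete, here is a plain-Python sketch of what `map()`, `reduceByKey()`, `reduce()`, and `groupBy()` compute on a small word list. It runs in a single process and is not Spark code; the helpers `reduce_by_key` and `group_by` are hypothetical stand-ins written for this illustration.

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    """Merge the values of each key with `func`, in the spirit of reduceByKey()."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

def group_by(items, key_func):
    """Group items under key_func(item), in the spirit of groupBy()."""
    groups = defaultdict(list)
    for item in items:
        groups[key_func(item)].append(item)
    return dict(groups)

words = "to be or not to be".split()
pairs = list(map(lambda w: (w, 1), words))             # map(): word -> (word, 1)
counts = reduce_by_key(pairs, lambda a, b: a + b)      # reduceByKey(): word counts
total = reduce(lambda a, b: a + b, counts.values())    # reduce(): total word count
by_length = group_by(words, len)                       # groupBy(): words by length
```

The word count in the middle two lines is the classic MapReduce example. With RDD operators it becomes a short chain of calls rather than separate mapper and reducer classes.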

Page 14: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Applications

Sped-up Analytics with DBMS-like SQL Functionality

Cf.

• Integrated

• Unified Data Access

• Hive Compatible

• Standard Connectivity

Page 15: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Applications

Streaming

Cf.

Page 16: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Applications

Machine Learning

Cf.

Statistics

Classification / Regression

Collaborative Filtering

Clustering

Dimensionality Reduction

Feature Extraction

Image: http://en.wikipedia.org/wiki/Machine_learning#mediaviewer/File:Linear-svm-scatterplot.svg

Page 17: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Applications

Graph Processing

Flexible

Fast

PageRank

Connected components

Label propagation

SVD++

Triangle count
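
As an illustration of the first algorithm in this list, here is a minimal plain-Python sketch of iterative PageRank. It is not the implementation shipped with Spark. The function name, the damping factor of 0.85, the iteration count, and the toy graph are assumptions, and the sketch assumes every node has at least one outgoing link.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterative PageRank over a dict {node: [outgoing neighbours]}.

    Assumes every node appears as a key and has at least one out-link.
    """
    nodes = list(links)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        contrib = {node: 0.0 for node in nodes}
        for node, outs in links.items():
            # each node splits its current rank evenly among its out-links
            for dest in outs:
                contrib[dest] += rank[node] / len(outs)
        rank = {node: (1 - damping) / len(nodes) + damping * contrib[node]
                for node in nodes}
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

Each iteration is a "send contributions along edges, then sum per node" step, which is the same map-then-reduce-by-key shape as the operators shown earlier in the deck.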

Page 18: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

More

How does it scale?

There are deployed clusters of 1,000+ nodes

Page 19: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

More

There’s open-source, and there’s highly supported open-source

Spark 1.1.0 had 171 contributors!

Page 20: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

In Conclusion

Hadoop Cluster Computing Hype Cycle

Image: http://upload.wikimedia.org/wikipedia/commons/thumb/9/94/Gartner_Hype_Cycle.svg/2000px-Gartner_Hype_Cycle.svg.png

Page 21: [Rakuten TechConf2014] [C-6] Leveraging Spark for Cluster Computing

Thank you!

http://spark.apache.org