Introduction to Apache Spark and MLlib

Apache Spark

Pushkar Umaranikar, CS 267

Guided by: Dr. Tran

Outline

• Why Spark?
• Introduction to Spark
• Spark programming model
• Installing Spark
• Launching Spark on Amazon EC2
• Example
• Observation
• Introduction to MLlib
• KMeans algorithm

Why Spark?

• MapReduce became popular for complex multi-stage algorithms, e.g. machine learning and iterative graph algorithms.

• Multi-stage and interactive applications require faster data sharing across parallel jobs.

• Why not run MapReduce in memory?

What is Spark?

• Fast, MapReduce-like engine.
• Uses in-memory cluster computing.
• Compatible with Hadoop’s storage APIs.
• Has APIs in Scala, Java, and Python.
• Useful for large datasets and iterative algorithms.
• Up to 40x faster than Hadoop.
• Support for
  Shark – Hive on Spark
  MLlib – Machine learning library
  GraphX – Graph processing

History

• Started as a research project at the UC Berkeley AMPLab in 2009, and was open sourced in early 2010.

• After being released, Spark grew a developer community on GitHub and entered Apache in 2013 as its permanent home.

• Codebase size
  Spark: 20,000 LOC
  Hadoop 1.0: 90,000 LOC

Spark : Programming Model

• Resilient Distributed Datasets (RDDs) are the basic building block.
  Distributed collections of objects that can be cached in memory across cluster nodes.
  Automatically rebuilt on failure.

• RDD operations
  Transformations: create a new dataset from an existing one, e.g. map.
  Actions: return a value to the driver program after running a computation on the dataset, e.g. reduce.
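The split between transformations and actions can be illustrated with a toy, single-machine sketch in plain Python. This is not Spark's API: `MiniRDD`, `square`, and the sample data are hypothetical names introduced only for illustration, and the real engine also distributes and caches data across nodes. The sketch assumes only that a transformation records a step and an action triggers the computation.

```python
from functools import reduce as fold


class MiniRDD:
    """Toy stand-in for an RDD (hypothetical; not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source        # original data, kept so results can be recomputed
        self._steps = tuple(steps)   # recorded transformations

    def map(self, fn):
        # Transformation: describes a new dataset; nothing is computed yet.
        return MiniRDD(self._source, self._steps + (fn,))

    def _compute(self):
        # Replay every recorded step over the source data.
        for item in self._source:
            for fn in self._steps:
                item = fn(item)
            yield item

    def reduce(self, fn):
        # Action: runs the computation and returns a value to the caller.
        return fold(fn, self._compute())


calls = []


def square(x):
    calls.append(x)  # record when the function actually runs
    return x * x


numbers = MiniRDD([1, 2, 3, 4])
squares = numbers.map(square)               # transformation only
calls_before_action = list(calls)           # still empty at this point
total = squares.reduce(lambda a, b: a + b)  # action: square() runs now
```

Because each `MiniRDD` keeps its source and its recorded steps, a result can be recomputed from scratch, which is a simplified picture of how a lost dataset can be rebuilt after a failure.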

Data Sharing

• MapReduce

• Spark

Installing Apache Spark

• Get Spark from the downloads page of the Apache Spark site.

• Go into the top-level Spark directory and run
  $ sbt/sbt assembly

Installing Apache Spark (2)

Launching Spark on EC2

• Set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.

• Go into the ec2 directory in the release of Spark you downloaded.

• Run ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>

Launching Spark on EC2 (2)

-k <keypair> : Name of the EC2 key pair.
-i <key-file> : Private key of your key pair.
-s <num-slaves> : Number of slaves to launch.
launch <cluster-name> : Name to give your cluster.
-r <ec2-region> : EC2 region in which to launch instances.
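As a small sketch of how these options fit together, the plain-Python helper below assembles the launch command line. The function name `spark_ec2_launch_command` and all argument values (key pair name, key file, region, cluster name) are hypothetical placeholders; only the flags listed above are used.

```python
import shlex


def spark_ec2_launch_command(keypair, key_file, num_slaves, cluster_name, region=None):
    """Build the argument list for ./spark-ec2 launch (illustrative helper)."""
    args = ["./spark-ec2", "-k", keypair, "-i", key_file, "-s", str(num_slaves)]
    if region is not None:
        args += ["-r", region]          # optional EC2 region
    args += ["launch", cluster_name]    # action and cluster name come last
    return args


# Placeholder values; substitute your own key pair, key file, and cluster name.
command = spark_ec2_launch_command(
    "my-keypair", "my-keypair.pem", 2, "my-spark-cluster", region="us-west-2"
)
command_line = shlex.join(command)  # single string suitable for printing or logging
```

The helper only builds the command; running it still requires the AWS environment variables and the ec2 directory described earlier.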

Launching Spark on EC2(3)

Spark on EC2

• Terminating a cluster
  Run ./spark-ec2 destroy <cluster-name>

• Stopping a cluster
  Run ./spark-ec2 stop <cluster-name>

• Restarting a cluster
  Run ./spark-ec2 -i <key-file> start <cluster-name>

• Accessing data in S3
  s3n://<bucket>/path

Components

Spark Example : Word count

• Spark Java API is defined in the org.apache.spark.api.java package, and includes a JavaSparkContext for initializing Spark.

Arguments to JavaSparkContext:

• Cluster URL to connect to
• Name of your application
• Location where Spark is installed
• Collection of JARs to send to the cluster. Can be from local or from HDFS

Spark Example : Word count(2)

• To split the lines into words, use flatMap to split each line on whitespace.

• Map each word to a (word, 1) pair.

• Use reduceByKey to count the occurrences of each word.
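The three steps above can be sketched in plain Python to show the data flow. This is a single-machine illustration using only the standard library, not the Spark Java API; the sample `lines` are made up for the example.

```python
from collections import defaultdict
from itertools import chain

# Sample input (hypothetical); in Spark this would be a dataset of text lines.
lines = ["to be or not to be", "that is the question"]

# flatMap step: split each line on whitespace and flatten into one stream of words.
words = chain.from_iterable(line.split() for line in lines)

# map step: turn each word into a (word, 1) pair.
pairs = ((word, 1) for word in words)

# reduceByKey step: add up the values that share the same key (the word).
totals = defaultdict(int)
for word, one in pairs:
    totals[word] += one

counts = dict(totals)
```

In Spark, each of these steps runs in parallel across the cluster, and the per-key sum is combined across partitions; the sketch only mirrors the logical sequence.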

Spark Example : Word count(3)

Word count example : MapReduce

Observation

• Word count program execution time
  Apache Spark: 13.48 s
  MapReduce: 21.82 s

• Spark ran the program about 1.6x faster than Hadoop MapReduce (21.82 s / 13.48 s ≈ 1.62).

Introduction to MLlib

• Spark’s scalable machine learning library.
• Currently supports
  Binary classification
  Regression
  Clustering
  Collaborative filtering

KMeans Algorithm – Clustering Example

• Parameters – MLlib implementation
  Number of clusters (k)
  Max. number of iterations to run
  Epsilon (converged distance)
  Initialization mode (random or kmeans||)
  Initialization steps (number of steps in kmeans||)
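To show how the first three parameters interact, here is a minimal single-machine sketch of the standard k-means iteration (Lloyd's algorithm) in plain Python with random initialization. It is not MLlib's implementation and does not cover the kmeans|| initialization; the function name `kmeans`, its defaults, the `seed` argument, and the sample points are assumptions made for illustration.

```python
import math
import random


def kmeans(points, k, max_iterations=20, epsilon=1e-4, seed=0):
    """Plain-Python k-means sketch with random initialization."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # "random" mode: pick k input points as centers
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its assigned points.
        new_centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centers[i]  # keep the old center if a cluster is empty
            for i, members in enumerate(clusters)
        ]
        # Stop once no center moved by more than epsilon (the converged distance).
        moved = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved < epsilon:
            break
    return centers


# Two well-separated groups of made-up 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers = sorted(kmeans(points, k=2))
```

The loop ends either when the centers stop moving by more than epsilon or when the maximum number of iterations is reached, whichever comes first.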

KMeans Example

Arguments to the example program:

• Cluster URL to connect to
• File containing input matrix
• Number of clusters
• Converged distance

KMeans Example(2)

