Introduction to Apache Spark

Intro to Apache Spark - Lab

Mammoth Data
Apr 14, 2017

Transcript
Page 1: Intro to Apache Spark - Lab

Introduction to Apache Spark

Page 2: Intro to Apache Spark - Lab

www.mammothdata.com | @mammothdataco

Lab Overview

● ‘Hello world’ RDD example
● Importing a dataset
● Dataframe operations and visualizations
● Using MLlib on dataset

Page 3: Intro to Apache Spark - Lab

Lab — Hello World

● ./run_spark

Page 4: Intro to Apache Spark - Lab

Lab — Hello World

● val text = sc.parallelize(Seq("your text here"))
● val words = text.flatMap(line => line.split(" "))
● words.collect

Page 5: Intro to Apache Spark - Lab

Lab — Hello World

● val taggedWords = words.map(word => (word,1))
● val counts = taggedWords.reduceByKey(_ + _)
● counts.collect()

Page 6: Intro to Apache Spark - Lab

Lab — Dataset

● https://archive.ics.uci.edu/ml/datasets/Wine
● Information on 3 different types of wine from Genoa
● 178 entries (small!)

Page 7: Intro to Apache Spark - Lab

Lab — Loading The Wine Dataset

● val wines = sqlContext.read.json("wine.json")
● wines.registerTempTable("wines")

Page 8: Intro to Apache Spark - Lab

Lab — Showing the generated Schema

● wines.printSchema

Page 9: Intro to Apache Spark - Lab

Lab — Dataframe Operations

● wines.first

Page 10: Intro to Apache Spark - Lab

Lab — Dataframe Operations

● sqlContext.sql("SELECT Type, count(Type) AS count FROM wines GROUP BY Type").show

Page 11: Intro to Apache Spark - Lab

Lab — Dataframe Operations

● Experiment with %sql on the dataset (SELECT, COUNT, etc)

Page 12: Intro to Apache Spark - Lab

Lab — K-means Clustering

● K-Means clustering is an unsupervised algorithm which splits a dataset into a number of clusters (k) based on a notion of similarity between points. It is often applied to real-world data to obtain a picture of structure hidden in large datasets, for example, identifying location clusters or breaking down sales into distinct purchasing groups.

Page 13: Intro to Apache Spark - Lab

Lab — K-means Clustering

k initial "means" (in this case k=3) are randomly generated within the data domain (shown in colour).

Page 14: Intro to Apache Spark - Lab

Lab — K-means Clustering

k (in this case, 3) clusters are created by assigning each data point to the closest mean.

Page 15: Intro to Apache Spark - Lab

Lab — K-means Clustering

The centroid of each of these clusters is found, and these are used as the new means. New clusters are formed by assigning each data point to the closest of these new means, as shown in Step 2. The process is repeated until the means converge (or until we hit our iteration limit).
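The steps above can be sketched in plain Scala (no Spark) on 1-D points. This is a hypothetical `KMeansSketch` helper, not MLlib's implementation: it seeds with the first k points rather than random ones, but the assign/recompute loop is the same one `KMeans.train` runs in a distributed fashion on the following slides.

```scala
// Minimal k-means loop on 1-D points, mirroring the three steps above.
object KMeansSketch {
  def cluster(points: Seq[Double], k: Int, iterations: Int): Seq[Double] = {
    // Step 1: pick k initial means (here: first k points; MLlib randomizes)
    var means: Seq[Double] = points.take(k)
    for (_ <- 1 to iterations) {
      // Step 2: assign each point to its closest mean
      val clusters = points.groupBy(p => means.minBy(m => math.abs(p - m)))
      // Step 3: the centroid of each cluster becomes the new mean
      means = means.map(m => clusters.get(m).map(c => c.sum / c.size).getOrElse(m))
    }
    means
  }
}
```

For example, `KMeansSketch.cluster(Seq(1.0, 1.1, 5.0, 5.2), 2, 10)` converges to means near 1.05 and 5.1, one per obvious group.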

Page 16: Intro to Apache Spark - Lab

Lab — K-means Clustering: Imports

● import org.apache.spark.mllib.clustering.KMeans
● import org.apache.spark.mllib.linalg.Vectors

● import org.apache.spark.sql._

Page 17: Intro to Apache Spark - Lab

Lab — K-means Clustering: Features

● val featureCols = wines.select("Alcohol", "Hue", "Proline")
● val features = featureCols.rdd.map { case Row(a: Double, h: Double, p: Double) => Vectors.dense(a,h,p) }
● features.cache

Page 18: Intro to Apache Spark - Lab

Lab — K-means Clustering: Training Model

● val numClusters = 2
● val numIterations = 20
● val model = KMeans.train(features, numClusters, numIterations)

Page 19: Intro to Apache Spark - Lab

Lab — K-means Clustering: Finding k

● k can be any number you like!
● WSSSE - Within Set Sum of Squared Errors
● Squared sum of distances between points and their respective centroid
● val wssse = model.computeCost(features)
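The quantity `computeCost` returns can be written out by hand. A minimal sketch in plain Scala (no Spark, 1-D points, hypothetical `WssseSketch` name):

```scala
// WSSSE by hand: for each point, take the squared distance to its
// nearest cluster centre, then sum over all points. This mirrors what
// model.computeCost(features) reports for an MLlib KMeansModel.
object WssseSketch {
  def wssse(points: Seq[Double], centers: Seq[Double]): Double =
    points.map { p =>
      val nearest = centers.minBy(c => math.abs(p - c)) // point's centroid
      (p - nearest) * (p - nearest)                     // squared distance
    }.sum
}
```

With points Seq(1.0, 2.0, 9.0) and centres Seq(1.5, 9.0) this gives 0.25 + 0.25 + 0.0 = 0.5; tighter clusters mean a smaller WSSSE.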

Page 20: Intro to Apache Spark - Lab

Lab — K-means Clustering: Finding k

● Test on k = 1 to 5
● (1 to 5 by 1).map (k => KMeans.train(features, k, numIterations).computeCost(features))
● WSSSE normally decreases as k increases
● Look for the ‘elbow’

Page 21: Intro to Apache Spark - Lab

Lab — K-means Clustering: Training Model

● val numClusters = 1
● val numIterations = 20
● val wssse = KMeans.train(features, numClusters, numIterations).computeCost(features)

Page 22: Intro to Apache Spark - Lab

Lab — K-means Clustering: k = 3

● val numClusters = 3
● val numIterations = 10
● val model = KMeans.train(features, numClusters, numIterations)

Page 23: Intro to Apache Spark - Lab

Lab — K-means Clustering: Obtaining Type Predictions

● val predictions = features.map ( feature => model.predict(feature))

Page 24: Intro to Apache Spark - Lab

Lab — K-means Clustering: Comparing To Labels

● val counts = predictions.map (p => (p,1)).reduceByKey(_+_)
● counts.collect

Page 25: Intro to Apache Spark - Lab

Lab — Next Steps

● Looks good, right? Let’s look at what the labels for each point really are.
● val featureCols = wines.select("Type", "Alcohol", "Hue", "Proline")
● val features = featureCols.rdd.map { case Row(t: Double, a: Double, h: Double, p: Double) => (t,Vectors.dense(a,h,p)) }
● val predictions = features.map ( feature => (feature._1, model.predict(feature._2)))
● val counts = predictions.map (p => (p,1)).reduceByKey(_+_)
● counts.collect
● A slightly different story!

Page 26: Intro to Apache Spark - Lab

Lab — Next Steps

● k-means clustering - useful! But not perfect!
● Try again with more features in the vector and see if it improves the clustering.
● Bayes? Random Forests? All in MLlib and with similar interfaces!

Page 27: Intro to Apache Spark - Lab

Lab — Next Steps

● spark.apache.org

Page 28: Intro to Apache Spark - Lab

Lab — Questions

● ?