CS435 Introduction to Big Data, Fall 2019, Colorado State University
11/4/2019, Week 11-A
Sangmi Lee Pallickara
11/4/2019 CS435 Introduction to Big Data – Fall 2019 W11.A.0
CS435 Introduction to Big Data
PART 1. LARGE SCALE DATA ANALYTICS: IN-MEMORY CLUSTER COMPUTING
Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs435
FAQs
Today’s topics
• In-Memory cluster computing
• Apache Spark
Large Scale Data Analytics
In-Memory Cluster Computing: Apache Spark

Key-Value Pairs: Actions available on Pair RDDs
Actions on pair RDDs (example RDD: {(1,2),(3,4),(3,6)})
• countByKey(): counts the number of elements for each key
  rdd.countByKey() → {(1,1),(3,2)}
• collectAsMap(): collects the result as a map to provide easy lookup at the driver
  rdd.collectAsMap() → Map{(1,2),(3,6)} (a map holds one value per key, so only one of the values for key 3 survives)
• lookup(key): returns all values associated with the provided key
  rdd.lookup(3) → [4,6]
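These semantics can be illustrated without Spark; the plain-Python sketch below mimics countByKey(), collectAsMap(), and lookup() on the same example pairs (the names count_by_key, collect_as_map, and lookup are illustrative stand-ins, not Spark API):

```python
from collections import Counter

# A pair RDD is conceptually a collection of (key, value) tuples.
pairs = [(1, 2), (3, 4), (3, 6)]

# countByKey(): number of elements per key.
count_by_key = dict(Counter(k for k, _ in pairs))

# collectAsMap(): build a map; for a duplicate key, a later value
# overwrites the earlier one, so only one value per key survives.
collect_as_map = dict(pairs)

# lookup(key): every value stored under the given key.
def lookup(key):
    return [v for k, v in pairs if k == key]

print(count_by_key)    # {1: 1, 3: 2}
print(collect_as_map)  # {1: 2, 3: 6}
print(lookup(3))       # [4, 6]
```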
Large Scale Data Analytics
In-Memory Cluster Computing: Apache Spark
Data Partitioning
Why partitioning?
• Consider an application that keeps a large table of user information in memory
  • An RDD of (UserID, UserInfo) pairs
• The application periodically combines this table with a smaller file representing events that happened in the last five minutes

[Figure: user data joined with event data; the join requires network communication]
Using partitionBy()
• Transforms userData into a hash-partitioned RDD → creates a new RDD
• repartition(): runs repartitioning over the RDD in memory

[Figure: user data joined with event data; with partitioning, lookups into the user data become local references and only the event data incurs network communication]
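The hash partitioning behind partitionBy() can be sketched in plain Python (no Spark; partition_of is a hypothetical helper): a key's partition is hash(key) % numPartitions, so all records with the same key land in the same partition, and a join against a co-partitioned RDD can be resolved locally.

```python
# Sketch of hash partitioning: each key is assigned a partition by
# hash(key) % num_partitions, so identical keys always land together.
def partition_of(key, num_partitions):
    return hash(key) % num_partitions

num_partitions = 4
user_data = [("alice", "info-a"), ("bob", "info-b"), ("alice", "info-a2")]

partitions = [[] for _ in range(num_partitions)]
for key, value in user_data:
    partitions[partition_of(key, num_partitions)].append((key, value))

# Both "alice" records share one partition, so a join on a
# co-partitioned RDD needs no shuffle for that key.
```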
Large Scale Data Analytics
In-Memory Cluster Computing: Apache Spark
Map vs. Filter vs. FlatMap
Example: Word Count
https://raw.githubusercontent.com/apache/spark/master/examples/src/main/java/org/apache/spark/examples/JavaWordCount.java

public final class JavaWordCount {
  private static final Pattern SPACE = Pattern.compile(" ");

  public static void main(String[] args) throws Exception {
    if (args.length < 1) {
      System.err.println("Usage: JavaWordCount <file>");
      System.exit(1);
    }
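The pipeline the full JavaWordCount example builds (flatMap to words, mapToPair to (word, 1), reduceByKey to sum the counts) can be sketched without Spark in plain Python:

```python
# Plain-Python sketch of the word-count pipeline: split lines into
# words (flatMap), pair each word with 1 (mapToPair), and sum the
# counts per word (reduceByKey).
lines = ["to be or", "not to be"]

words = [w for line in lines for w in line.split(" ")]   # flatMap
pairs = [(w, 1) for w in words]                          # mapToPair

counts = {}
for word, n in pairs:                                    # reduceByKey
    counts[word] = counts.get(word, 0) + n

print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```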
map() vs. filter() vs. flatMap() [1/3]
• The map() transformation takes in a function and applies it to each element in the RDD; the result of the function becomes the new value of each element in the resulting RDD
• The filter() transformation takes in a function and returns an RDD that contains only the elements that pass the filter() function
• flatMap() is similar to map(), but each input item can be mapped to 0 or more output items (so the function should return a Seq rather than a single item)
map() vs. filter() vs. flatMap() [2/3]
• inputRDD: {1,2,3,4}
• map(x => x*x) → MappedRDD: {1,4,9,16}
• filter(x => x != 1) → filteredRDD: {2,3,4}
• flatMap(x => x to 5) → {1,2,3,4,5,2,3,4,5,3,4,5,4,5}
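A plain-Python sketch (list comprehensions standing in for RDD transformations) reproduces the three results above:

```python
input_rdd = [1, 2, 3, 4]

mapped   = [x * x for x in input_rdd]                  # map(x => x*x)
filtered = [x for x in input_rdd if x != 1]            # filter(x => x != 1)
flat     = [y for x in input_rdd for y in range(x, 6)] # flatMap(x => x to 5)

print(mapped)    # [1, 4, 9, 16]
print(filtered)  # [2, 3, 4]
print(flat)      # [1, 2, 3, 4, 5, 2, 3, 4, 5, 3, 4, 5, 4, 5]
```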
• map() that squares all of the numbers in an RDD

JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4));
JavaRDD<Integer> result = rdd.map(new Function<Integer, Integer>() {
  public Integer call(Integer x) { return x * x; }
});
take(n)
• Returns n elements from the RDD and attempts to minimize the number of partitions it accesses
• It may represent a biased collection
• It does not return the elements in the order you might expect
• Useful for unit testing
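A minimal sketch of take(n)'s partition-aware behavior in plain Python (the take helper and the sample partitions are illustrative, not Spark internals): elements are pulled partition by partition and the scan stops as soon as n have been collected, which is why few partitions are touched and the result need not be in the order you expect.

```python
from itertools import islice

# Collect n elements, scanning as few partitions as possible.
def take(partitions, n):
    out = []
    for part in partitions:                   # scan partitions in order
        out.extend(islice(part, n - len(out)))
        if len(out) == n:                     # stop early once n collected
            break
    return out

parts = [[9, 1], [7, 3], [5]]
print(take(parts, 3))  # [9, 1, 7]  -- the third partition is never read
```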
Large Scale Data Analytics
In-Memory Cluster Computing: Apache Spark
Persistence
Hard-coded example:

public static void main(String[] args) {
  SparkSession spark = SparkSession
  ...
  for (Tuple2<?,?> t : output) {
    System.out.println(t._1() + "\t" + t._2());
  }
  spark.stop();
}
Persistence levels
• MEMORY_ONLY: high space used, low CPU time; in memory: yes, on disk: no
• MEMORY_ONLY_SER: low space used, high CPU time; in memory: yes, on disk: no. Stores the RDD as serialized Java objects (one byte array per partition)
• MEMORY_AND_DISK: high space used, medium CPU time; in memory: some, on disk: some. Spills to disk if there is too much data to fit in memory
• MEMORY_AND_DISK_SER: low space used, high CPU time; in memory: some, on disk: some. Spills to disk if there is too much data to fit in memory; stores a serialized representation in memory
• DISK_ONLY: low space used, high CPU time; in memory: no, on disk: yes
Large Scale Data Analytics
In-Memory Cluster Computing: Apache Spark
Software stack
Spark stack of libraries
• SQL and DataFrames (Spark SQL)
• Spark Streaming
• Machine learning (Spark MLlib)
• GraphX (Spark GraphX)
All four libraries sit on top of the Apache Spark core.
Large Scale Data Analytics
In-Memory Cluster Computing: Apache Spark
Predicting Forest Cover with Decision Trees and Forests
Regressions and Classifications
• Regression analysis
  • A statistical process for estimating the relationships among variables
  • Dependent variable vs. independent variables ("predictors")
  • How the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed
  • Predicts numerical values
    • Income, temperature, size
• Classification
  • Identifying which of a set of categories a new observation belongs to
  • Predicts a label or category
    • Spam, picture of a cat
Features and Feature Vectors
• The input and output of regression and classification
• Example: predicting tomorrow's temperature given today's weather
  • Features (dimensions, predictors, or variables) of today's weather
    • Today's high temperature
    • Today's low temperature
    • Today's average humidity
    • Whether it's cloudy, rainy, or clear today
    • The number of weather forecasters predicting a cold snap tomorrow
  • Each of these features can be quantified
    • (13.1, 19.0, 0.73, cloudy, 1)
  • This ordered list of features is a feature vector
Features and Feature Vectors
• Feature vector: these features together, in order
  • (13.1, 19.0, 0.73, cloudy, 1)
  • Features do NOT need to be the same type
• Categorical features: features that have no ordering
  • Cloudy, clear
• Numeric features: features that can be quantified by a number and have a meaningful ordering
  • 23F, 56F (23F < 56F)
Training Examples
• A learning algorithm needs to train on data in order to make predictions
  • Inputs
  • Correct outputs (from historical data)
  • "One day, the weather was between 12 and 16 degrees Celsius, with 10% humidity, clear, with no forecast of a cold snap, and the following day the high temperature was 17.2 degrees"
  • Target (output): 17.2 degrees
• The feature vector often includes the target value
  • (13.1, 19.0, 0.73, cloudy, 1, 17.2)
Decision Trees and Forests
• Decision trees can naturally handle both categorical and numeric features
• Easy to parallelize
• Robust to outliers
  • A few extreme and possibly erroneous data points may not affect predictions at all
• Random decision forests
  • An extension of the decision tree algorithm
Decision tree
• Spark MLlib's DecisionTree and RandomForest implementations
• Each decision leads to one of two results: a prediction or another decision

Example decision tree ("Is the food spoiled?"):
• Has the expiration date passed?
  • No → not spoiled
  • Yes → Was the expiration date more than 3 days ago?
    • Yes → spoiled
    • No → Does it smell funny?
      • Yes → spoiled
      • No → not spoiled
Example: Finding a good pet

Name         Weight (kg)  # of legs  Color  Good pet?
Fido         20.5         4          Brown  Y
Mr. Slither  3.1          0          Green  N
Nemo         0.2          0          Tan    Y
Dumbo        1390.8       4          Grey   N
Kitty        12.1         4          Grey   Y
Jim          150.9        2          Tan    N
Millie       0.1          100        Brown  N
McPigeon     1.0          2          Grey   N
Spot         10.0         4          Brown  Y
Decision tree for the "Finding a good pet" example:
• Weight >= 100 kg?
  • Yes → not suitable
  • No → Is the color green?
    • Yes → not suitable
    • No → suitable
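This two-rule tree can be sketched as a plain-Python function (good_pet is a hypothetical name; True stands for "Y" in the table):

```python
# Minimal sketch of the two-rule decision tree above: each internal
# node tests one feature and routes to a child node or a prediction.
def good_pet(weight_kg, color):
    if weight_kg >= 100:       # Weight >= 100 kg?
        return False           # Yes -> not suitable
    if color == "Green":       # Is the color green?
        return False           # Yes -> not suitable
    return True                # otherwise -> suitable

print(good_pet(20.5, "Brown"))   # Fido -> True
print(good_pet(3.1, "Green"))    # Mr. Slither -> False
print(good_pet(1390.8, "Grey"))  # Dumbo -> False
```

Note that such a simple tree still mislabels some rows of the table, e.g. Millie (0.1 kg, Brown, not a good pet); reducing errors like that is what deeper trees and forests are for.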
Covtype dataset
• Dataset with records of the types of forest covering parcels of land in Colorado, USA
• Each example contains several features describing each parcel of land
  • Elevation, slope, distance to water, shade, and soil type
• The forest cover type is to be predicted from the rest of the features
  • 54 features in total
• Used in a Kaggle competition
• Includes categorical and numeric features
• 581,012 examples
Attribute information (Name / Data Type / Measurement / Description)
• Elevation / quantitative / meters / Elevation in meters
• Aspect / quantitative / azimuth / Aspect in degrees azimuth
• Slope / quantitative / degrees / Slope in degrees
• Horizontal_Distance_To_Hydrology / quantitative / meters / Horz Dist to nearest surface water features
• Vertical_Distance_To_Hydrology / quantitative / meters / Vert Dist to nearest surface water features
• Horizontal_Distance_To_Roadways / quantitative / meters / Horz Dist to nearest roadway
• Hillshade_9am / quantitative / 0 to 255 index / Hillshade index at 9am, summer solstice
• Hillshade_Noon / quantitative / 0 to 255 index / Hillshade index at noon, summer solstice
• Hillshade_3pm / quantitative / 0 to 255 index / Hillshade index at 3pm, summer solstice
• Horizontal_Distance_To_Fire_Points / quantitative / meters / Horz Dist to nearest wildfire ignition points
• Wilderness_Area (4 binary columns) / qualitative / 0 (absence) or 1 (presence) / Wilderness area designation
• Soil_Type (40 binary columns) / qualitative / 0 (absence) or 1 (presence) / Soil type designation
• Cover_Type (7 types) / integer / 1 to 7 / Forest cover type designation
Preparing data
• The covtype.data file should be extracted and copied into HDFS
  • The file is available at /user/ds/
• LabeledPoint
  • The Spark MLlib abstraction for a feature vector
  • Consists of a Spark MLlib Vector of features and a target value (label)
  • LabeledPoint stores only numeric features
    • It can be used with categorical features, with appropriate encoding
Using LabeledPoint for the categorical features
• One-hot encoding
  • One categorical feature that takes on N distinct values becomes N numeric features, each taking on the value 0 or 1
  • Exactly one of the N values has value 1 and the others are 0
  • Cloudy, rainy, or clear:
    • Cloudy: 1,0,0
    • Rainy: 0,1,0
    • Clear: 0,0,1
• 1-of-n encoding
  • Cloudy: 1
  • Rainy: 2
  • Clear: 3
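Both encodings can be sketched in a few lines of plain Python (one_hot and one_of_n are illustrative helpers, not MLlib API):

```python
# Sketch of the two encodings for a categorical feature with N values.
categories = ["cloudy", "rainy", "clear"]

def one_hot(value):
    # N numeric features; exactly one of them is 1, the rest are 0.
    return [1 if c == value else 0 for c in categories]

def one_of_n(value):
    # A single numeric feature taking the values 1..N.
    return categories.index(value) + 1

print(one_hot("rainy"))   # [0, 1, 0]
print(one_of_n("clear"))  # 3
```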
Categorical values in the Covtype data set
• The covtype.info file says that four of the columns are actually a one-hot encoding of a single categorical feature
  • Wilderness_Area, with four values
• Likewise, 40 of the columns are really one Soil_Type categorical feature
• The target itself is a categorical value, encoded as the values 1 to 7
• The remaining features are numeric features in various units, like meters, degrees, or a qualitative "index" value
A First Decision Tree
• Spark MLlib requires input in the form of LabeledPoint objects

import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.regression._

val rawData = sc.textFile("hdfs:///user/ds/covtype.data")

val data = rawData.map { line =>
  val values = line.split(',').map(_.toDouble)
  val featureVector = Vectors.dense(values.init)
  val label = values.last - 1
  LabeledPoint(label, featureVector)
}
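The parsing step can be mirrored in plain Python (the sample line is a shortened, made-up record, not real Covtype data): all columns but the last become the feature vector, and the last column (the cover type, 1 to 7) becomes a 0-based label, as DecisionTree expects.

```python
# Plain-Python sketch of the parsing step: values.init becomes the
# feature vector, values.last - 1 becomes the 0-based label.
def parse_line(line):
    values = [float(v) for v in line.split(",")]
    features = values[:-1]      # values.init
    label = values[-1] - 1      # values.last - 1
    return label, features

# Hypothetical shortened record; a real Covtype row has 55 columns.
label, features = parse_line("2596,51,3,258,0,510,5")
print(label)     # 4.0  (cover type 5 -> label 4)
print(features)  # [2596.0, 51.0, 3.0, 258.0, 0.0, 510.0]
```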
Splitting data
• Training, cross-validation, and test sets
  • 80% of the data for training and 10% each for cross-validation and test
• The training and CV sets are used to choose a good setting of hyperparameters for this data set
• The test set is used to produce an unbiased evaluation of the expected accuracy of a model built with those hyperparameters

val Array(trainData, cvData, testData) = data.randomSplit(Array(0.8, 0.1, 0.1))
trainData.cache()
cvData.cache()
testData.cache()
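randomSplit's per-record behavior can be sketched in plain Python (random_split is an illustrative stand-in, not the Spark API): each record is routed to one split by a random draw against the cumulative weights, so the sizes are only approximately 80/10/10.

```python
import random

# Sketch of randomSplit(Array(0.8, 0.1, 0.1)): route each record
# independently, so split sizes are approximate, not exact.
def random_split(data, weights, seed=42):
    rng = random.Random(seed)
    bounds, acc = [], 0.0
    for w in weights:              # cumulative boundaries: 0.8, 0.9, 1.0
        acc += w
        bounds.append(acc)
    parts = [[] for _ in weights]
    for record in data:
        r = rng.random()
        for i, b in enumerate(bounds):
            if r <= b:
                parts[i].append(record)
                break
    return parts

train, cv, test = random_split(range(10000), [0.8, 0.1, 0.1])
print(len(train), len(cv), len(test))  # roughly 8000, 1000, 1000
```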
Building a DecisionTreeModel on the training set
• Build a DecisionTreeModel on the training set with some default arguments
• Compute some metrics about the resulting model using the CV set

import org.apache.spark.mllib.evaluation._
import org.apache.spark.mllib.tree._
import org.apache.spark.mllib.tree.model._
import org.apache.spark.rdd._
Converting one-hot encoding to 1-of-n encoding [3/3]
Does the decision tree algorithm build the same tree every time?
• Over N values, there are 2^N - 2 possible decision rules
• Decision trees use several heuristics to narrow down the rules to be considered
  • The process of picking rules involves some randomness
  • Only a few features, picked at random, are looked at each time
  • Only values from a random subset of the training data are looked at
  • This trades a bit of accuracy for a lot of speed
• The decision tree algorithm won't build the same tree every time
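The 2^N - 2 count can be checked directly in plain Python: a decision rule on a categorical feature routes some nonempty, proper subset of the N values to one side and the rest to the other (candidate_rules is an illustrative helper).

```python
from itertools import combinations

# For a categorical feature with N distinct values, a decision rule
# sends a nonempty, proper subset of the values one way and the rest
# the other way: that's 2^N - 2 candidate subsets.
def candidate_rules(values):
    n = len(values)
    subsets = [set(c)
               for r in range(1, n)              # nonempty, proper subsets
               for c in combinations(values, r)]
    assert len(subsets) == 2 ** n - 2
    return subsets

rules = candidate_rules(["cloudy", "rainy", "clear"])
print(len(rules))  # 6  (2^3 - 2)
```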
RandomForest
• Number of trees to build
  • Here, 20
• "auto"
  • The strategy for choosing which features to evaluate at each level of the tree
• The random decision forest implementation will NOT even consider every feature as the basis of a decision rule
  • Only a subset of all features