CS435 Introduction to Big Data, Fall 2019, Colorado State University
11/4/2019, Week 11-A
Sangmi Lee Pallickara
11/4/2019 CS435 Introduction to Big Data – Fall 2019 W11.A.0
CS435 Introduction to Big Data
PART 1. LARGE SCALE DATA ANALYTICS: IN-MEMORY CLUSTER COMPUTING
Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs435
FAQs
Today’s topics
• In-Memory cluster computing
• Apache Spark
Large Scale Data Analytics
In-Memory Cluster Computing: Apache Spark

Key-Value Pairs: Actions available on Pair RDDs
Actions on pair RDDs (example RDD: {(1,2),(3,4),(3,6)})
• countByKey(): counts the number of elements for each key
  rdd.countByKey() → {(1,1),(3,2)}
• collectAsMap(): collects the result as a map to provide easy lookup at the driver
  rdd.collectAsMap() → Map{(1,2),(3,6)} (a map holds one value per key, so only one of the values for key 3 survives)
• lookup(key): returns all values associated with the provided key
  rdd.lookup(3) → [4,6]
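These semantics can be illustrated without Spark; the plain-Python sketch below mimics countByKey(), collectAsMap(), and lookup() on the same example pairs (the names count_by_key, collect_as_map, and lookup are illustrative stand-ins, not Spark API):

```python
from collections import Counter

# A pair RDD is conceptually a collection of (key, value) tuples.
pairs = [(1, 2), (3, 4), (3, 6)]

# countByKey(): number of elements per key.
count_by_key = dict(Counter(k for k, _ in pairs))

# collectAsMap(): build a map; for a duplicate key, a later value
# overwrites the earlier one, so only one value per key survives.
collect_as_map = dict(pairs)

# lookup(key): every value stored under the given key.
def lookup(key):
    return [v for k, v in pairs if k == key]

print(count_by_key)    # {1: 1, 3: 2}
print(collect_as_map)  # {1: 2, 3: 6}
print(lookup(3))       # [4, 6]
```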
Large Scale Data Analytics
In-Memory Cluster Computing: Apache Spark
Data Partitioning
Why partitioning?
• Consider an application that keeps a large table of user information in memory
  • An RDD of (UserID, UserInfo) pairs
• The application periodically combines this table with a smaller file representing events that happened in the last five minutes

[Figure: user data joined with event data; the join requires network communication]
Using partitionBy()
• Transforms userData into a hash-partitioned RDD → creates a new RDD
• repartition(): runs repartitioning over the RDD in memory

[Figure: user data joined with event data; with partitioning, lookups into the user data become local references and only the event data incurs network communication]
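The hash partitioning behind partitionBy() can be sketched in plain Python (no Spark; partition_of is a hypothetical helper): a key's partition is hash(key) % numPartitions, so all records with the same key land in the same partition, and a join against a co-partitioned RDD can be resolved locally.

```python
# Sketch of hash partitioning: each key is assigned a partition by
# hash(key) % num_partitions, so identical keys always land together.
def partition_of(key, num_partitions):
    return hash(key) % num_partitions

num_partitions = 4
user_data = [("alice", "info-a"), ("bob", "info-b"), ("alice", "info-a2")]

partitions = [[] for _ in range(num_partitions)]
for key, value in user_data:
    partitions[partition_of(key, num_partitions)].append((key, value))

# Both "alice" records share one partition, so a join on a
# co-partitioned RDD needs no shuffle for that key.
```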
Large Scale Data Analytics
In-Memory Cluster Computing: Apache Spark
Map vs. Filter vs. FlatMap
Example: Word Count
https://raw.githubusercontent.com/apache/spark/master/examples/src/main/java/org/apache/spark/examples/JavaWordCount.java

public final class JavaWordCount {
  private static final Pattern SPACE = Pattern.compile(" ");

  public static void main(String[] args) throws Exception {
    if (args.length < 1) {
      System.err.println("Usage: JavaWordCount <file>");
      System.exit(1);
    }
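The pipeline the full JavaWordCount example builds (flatMap to words, mapToPair to (word, 1), reduceByKey to sum the counts) can be sketched without Spark in plain Python:

```python
# Plain-Python sketch of the word-count pipeline: split lines into
# words (flatMap), pair each word with 1 (mapToPair), and sum the
# counts per word (reduceByKey).
lines = ["to be or", "not to be"]

words = [w for line in lines for w in line.split(" ")]   # flatMap
pairs = [(w, 1) for w in words]                          # mapToPair

counts = {}
for word, n in pairs:                                    # reduceByKey
    counts[word] = counts.get(word, 0) + n

print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```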
map() vs. filter() vs. flatMap() [1/3]
• The map() transformation takes in a function and applies it to each element in the RDD; the result of the function becomes the new value of each element in the resulting RDD
• The filter() transformation takes in a function and returns an RDD that contains only the elements that pass the filter() function
• flatMap() is similar to map(), but each input item can be mapped to 0 or more output items (so the function should return a Seq rather than a single item)
map() vs. filter() vs. flatMap() [2/3]
• inputRDD: {1,2,3,4}
• map(x => x*x) → MappedRDD: {1,4,9,16}
• filter(x => x != 1) → filteredRDD: {2,3,4}
• flatMap(x => x to 5) → {1,2,3,4,5,2,3,4,5,3,4,5,4,5}
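A plain-Python sketch (list comprehensions standing in for RDD transformations) reproduces the three results above:

```python
input_rdd = [1, 2, 3, 4]

mapped   = [x * x for x in input_rdd]                  # map(x => x*x)
filtered = [x for x in input_rdd if x != 1]            # filter(x => x != 1)
flat     = [y for x in input_rdd for y in range(x, 6)] # flatMap(x => x to 5)

print(mapped)    # [1, 4, 9, 16]
print(filtered)  # [2, 3, 4]
print(flat)      # [1, 2, 3, 4, 5, 2, 3, 4, 5, 3, 4, 5, 4, 5]
```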
• map() that squares all of the numbers in an RDD

JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4));
JavaRDD<Integer> result = rdd.map(new Function<Integer, Integer>() {
  public Integer call(Integer x) { return x * x; }
});
take(n)
• Returns n elements from the RDD and attempts to minimize the number of partitions it accesses
• It may represent a biased collection
• It does not return the elements in the order you might expect
• Useful for unit testing
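A minimal sketch of take(n)'s partition-aware behavior in plain Python (the take helper and the sample partitions are illustrative, not Spark internals): elements are pulled partition by partition and the scan stops as soon as n have been collected, which is why few partitions are touched and the result need not be in the order you expect.

```python
from itertools import islice

# Collect n elements, scanning as few partitions as possible.
def take(partitions, n):
    out = []
    for part in partitions:                   # scan partitions in order
        out.extend(islice(part, n - len(out)))
        if len(out) == n:                     # stop early once n collected
            break
    return out

parts = [[9, 1], [7, 3], [5]]
print(take(parts, 3))  # [9, 1, 7]  -- the third partition is never read
```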
Large Scale Data Analytics
In-Memory Cluster Computing: Apache Spark
Persistence
Hard-coded example:

public static void main(String[] args) {
  SparkSession spark = SparkSession
  ...
  for (Tuple2<?,?> t : output) {
    System.out.println(t._1() + "\t" + t._2());
  }
  spark.stop();
}
Persistence levels
• MEMORY_ONLY: high space used, low CPU time; in memory: yes, on disk: no
• MEMORY_ONLY_SER: low space used, high CPU time; in memory: yes, on disk: no. Stores the RDD as serialized Java objects (one byte array per partition)
• MEMORY_AND_DISK: high space used, medium CPU time; in memory: some, on disk: some. Spills to disk if there is too much data to fit in memory
• MEMORY_AND_DISK_SER: low space used, high CPU time; in memory: some, on disk: some. Spills to disk if there is too much data to fit in memory; stores a serialized representation in memory
• DISK_ONLY: low space used, high CPU time; in memory: no, on disk: yes
Large Scale Data Analytics
In-Memory Cluster Computing: Apache Spark
Software stack
Spark stack of libraries
• SQL and DataFrames (Spark SQL)
• Spark Streaming
• Machine learning (Spark MLlib)
• GraphX (Spark GraphX)
All four libraries sit on top of the Apache Spark core.
Large Scale Data Analytics
In-Memory Cluster Computing: Apache Spark
Predicting Forest Cover with Decision Trees and Forests
Regressions and Classifications
• Regression analysis
  • A statistical process for estimating the relationships among variables
  • Dependent variable vs. independent variables ("predictors")
  • How the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed
  • Predicts numerical values
    • Income, temperature, size
• Classification
  • Identifying which of a set of categories a new observation belongs to
  • Predicts a label or category
    • Spam, picture of a cat
Features and Feature Vectors
• The input and output of regression and classification
• Example: predicting tomorrow's temperature given today's weather
  • Features (dimensions, predictors, or variables) of today's weather
    • Today's high temperature
    • Today's low temperature
    • Today's average humidity
    • Whether it's cloudy, rainy, or clear today
    • The number of weather forecasters predicting a cold snap tomorrow
  • Each of these features can be quantified
    • (13.1, 19.0, 0.73, cloudy, 1)
  • This ordered list of features is a feature vector
Features and Feature Vectors
• Feature vector: these features together, in order
  • (13.1, 19.0, 0.73, cloudy, 1)
  • Features do NOT need to be the same type
• Categorical features: features that have no ordering
  • Cloudy, clear
• Numeric features: features that can be quantified by a number and have a meaningful ordering
  • 23F, 56F (23F < 56F)
Training Examples
• A learning algorithm needs to train on data in order to make predictions
  • Inputs
  • Correct outputs (from historical data)
  • "One day, the weather was between 12 and 16 degrees Celsius, with 10% humidity, clear, with no forecast of a cold snap, and the following day the high temperature was 17.2 degrees"
  • Target (output): 17.2 degrees
• The feature vector often includes the target value
  • (13.1, 19.0, 0.73, cloudy, 1, 17.2)
Decision Trees and Forests
• Decision trees can naturally handle both categorical and numeric features
• Easy to parallelize
• Robust to outliers
  • A few extreme and possibly erroneous data points may not affect predictions at all
• Random decision forests
  • An extension of the decision tree algorithm
Decision tree
• Spark MLlib's DecisionTree and RandomForest implementations
• Each decision leads to one of two results: a prediction or another decision

Example decision tree ("Is the food spoiled?"):
• Has the expiration date passed?
  • No → not spoiled
  • Yes → Was the expiration date more than 3 days ago?
    • Yes → spoiled
    • No → Does it smell funny?
      • Yes → spoiled
      • No → not spoiled
Example: Finding a good pet

Name         Weight (kg)  # of legs  Color  Good pet?
Fido         20.5         4          Brown  Y
Mr. Slither  3.1          0          Green  N
Nemo         0.2          0          Tan    Y
Dumbo        1390.8       4          Grey   N
Kitty        12.1         4          Grey   Y
Jim          150.9        2          Tan    N
Millie       0.1          100        Brown  N
McPigeon     1.0          2          Grey   N
Spot         10.0         4          Brown  Y
Decision tree for the "Finding a good pet" example:
• Weight >= 100 kg?
  • Yes → not suitable
  • No → Is the color green?
    • Yes → not suitable
    • No → suitable
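This two-rule tree can be sketched as a plain-Python function (good_pet is a hypothetical name; True stands for "Y" in the table):

```python
# Minimal sketch of the two-rule decision tree above: each internal
# node tests one feature and routes to a child node or a prediction.
def good_pet(weight_kg, color):
    if weight_kg >= 100:       # Weight >= 100 kg?
        return False           # Yes -> not suitable
    if color == "Green":       # Is the color green?
        return False           # Yes -> not suitable
    return True                # otherwise -> suitable

print(good_pet(20.5, "Brown"))   # Fido -> True
print(good_pet(3.1, "Green"))    # Mr. Slither -> False
print(good_pet(1390.8, "Grey"))  # Dumbo -> False
```

Note that such a simple tree still mislabels some rows of the table, e.g. Millie (0.1 kg, Brown, not a good pet); reducing errors like that is what deeper trees and forests are for.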
Covtype dataset
• Dataset with records of the types of forest covering parcels of land in Colorado, USA
• Each example contains several features describing each parcel of land
  • Elevation, slope, distance to water, shade, and soil type
• The forest cover type is to be predicted from the rest of the features
  • 54 features in total
• Used in a Kaggle competition
• Includes categorical and numeric features
• 581,012 examples
Attribute information (Name / Data Type / Measurement / Description)
• Elevation / quantitative / meters / Elevation in meters
• Aspect / quantitative / azimuth / Aspect in degrees azimuth
• Slope / quantitative / degrees / Slope in degrees
• Horizontal_Distance_To_Hydrology / quantitative / meters / Horz Dist to nearest surface water features
• Vertical_Distance_To_Hydrology / quantitative / meters / Vert Dist to nearest surface water features
• Horizontal_Distance_To_Roadways / quantitative / meters / Horz Dist to nearest roadway
• Hillshade_9am / quantitative / 0 to 255 index / Hillshade index at 9am, summer solstice
• Hillshade_Noon / quantitative / 0 to 255 index / Hillshade index at noon, summer solstice
• Hillshade_3pm / quantitative / 0 to 255 index / Hillshade index at 3pm, summer solstice
• Horizontal_Distance_To_Fire_Points / quantitative / meters / Horz Dist to nearest wildfire ignition points
• Wilderness_Area (4 binary columns) / qualitative / 0 (absence) or 1 (presence) / Wilderness area designation
• Soil_Type (40 binary columns) / qualitative / 0 (absence) or 1 (presence) / Soil type designation
• Cover_Type (7 types) / integer / 1 to 7 / Forest cover type designation
Preparing data
• The covtype.data file should be extracted and copied into HDFS
  • The file is available at /user/ds/
• LabeledPoint
  • The Spark MLlib abstraction for a feature vector
  • Consists of a Spark MLlib Vector of features and a target value (label)
  • LabeledPoint stores only numeric features
    • It can be used with categorical features, with appropriate encoding
Using LabeledPoint for the categorical features
• One-hot encoding
  • One categorical feature that takes on N distinct values becomes N numeric features, each taking on the value 0 or 1
  • Exactly one of the N values has value 1 and the others are 0
  • Cloudy, rainy, or clear:
    • Cloudy: 1,0,0
    • Rainy: 0,1,0
    • Clear: 0,0,1
• 1-of-n encoding
  • Cloudy: 1
  • Rainy: 2
  • Clear: 3
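Both encodings can be sketched in a few lines of plain Python (one_hot and one_of_n are illustrative helpers, not MLlib API):

```python
# Sketch of the two encodings for a categorical feature with N values.
categories = ["cloudy", "rainy", "clear"]

def one_hot(value):
    # N numeric features; exactly one of them is 1, the rest are 0.
    return [1 if c == value else 0 for c in categories]

def one_of_n(value):
    # A single numeric feature taking the values 1..N.
    return categories.index(value) + 1

print(one_hot("rainy"))   # [0, 1, 0]
print(one_of_n("clear"))  # 3
```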
Categorical values in the Covtype data set
• The covtype.info file says that four of the columns are actually a one-hot encoding of a single categorical feature
  • Wilderness_Area, with four values
• Likewise, 40 of the columns are really one Soil_Type categorical feature
• The target itself is a categorical value, encoded as the values 1 to 7
• The remaining features are numeric features in various units, like meters, degrees, or a qualitative "index" value
A First Decision Tree
• Spark MLlib requires input in the form of LabeledPoint objects

import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.regression._

val rawData = sc.textFile("hdfs:///user/ds/covtype.data")

val data = rawData.map { line =>
  val values = line.split(',').map(_.toDouble)
  val featureVector = Vectors.dense(values.init)
  val label = values.last - 1
  LabeledPoint(label, featureVector)
}
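The parsing step can be mirrored in plain Python (the sample line is a shortened, made-up record, not real Covtype data): all columns but the last become the feature vector, and the last column (the cover type, 1 to 7) becomes a 0-based label, as DecisionTree expects.

```python
# Plain-Python sketch of the parsing step: values.init becomes the
# feature vector, values.last - 1 becomes the 0-based label.
def parse_line(line):
    values = [float(v) for v in line.split(",")]
    features = values[:-1]      # values.init
    label = values[-1] - 1      # values.last - 1
    return label, features

# Hypothetical shortened record; a real Covtype row has 55 columns.
label, features = parse_line("2596,51,3,258,0,510,5")
print(label)     # 4.0  (cover type 5 -> label 4)
print(features)  # [2596.0, 51.0, 3.0, 258.0, 0.0, 510.0]
```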
Splitting data
• Training, cross-validation, and test sets
  • 80% of the data for training and 10% each for cross-validation and test
• The training and CV sets are used to choose a good setting of hyperparameters for this data set
• The test set is used to produce an unbiased evaluation of the expected accuracy of a model built with those hyperparameters

val Array(trainData, cvData, testData) = data.randomSplit(Array(0.8, 0.1, 0.1))
trainData.cache()
cvData.cache()
testData.cache()
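randomSplit's per-record behavior can be sketched in plain Python (random_split is an illustrative stand-in, not the Spark API): each record is routed to one split by a random draw against the cumulative weights, so the sizes are only approximately 80/10/10.

```python
import random

# Sketch of randomSplit(Array(0.8, 0.1, 0.1)): route each record
# independently, so split sizes are approximate, not exact.
def random_split(data, weights, seed=42):
    rng = random.Random(seed)
    bounds, acc = [], 0.0
    for w in weights:              # cumulative boundaries: 0.8, 0.9, 1.0
        acc += w
        bounds.append(acc)
    parts = [[] for _ in weights]
    for record in data:
        r = rng.random()
        for i, b in enumerate(bounds):
            if r <= b:
                parts[i].append(record)
                break
    return parts

train, cv, test = random_split(range(10000), [0.8, 0.1, 0.1])
print(len(train), len(cv), len(test))  # roughly 8000, 1000, 1000
```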
Building a DecisionTreeModel on the training set
• Build a DecisionTreeModel on the training set with some default arguments
• Compute some metrics about the resulting model using the CV set

import org.apache.spark.mllib.evaluation._
import org.apache.spark.mllib.tree._
import org.apache.spark.mllib.tree.model._
import org.apache.spark.rdd._
Converting one-hot encoding to 1-of-n encoding [3/3]
Does the decision tree algorithm build the same tree every time?
• Over N values, there are 2^N - 2 possible decision rules
• Decision trees use several heuristics to narrow down the rules to be considered
  • The process of picking rules involves some randomness
  • Only a few features, picked at random, are looked at each time
  • Only values from a random subset of the training data are looked at
  • This trades a bit of accuracy for a lot of speed
• The decision tree algorithm won't build the same tree every time
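The 2^N - 2 count can be checked directly in plain Python: a decision rule on a categorical feature routes some nonempty, proper subset of the N values to one side and the rest to the other (candidate_rules is an illustrative helper).

```python
from itertools import combinations

# For a categorical feature with N distinct values, a decision rule
# sends a nonempty, proper subset of the values one way and the rest
# the other way: that's 2^N - 2 candidate subsets.
def candidate_rules(values):
    n = len(values)
    subsets = [set(c)
               for r in range(1, n)              # nonempty, proper subsets
               for c in combinations(values, r)]
    assert len(subsets) == 2 ** n - 2
    return subsets

rules = candidate_rules(["cloudy", "rainy", "clear"])
print(len(rules))  # 6  (2^3 - 2)
```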
RandomForest
• Number of trees to build
  • Here, 20
• "auto"
  • The strategy for choosing which features to evaluate at each level of the tree
• The random decision forest implementation will NOT even consider every feature as the basis of a decision rule
  • Only a subset of all features