CS435 Introduction to Big Data, Fall 2019, Colorado State University
Week 1-B, 8/28/2019, Sangmi Lee Pallickara


Computer Science Department Picnic

When: Saturday, August 31st

Time: 1pm-4pm

Where: City Park Shelter #7

Welcome to the 2019-2020 academic year!

Meet your faculty, department staff, and fellow students in a social setting. Food and drink will be provided.


CS435 Introduction to Big Data

PART 0. INTRODUCTION TO BIG DATA

Sangmi Lee Pallickara

Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs435


FAQs

• PA0 has been posted
  • Sept. 5, 5:00PM via Canvas
  • Individual submission (no team submission)
  • Grading: interview with GTA(s)
• TP0
  • Sept. 4, 5:00PM via Canvas
• Accommodation request, honor student
  • Contact me by Sept. 6, 2019
• No laptops in the class
  • Except the last row
• Readings
  • Reading research papers
  • Keshav's "How to Read a Paper"
  • "How to Read and Understand a Scientific Paper: A Step-by-Step Guide for Non-Scientists"


Topics

• Introduction to Big Data Analytics
  • Data Collection, Sampling, and Preprocessing
• Introduction to MapReduce


Part 0. Introduction

Big Data Analytics - Data Collection, Sampling, and Preprocessing


This Material is Built Based on,

• Analytics in a Big Data World: The Essential Guide to Data Science and Its Applications, Bart Baesens, 2014, Wiley





Analytics Process Model

• Data selection and preprocessing is the most time-consuming step
  • It usually takes around 80% of the total time needed to build an analytical model

Analytics in a Big Data World: The Essential Guide to Data Science and Its Applications, Bart Baesens, 2014, Wiley


Types of Analytics

• Analytics is a term that is often used interchangeably with
  • Data science
  • Data mining
  • Knowledge discovery
• Descriptive analytics
  • No target variable
  • e.g. clustering, association rules
• Predictive analytics
  • A target variable is typically available
  • e.g. linear/logistic regression, decision trees, neural networks, support vector machines


Types of Data Sources

• Transactions
  • Structured, low-level, detailed information
  • Customer transactions
    • Purchase, claim, cash transfer, credit card payment
  • Stored in massive online transaction processing (OLTP) relational databases
  • Can be summarized over longer time horizons (e.g. averages, relative trends, max/min values)
• Unstructured data embedded in text documents
  • Emails, web pages, claim forms
  • Requires extensive preprocessing
• Qualitative, expert-based data
  • Requires subject matter experts' (SME) analysis
• Scientific data


Sampling

• Taking a subset of data for analytics
  • Generating hypotheses
  • Model selection
  • Feature selection
  • Speculative process
  • Building the analytics model
• Stratified sampling
  • Taking samples according to predefined strata
  • e.g. fraud detection with a very skewed data set (99 percent non-fraud customers, 1 percent fraud customers)
  • The sample should contain the same percentage of fraud customers as the original data
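A minimal sketch of stratified sampling for the fraud example above, in plain Java. The class name `StratifiedSampler`, the boolean fraud label, and the seed parameter are assumptions made for illustration; they are not part of the course material.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class StratifiedSampler {

    /**
     * Draws the same fraction from each stratum (fraud / non-fraud) so the
     * sample keeps the class proportions of the full data set.
     */
    static List<Boolean> sample(List<Boolean> isFraud, double fraction, long seed) {
        List<Boolean> fraud = new ArrayList<>();
        List<Boolean> nonFraud = new ArrayList<>();
        for (Boolean label : isFraud) {
            (label ? fraud : nonFraud).add(label);   // split records into the two strata
        }
        Random rng = new Random(seed);
        List<Boolean> result = new ArrayList<>();
        result.addAll(take(fraud, fraction, rng));
        result.addAll(take(nonFraud, fraction, rng));
        return result;
    }

    /** Randomly picks round(size * fraction) records from one stratum. */
    private static List<Boolean> take(List<Boolean> stratum, double fraction, Random rng) {
        List<Boolean> shuffled = new ArrayList<>(stratum);
        Collections.shuffle(shuffled, rng);
        int n = (int) Math.round(stratum.size() * fraction);
        return new ArrayList<>(shuffled.subList(0, n));
    }

    public static void main(String[] args) {
        List<Boolean> data = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            data.add(i < 10);                        // 1 percent fraud, 99 percent non-fraud
        }
        List<Boolean> s = sample(data, 0.1, 42L);
        long fraudCount = s.stream().filter(b -> b).count();
        System.out.println(s.size() + " sampled, " + fraudCount + " fraud");
    }
}
```

Sampling each stratum separately is what keeps the rare class from disappearing, which a plain random sample of the same size could easily allow.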


Types of Data Elements

• Continuous
  • Data elements that are defined on an interval that can be limited or unlimited
  • e.g. income, sales, temperature
• Categorical
  • Nominal
    • Data elements that can only take on a limited set of values with no meaningful ordering between them
    • e.g. marital status, profession, purpose of loan
  • Ordinal
    • Data elements that can only take on a limited set of values with a meaningful ordering between them
    • e.g. credit rating, age coded as young, middle-aged, and old
  • Binary
    • Data elements that can only take on two values
    • e.g. having a child, allowed to drive


Missing Values

• Missing values can occur for various reasons
  • The information can be non-applicable
  • The information can be undisclosed
  • The information can be unavailable





Missing Values -- continued

• Replace (impute)
  • Replaces the missing value with a computed/selected value
  • Imputation algorithm examples
    • Hot-deck: replaces with a value from a randomly selected similar record
    • Cold-deck: selects the replacement from another dataset
    • Mean substitution: replaces with the mean of that variable over all other cases
    • Regression: predicts missing values of a variable based on other variables
• Delete
  • Deletes observations with lots of missing values
  • This assumes that information is missing at random and has no meaningful interpretation and/or relationship to the target
• Keep
  • Missing values can be meaningful
  • e.g. a customer did not disclose their income because of their current circumstances
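Mean substitution, one of the imputation schemes listed above, can be sketched as follows. Marking a missing value with `Double.NaN` and the class name `MeanImputer` are assumptions chosen for this example.

```java
final class MeanImputer {

    /**
     * Returns a copy of the input in which every missing value (NaN here)
     * is replaced by the mean of the observed values of that variable.
     */
    static double[] impute(double[] values) {
        double sum = 0.0;
        int observed = 0;
        for (double v : values) {
            if (!Double.isNaN(v)) {
                sum += v;
                observed++;
            }
        }
        if (observed == 0) {
            return values.clone();       // nothing observed, so there is no mean to use
        }
        double mean = sum / observed;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = Double.isNaN(values[i]) ? mean : values[i];
        }
        return out;
    }

    public static void main(String[] args) {
        double[] ages = {30, Double.NaN, 50, 40};
        System.out.println(java.util.Arrays.toString(impute(ages)));
    }
}
```

Mean substitution keeps the variable's mean unchanged but shrinks its variance, which is one reason the other schemes in the list exist.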


Outliers of Dataset

• Outliers are extreme observations that are very dissimilar to the rest of the population
  • Valid observation
    • e.g. the salary of the boss
  • Invalid observation
    • e.g. an age of 300
• Multivariate outliers
  • Observations that are outlying in multiple dimensions
  • e.g. a temperature of 100 degrees in Fort Collins at midnight in December


Identifying Outliers using Box Plots

• A box plot represents three key quartiles of the data
  • Q1: 25% of the observations have a lower value
  • Q2 (the median, M): 50% of the observations have a lower value
  • Q3: 75% of the observations have a lower value
  • The minimum and maximum values are added, unless they are too far away from the edges of the box
• "Too far away" is quantified as more than 1.5 x Interquartile Range (IQR = Q3 - Q1) from the edges of the box

[Figure: box plot marking Min, Q1, M, Q3, the 1.5 x IQR distance, and the outliers beyond it]
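The 1.5 x IQR rule can be sketched in Java as below. The quartile convention (linear interpolation between the closest ranks) is an assumption; other conventions give slightly different Q1 and Q3 for small samples. The class and method names are illustrative only.

```java
import java.util.Arrays;

final class BoxPlotOutliers {

    /** Quantile of an already sorted, non-empty array using linear interpolation. */
    static double quantile(double[] sorted, double p) {
        double pos = p * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    /** Returns {lowerFence, upperFence} = {Q1 - 1.5 * IQR, Q3 + 1.5 * IQR}. */
    static double[] fences(double[] data) {
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        double q1 = quantile(sorted, 0.25);
        double q3 = quantile(sorted, 0.75);
        double iqr = q3 - q1;
        return new double[] {q1 - 1.5 * iqr, q3 + 1.5 * iqr};
    }

    /** An observation is flagged when it lies outside the two fences. */
    static boolean isOutlier(double x, double[] fences) {
        return x < fences[0] || x > fences[1];
    }

    public static void main(String[] args) {
        double[] data = {1, 2, 3, 4, 5, 6, 7, 8, 100};
        double[] f = fences(data);
        for (double x : data) {
            if (isOutlier(x, f)) {
                System.out.println("outlier: " + x);
            }
        }
    }
}
```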


Identifying Outliers using Z-Score

• Measuring how many standard deviations an observation is away from the mean
  • $z_i = \frac{x_i - \mu}{\sigma}$, where μ represents the average of the variable and σ its standard deviation
• A practical rule of thumb defines outliers as observations whose absolute z-score |z| is bigger than 3

ID    Age    Z-Score
1     30     (30-40)/10 = -1
2     50     (50-40)/10 = +1
3     10     (10-40)/10 = -3
4     40     (40-40)/10 = 0
5     60     (60-40)/10 = +2
6     80     (80-40)/10 = +4
...   ...    ...

Age: μ = 40, σ = 10
Z-Score: μ = 0, σ = 1
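The table can be reproduced with a few lines of Java. The mean (40) and standard deviation (10) are taken from the slide as given, since the table is only an excerpt of the data; the class name `ZScore` is an assumption.

```java
final class ZScore {

    /** z = (x - mean) / stdDev */
    static double zScore(double x, double mean, double stdDev) {
        return (x - mean) / stdDev;
    }

    /** Rule of thumb: an observation is an outlier when |z| is bigger than 3. */
    static boolean isOutlier(double x, double mean, double stdDev) {
        return Math.abs(zScore(x, mean, stdDev)) > 3.0;
    }

    public static void main(String[] args) {
        double mean = 40.0;      // given on the slide
        double stdDev = 10.0;    // given on the slide
        double[] ages = {30, 50, 10, 40, 60, 80};
        for (double age : ages) {
            System.out.println(age + " -> z = " + zScore(age, mean, stdDev)
                    + ", outlier = " + isOutlier(age, mean, stdDev));
        }
    }
}
```

With a strict "bigger than 3" rule, only the age of 80 (z = +4) is flagged; the age of 10 sits exactly on the boundary (z = -3) and is not.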


Dealing with Outliers

• Treat outliers as missing values
• Popular schemes
  • Truncation
    • Taking only values that are within the limits
  • Winsorizing
    • Limiting extreme values to reduce the effect of possible spurious outliers
    • Before: {92, 19, 101, 58, 1053, 91, 26, 78, 10, 13, -40, 101, 86, 85, 15, 89, 89, 28, -5, 41} (N = 20, mean = 101.5)
    • After: {92, 19, 101, 58, 101, 91, 26, 78, 10, 13, -5, 101, 86, 85, 15, 89, 89, 28, -5, 41} (N = 20, mean = 55.65)

Using the Z-Scores for truncation
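The winsorizing example can be checked with the sketch below. It assumes a rank-based rule in which the k smallest and k largest values are pulled in to the nearest remaining value (k = 1 reproduces the numbers on the slide); the slide itself does not state which limits were used.

```java
import java.util.Arrays;

final class Winsorizer {

    /**
     * Clamps the k smallest values up to the (k+1)-th smallest value and the
     * k largest values down to the (k+1)-th largest. Requires 2 * k < data.length.
     */
    static double[] winsorize(double[] data, int k) {
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        double lower = sorted[k];
        double upper = sorted[sorted.length - 1 - k];
        double[] out = new double[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = Math.max(lower, Math.min(upper, data[i]));
        }
        return out;
    }

    static double mean(double[] data) {
        double sum = 0.0;
        for (double v : data) {
            sum += v;
        }
        return sum / data.length;
    }

    public static void main(String[] args) {
        double[] data = {92, 19, 101, 58, 1053, 91, 26, 78, 10, 13,
                         -40, 101, 86, 85, 15, 89, 89, 28, -5, 41};
        double[] w = winsorize(data, 1);
        System.out.println("mean before: " + mean(data));
        System.out.println("mean after:  " + mean(w));
    }
}
```

Unlike deleting the extreme observations, winsorizing keeps N at 20, so only the two extreme values change.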


Standardizing Data

• Scaling variables to a similar range
  • e.g. two variables: education and income
    • Education: elementary school (1), middle school (2), high school (3), college (4), graduate school (5)
    • Income: 0 ~ $5M
  • When building logistic regression models, the coefficient for income might become very small because of its much larger range
• Min/Max standardization
  • $X_{new} = \frac{X_{old} - \min(X_{old})}{\max(X_{old}) - \min(X_{old})} \, (newmax - newmin) + newmin$
  • Where newmax and newmin are the newly imposed maximum and minimum (e.g. 1 and 0)
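A minimal sketch of min/max standardization: each value is shifted by the old minimum, divided by the old range, and then stretched to the new range. The class name `MinMaxScaler` and the handling of a constant variable (all values mapped to newMin) are assumptions made for this example.

```java
final class MinMaxScaler {

    /** Rescales x linearly so that min(x) maps to newMin and max(x) maps to newMax. */
    static double[] scale(double[] x, double newMin, double newMax) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : x) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            if (max == min) {
                out[i] = newMin;     // assumption: a constant variable maps to newMin
            } else {
                out[i] = (x[i] - min) / (max - min) * (newMax - newMin) + newMin;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] income = {0, 2_500_000, 5_000_000};
        System.out.println(java.util.Arrays.toString(scale(income, 0.0, 1.0)));
    }
}
```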





Standardizing Data -- continued

• Z-Score based
  • Calculate the z-scores
• Decimal scaling
  • $X_{new} = \frac{X_{old}}{10^{n}}$
  • Dividing by a power of 10
• Standardization is useful for regression-based approaches
  • It is not needed for decision trees
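Decimal scaling can be sketched as below. How the exponent n is chosen is an assumption here: the sketch uses the smallest n for which every scaled absolute value is below 1, which equals the number of integer digits of the largest absolute value. The class name `DecimalScaler` is illustrative.

```java
final class DecimalScaler {

    /** Smallest n such that max|x| / 10^n < 1 (assumed choice of n). */
    static int exponent(double[] x) {
        double maxAbs = 0.0;
        for (double v : x) {
            maxAbs = Math.max(maxAbs, Math.abs(v));
        }
        int n = 0;
        double divisor = 1.0;
        while (maxAbs / divisor >= 1.0) {
            divisor *= 10.0;
            n++;
        }
        return n;
    }

    /** Divides every value by 10^n. */
    static double[] scale(double[] x) {
        double divisor = Math.pow(10.0, exponent(x));
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            out[i] = x[i] / divisor;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] values = {1, 50, -999};
        System.out.println("n = " + exponent(values));
        System.out.println(java.util.Arrays.toString(scale(values)));
    }
}
```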


Part 0. Introduction

Big Data Analytics - Big Data Technology Stack


In a nutshell

• Data Layer: Apache HDFS, Amazon AWS's S3, IBM GPFS, Microsoft Azure
• Data Processing Layer: Apache Hadoop MapReduce, Pig, Apache Spark, Cassandra, Storm, Mahout, MLlib
• Data Integration Layer: Apache Flume, Apache Kafka, Apache Sqoop
• Operations and Scheduling Layer: Apache Ambari, Apache Oozie, Apache ZooKeeper
• Data Presentation Layer: Kibana
• Security and Governance


Part 1. Large Scale Data Analytics

Introduction to MapReduce


This material is developed based on,

• Anand Rajaraman, Jure Leskovec, and Jeffrey Ullman, "Mining of Massive Datasets", Cambridge University Press, 2012 -- Chapter 2
  • Download this chapter from the CS435 schedule page
• Hadoop: The Definitive Guide, Tom White, O'Reilly, 3rd Edition, 2014
• MapReduce Design Patterns, Donald Miner and Adam Shook, O'Reilly, 2013





What is MapReduce?


MapReduce [1/2]

• MapReduce is inspired by the concepts of map and reduce in Lisp.
• "Modern" MapReduce
  • Developed within Google as a mechanism for processing large amounts of raw data
    • Crawled documents or web request logs
  • Distributes these data across thousands of machines
  • The same computations are performed on each CPU with a different dataset
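The functional map and reduce primitives that inspired MapReduce can be illustrated on a single machine with Java streams. This is only an analogy for the concept, not Hadoop code; the class name `FunctionalMapReduce` and the sum-of-squares task are assumptions for the example.

```java
import java.util.Arrays;
import java.util.List;

final class FunctionalMapReduce {

    /** map: square every element independently; reduce: fold the results into one sum. */
    static int sumOfSquares(List<Integer> numbers) {
        return numbers.stream()
                .map(x -> x * x)             // applied to each element on its own
                .reduce(0, Integer::sum);    // combines all mapped values
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(Arrays.asList(1, 2, 3, 4)));  // prints 30
    }
}
```

Because the map step treats every element independently, it is the part that can be spread over many machines, each working on a different piece of the data.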


Sorting 1TB of numbers with a single machine

• What will be the challenges?


Sorting 1TB of numbers with multiple machines

• What will be the requirements to perform this sorting?


MapReduce [2/2]

• MapReduce provides an abstraction that allows engineers to perform simple computations while hiding the details of parallelization, data distribution, load balancing, and fault tolerance


Mapper

• Mapper maps input key/value pairs to a set of intermediate key/value pairs
  • Maps are the individual tasks that transform input records into intermediate records
  • The transformed intermediate records do not need to be of the same type as the input records
  • A given input pair may map to zero or many output pairs
• The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job





Reducer

• Reducer reduces a set of intermediate values which share a key to a smaller set of values
• Reducer has 3 primary phases
  • Shuffle, sort, and reduce
• Shuffle
  • Input to the reducer is the sorted output of the mappers
  • The framework fetches the relevant partition of the output of all the mappers via HTTP
• Sort
  • The framework groups input to the reducer by keys
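The shuffle and sort phases can be pictured with the single-process sketch below: each intermediate pair is assigned to a partition by hashing its key, and one reducer's pairs are then grouped by key in sorted order. The partition formula mirrors hash partitioning as commonly done in Hadoop, but `String.hashCode()` stands in for the Hadoop key type here, and the class and method names are assumptions. Network transfer is not modeled.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ShuffleSortSketch {

    /** Chooses the reducer for a key; the mask keeps the hash non-negative. */
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Collects the pairs that belong to one reducer and groups them by sorted key. */
    static Map<String, List<Integer>> groupForReducer(
            List<Map.Entry<String, Integer>> mapOutput, int reducer, int numReducers) {
        Map<String, List<Integer>> grouped = new TreeMap<>();   // TreeMap keeps keys sorted
        for (Map.Entry<String, Integer> pair : mapOutput) {
            if (partition(pair.getKey(), numReducers) == reducer) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        mapOutput.add(new AbstractMap.SimpleEntry<>("b", 1));
        mapOutput.add(new AbstractMap.SimpleEntry<>("a", 1));
        mapOutput.add(new AbstractMap.SimpleEntry<>("b", 1));
        System.out.println(groupForReducer(mapOutput, 0, 1));
    }
}
```

Since the partition depends only on the key, all values for a given key end up at the same reducer, which is what lets the reduce phase see them together.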


MapReduce Example 1


Example 1: WordCount [1/5]

• For text files stored under /usr/joe/wordcount/input, count the number of occurrences of each word
• What do the files and directory look like?

$ bin/hadoop dfs -ls /usr/joe/wordcount/input/
/usr/joe/wordcount/input/file01
/usr/joe/wordcount/input/file02

$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
Hello World, Bye World!

$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
Hello Hadoop, Goodbye to hadoop.



• Run the MapReduce application

$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output

$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop, 1
Hello 2
World! 1
World, 1
hadoop. 1
to 1



Mappers
1. Read a line
2. Tokenize the string
3. Pass the <key,value> output to the reducer

Reducers
1. Collect <key,value> pairs sharing the same key
2. Aggregate the total number of occurrences

What do you have to pass from the Mappers?



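// Imports needed at the top of the enclosing WordCount source file
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Nested in WordCount: emits <word, 1> for every token of each input line
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();   // reused for every output key

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);   // splits on whitespace
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}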






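// Imports needed at the top of the enclosing WordCount source file
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Nested in WordCount: sums the counts collected for one word
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));   // emits <word, total>
    }
}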
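The Map and Reduce logic above can be traced without a cluster using the plain-Java sketch below, which applies the same tokenizing and summing steps to the two input lines from file01 and file02. It does not use the Hadoop API; the class `LocalWordCount` and its method names are hypothetical, and a `TreeMap` stands in for the framework's sort and group step.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

final class LocalWordCount {

    /** Map step: emit <word, 1> for every whitespace-separated token of a line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return out;
    }

    /** Reduce step: sum all counts that share one key. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Runs map, groups the intermediate pairs by sorted key, then runs reduce. */
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Hello World, Bye World!",
                "Hello Hadoop, Goodbye to hadoop.");
        for (Map.Entry<String, Integer> e : run(lines).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Punctuation stays attached to the tokens because the tokenizer splits on whitespace only, which is why "World," and "World!" are counted as separate words in the job output.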


Questions?
