

CS435 Introduction to Big Data, Fall 2019, Colorado State University

8/28/2019, Week 1-B, Sangmi Lee Pallickara

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.0

Computer Science Department Picnic

When: Saturday, August 31st

Time: 1pm-4pm

Where: City Park Shelter #7

Welcome to the 2019-2020 academic year!

Meet your faculty, department staff, and fellow students in a social setting. Food and drink will be provided.

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.1

CS435 Introduction to Big Data

PART 0. INTRODUCTION TO BIG DATA

Sangmi Lee Pallickara

Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs435

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.2

FAQs

• PA0 has been posted
  • Sept. 5, 5:00PM via Canvas
  • Individual submission (no team submission)
  • Grading: interview with GTA(s)

• TP0
  • Sept. 4, 5:00PM via Canvas

• Accommodation request, honor student
  • Contact me by Sept. 6, 2019

• No laptops in class
  • Except the last row

• Readings
  • Reading research papers
  • Keshav's "How to Read a Paper"
  • "How to Read and Understand a Scientific Paper: A Step-by-Step Guide for Non-Scientists"

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.3

Topics

• Introduction to Big Data Analytics
  • Data Collection, Sampling, and Preprocessing

• Introduction to MapReduce

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.4

Part 0. Introduction

Big Data Analytics: Data Collection, Sampling, and Preprocessing

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.5

This material is based on:

• Analytics in a Big Data World: The Essential Guide to Data Science and Its Applications, Bart Baesens, 2014, Wiley

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.6

Analytics Process Model

The most time-consuming step is the data selection and preprocessing step

- This is usually around 80% of the total time needed to build an analytical model

Analytics in a Big Data World: The Essential Guide to Data Science and Its Applications, Bart Baesens, 2014, Wiley

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.7

Types of Analytics

• Analytics is a term that is often used interchangeably with
  • Data science
  • Data mining
  • Knowledge discovery

• Descriptive analytics
  • No target variable
  • e.g. clustering, association rules

• Predictive analytics
  • A target variable is typically available
  • e.g. linear/logistic regression, decision trees, neural networks, support vector machines

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.8

Types of Data Sources

• Transactions
  • Structured, low-level, detailed information
  • Customer transactions
    • Purchase, claim, cash transfer, credit card payment
  • Stored in massive online transaction processing (OLTP) relational databases
  • Can be summarized over longer time horizons (e.g. averages, relative trends, max/min values)

• Unstructured data embedded in text documents
  • Emails, web pages, claim forms
  • Requires extensive preprocessing

• Qualitative, expert-based data
  • Requires subject matter experts' (SME) analysis

• Scientific data

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.9

Sampling

• Taking a subset of data for analytics
  • Generating hypotheses
  • Model selection
  • Feature selection
  • Speculative process
  • Building an analytics model

• Stratified sampling
  • Taking samples according to predefined strata
  • e.g. fraud detection with a very skewed dataset (99 percent non-fraud customers, 1 percent fraud customers)
  • The sample should contain the same percentage of fraud customers as the original data
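The stratified sampling idea above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the class name `StratifiedSampler`, the 10 percent sampling fraction, the seed, and the synthetic customer IDs are all hypothetical and do not come from the slides. The sketch draws the same fraction from each stratum, so the 99/1 split of the original data carries over to the sample.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class StratifiedSampler {
    // Draws the same fraction from every stratum, so each stratum keeps
    // (up to rounding) its share of the original data.
    static <T> Map<String, List<T>> sample(Map<String, List<T>> strata,
                                           double fraction, long seed) {
        Random rng = new Random(seed);
        Map<String, List<T>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<T>> stratum : strata.entrySet()) {
            List<T> shuffled = new ArrayList<>(stratum.getValue());
            Collections.shuffle(shuffled, rng);
            int take = (int) Math.round(shuffled.size() * fraction);
            result.put(stratum.getKey(), new ArrayList<>(shuffled.subList(0, take)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical data: IDs 0..989 are non-fraud, IDs 990..999 are fraud.
        List<Integer> nonFraud = new ArrayList<>();
        List<Integer> fraud = new ArrayList<>();
        for (int id = 0; id < 1000; id++) {
            if (id < 990) {
                nonFraud.add(id);
            } else {
                fraud.add(id);
            }
        }
        Map<String, List<Integer>> strata = new LinkedHashMap<>();
        strata.put("non-fraud", nonFraud);
        strata.put("fraud", fraud);

        Map<String, List<Integer>> drawn = sample(strata, 0.10, 42L);
        System.out.println("non-fraud sampled: " + drawn.get("non-fraud").size());
        System.out.println("fraud sampled: " + drawn.get("fraud").size());
    }
}
```

A plain random sample of 100 customers could easily contain no fraud cases at all, which is the situation sampling per stratum avoids.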

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.10

Types of Data Elements

• Continuous
  • Data elements that are defined on an interval that can be limited or unlimited
  • e.g. income, sales, temperature

• Categorical
  • Nominal
    • Data elements that can only take on a limited set of values with no meaningful ordering between them
    • e.g. marital status, profession, purpose of loan
  • Ordinal
    • Data elements that can only take on a limited set of values with a meaningful ordering between them
    • e.g. credit rating, age coded as young, middle-aged, and old
  • Binary
    • Data elements that can only take on two values
    • e.g. having a child, allowed to drive

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.11

Missing Values

• Missing values can occur for various reasons
  • The information can be not applicable
  • The information can be undisclosed
  • The information can be unavailable

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.12

Missing Values (continued)

• Replace (impute)
  • Replaces the missing value with a computed/selected value
  • Imputation algorithm examples
    • Hot-deck: replaces with a randomly selected similar record
    • Cold-deck: selects a replacement from another dataset
    • Mean substitution: replaces with the mean of that variable over all other cases
    • Regression: predicts missing values of a variable based on other variables

• Delete
  • Deletes observations with lots of missing values
  • This assumes that information is missing at random and has no meaningful interpretation and/or relationship to the target

• Keep
  • Missing values can be meaningful
  • e.g. a customer did not disclose their income because of their current situation
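Of the imputation schemes listed above, mean substitution is the simplest to write down. The following is a minimal sketch, assuming missing values are encoded as `Double.NaN`; that encoding, the class name `MeanImputation`, and the sample ages are illustrative assumptions rather than anything prescribed by the slides.

```java
import java.util.Arrays;

class MeanImputation {
    // Replaces every NaN (our marker for "missing") with the mean of the
    // observed values of the same variable.
    static double[] impute(double[] values) {
        double sum = 0.0;
        int observed = 0;
        for (double v : values) {
            if (!Double.isNaN(v)) {
                sum += v;
                observed++;
            }
        }
        double[] out = values.clone();
        if (observed == 0) {
            return out; // nothing to compute a mean from; leave the data unchanged
        }
        double mean = sum / observed;
        for (int i = 0; i < out.length; i++) {
            if (Double.isNaN(out[i])) {
                out[i] = mean;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] ages = {30, Double.NaN, 50};
        System.out.println(Arrays.toString(impute(ages)));
    }
}
```

Mean substitution keeps the variable's mean unchanged but shrinks its variance, which is one reason the other schemes on this slide exist.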

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.13

Outliers of Dataset

• Outliers are extreme observations that are very dissimilar to the rest of the population
  • Valid observation
    • e.g. the salary of the boss
  • Invalid observation
    • e.g. an age of 300

• Multivariate outliers
  • Observations that are outlying in multiple dimensions
  • e.g. the temperature in Fort Collins is 100 degrees, but at midnight in December

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.14

Identifying Outliers using Box Plots

• A box plot represents three key quartiles of the data
  • Q1: 25% of the observations have a lower value
  • Q2 (the median, M): 50% of the observations have a lower value
  • Q3: 75% of the observations have a lower value
  • The minimum and maximum values are added, unless they are too far away from the edges of the box

• "Too far away" is quantified as more than 1.5 x the interquartile range (IQR = Q3 - Q1) from the edge of the box

[Figure: box plot marking Min, Q1, M (median), Q3, the 1.5 x IQR range, and the outliers beyond it]
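The box-plot rule can be turned into a short check. The sketch below is one possible reading, not the slide's own procedure: it assumes quartiles are computed by linear interpolation between sorted values (other quartile conventions give slightly different fences), and the class name `BoxPlotOutliers` and the sample data are hypothetical. An observation is flagged when it lies more than 1.5 x IQR below Q1 or above Q3.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BoxPlotOutliers {
    // Quantile by linear interpolation between the two nearest sorted values.
    // Assumes a non-empty, already sorted array and 0 <= p <= 1.
    static double quantile(double[] sorted, double p) {
        double pos = p * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    // Returns the observations lying more than 1.5 x IQR outside the box.
    static List<Double> outliers(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double q1 = quantile(sorted, 0.25);
        double q3 = quantile(sorted, 0.75);
        double iqr = q3 - q1;
        double lowerFence = q1 - 1.5 * iqr;
        double upperFence = q3 + 1.5 * iqr;
        List<Double> found = new ArrayList<>();
        for (double v : values) {
            if (v < lowerFence || v > upperFence) {
                found.add(v);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        double[] data = {1, 2, 3, 4, 5, 6, 7, 8, 100};
        // Here Q1 = 3 and Q3 = 7, so IQR = 4 and the fences are -3 and 13.
        System.out.println(outliers(data));
    }
}
```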

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.15

Identifying Outliers using Z-Score

• Measuring how many standard deviations an observation is away from the mean
  • $z_i = \frac{x_i - \mu}{\sigma}$, where μ represents the average of the variable and σ its standard deviation

• A practical rule of thumb then defines outliers when the absolute value of the z-score | z | is bigger than 3

ID   Age   Z-Score
1    30    (30-40)/10 = -1
2    50    (50-40)/10 = +1
3    10    (10-40)/10 = -3
4    40    (40-40)/10 = 0
5    60    (60-40)/10 = +2
6    80    (80-40)/10 = +4
..   ..    …

Age: μ = 40, σ = 10
Z-Score: μ = 0, σ = 1
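The z-score table can be reproduced with a few lines of Java. This is a minimal sketch that takes μ = 40 and σ = 10 as given, exactly as the slide does (they describe the full dataset, of which only six rows are shown); the class and method names are illustrative.

```java
class ZScoreOutliers {
    // z = (x - mean) / stdDev
    static double zScore(double x, double mean, double stdDev) {
        return (x - mean) / stdDev;
    }

    // Rule of thumb from the slide: an outlier has |z| strictly bigger than 3.
    static boolean isOutlier(double x, double mean, double stdDev) {
        return Math.abs(zScore(x, mean, stdDev)) > 3.0;
    }

    public static void main(String[] args) {
        double mean = 40.0;   // taken from the slide
        double stdDev = 10.0; // taken from the slide
        double[] ages = {30, 50, 10, 40, 60, 80};
        for (double age : ages) {
            System.out.println(age + " -> z = " + zScore(age, mean, stdDev)
                    + ", outlier = " + isOutlier(age, mean, stdDev));
        }
    }
}
```

With this strict reading of "bigger than 3", the age of 10 (z = -3) is not flagged, while the age of 80 (z = +4) is.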

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.16

Dealing with Outliers

• Treat outliers as missing values
• Popular schemes
  • Truncation
    • Taking only values that are within the limits
  • Winsorizing
    • Limiting extreme values to reduce the effect of possible spurious outliers
    • {92, 19, 101, 58, 1053, 91, 26, 78, 10, 13, -40, 101, 86, 85, 15, 89, 89, 28, -5, 41} (N = 20, mean = 101.5)
      → {92, 19, 101, 58, 101, 91, 26, 78, 10, 13, -5, 101, 86, 85, 15, 89, 89, 28, -5, 41} (N = 20, mean = 55.65)

Using the Z-Scores for truncation
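The winsorizing example above can be reproduced with a short sketch. The rule used here is an assumption chosen because it matches the slide's numbers: the k smallest and k largest observations are pulled in to the nearest remaining value (k = 1 for the 20 values shown). The class name `Winsorizer` and the parameter `k` are illustrative.

```java
import java.util.Arrays;

class Winsorizer {
    // Clamps every value into [low, high], where low is the (k+1)-th smallest
    // and high the (k+1)-th largest observation. Assumes 2 * k < values.length.
    static double[] winsorize(double[] values, int k) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double low = sorted[k];
        double high = sorted[sorted.length - 1 - k];
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = Math.min(high, Math.max(low, values[i]));
        }
        return out;
    }

    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        double[] data = {92, 19, 101, 58, 1053, 91, 26, 78, 10, 13,
                         -40, 101, 86, 85, 15, 89, 89, 28, -5, 41};
        double[] limited = winsorize(data, 1);
        System.out.println("mean before: " + mean(data));
        System.out.println("mean after:  " + mean(limited));
        System.out.println(Arrays.toString(limited));
    }
}
```

Unlike deleting the two extreme observations, this keeps N = 20 while removing most of the pull that 1053 and -40 exert on the mean.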

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.17

Standardizing Data

• Scaling variables to a similar range
  • e.g. two variables: education and income
    • Education: elementary school (1), middle school (2), high school (3), college (4), graduate school (5)
    • Income: $0 to $5M
  • When building logistic regression models, the coefficient for income might become very small, because its values are so much larger than those of education

• Min/Max standardization
  • $X_{new} = \frac{X_{old} - \min(X_{old})}{\max(X_{old}) - \min(X_{old})}\,(newmax - newmin) + newmin$
  • Where newmax and newmin are the newly imposed maximum and minimum (e.g. 1 and 0)
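As a worked instance of the min/max formula, the sketch below rescales a few income values to the range [0, 1]. The class name `MinMaxScaler`, the sample incomes, and the handling of a constant variable (mapped to newmin to avoid dividing by zero) are assumptions made for illustration.

```java
import java.util.Arrays;

class MinMaxScaler {
    // X_new = (X_old - min) / (max - min) * (newMax - newMin) + newMin
    static double[] scale(double[] values, double newMin, double newMax) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // Assumption: a constant variable (range 0) is mapped to newMin.
            out[i] = (range == 0.0)
                    ? newMin
                    : (values[i] - min) / range * (newMax - newMin) + newMin;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] income = {0, 2_500_000, 5_000_000};
        System.out.println(Arrays.toString(scale(income, 0.0, 1.0)));
    }
}
```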

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.18

Standardizing Data (continued)

• Z-Score based
  • Calculate the z-scores

• Decimal scaling
  • $X_{new} = \frac{X_{old}}{10^{n}}$
  • Dividing by a power of 10

• Standardization is useful for regression-based approaches
  • It is not needed for decision trees
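Decimal scaling can be sketched as follows. The slide only says "dividing by a power of 10"; choosing n as the smallest exponent that brings every absolute value below 1 is a common convention and is an assumption here, as are the class name `DecimalScaler` and the sample incomes.

```java
import java.util.Arrays;

class DecimalScaler {
    // Smallest n such that max(|x|) / 10^n < 1 (assumed convention).
    static int exponentFor(double[] values) {
        double maxAbs = 0.0;
        for (double v : values) {
            maxAbs = Math.max(maxAbs, Math.abs(v));
        }
        int n = 0;
        while (maxAbs / Math.pow(10, n) >= 1.0) {
            n++;
        }
        return n;
    }

    // X_new = X_old / 10^n
    static double[] scale(double[] values) {
        double divisor = Math.pow(10, exponentFor(values));
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = values[i] / divisor;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] income = {0, 2_500_000, 5_000_000};
        System.out.println("n = " + exponentFor(income));
        System.out.println(Arrays.toString(scale(income)));
    }
}
```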

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.19

Part 0. Introduction

Big Data Analytics: Big Data Technology Stack

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.20

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.21

In a nutshell

• Data Layer: Apache HDFS, Amazon AWS's S3, IBM GPFS, Microsoft Azure

• Data Processing Layer: Apache Hadoop MapReduce, Pig, Apache Spark, Cassandra, Storm, Mahout, MLlib

• Data Integration Layer: Apache Flume, Apache Kafka, Apache Sqoop

• Operations and Scheduling Layer: Apache Ambari, Apache Oozie, Apache ZooKeeper

• Data Presentation Layer: Kibana

Security and Governance

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.22

Part 1. Large Scale Data Analytics

Introduction to MapReduce

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.23

This material is developed based on:

• Anand Rajaraman, Jure Leskovec, and Jeffrey Ullman, "Mining of Massive Datasets", Cambridge University Press, 2012, Chapter 2
  • Download this chapter from the CS435 schedule page

• Hadoop: The Definitive Guide, Tom White, O'Reilly, 3rd Edition, 2014

• MapReduce Design Patterns, Donald Miner and Adam Shook, O'Reilly, 2013

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.24

What is MapReduce?

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.25

MapReduce [1/2]

• MapReduce is inspired by the concepts of map and reduce in Lisp.

• "Modern" MapReduce
  • Developed within Google as a mechanism for processing large amounts of raw data
    • Crawled documents or web request logs
  • Distributes these data across thousands of machines
  • The same computations are performed on each CPU with a different dataset

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.26

Sorting 1TB of numbers with a single machine

• What will be the challenges?

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.27

Sorting 1TB of numbers with multiple machines

• What will be the requirements to perform this sorting?

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.28

MapReduce [2/2]

• MapReduce provides an abstraction that allows engineers to perform simple computations while hiding the details of parallelization, data distribution, load balancing, and fault tolerance

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.29

Mapper

• Mapper maps input key/value pairs to a set of intermediate key/value pairs
  • Maps are the individual tasks that transform input records into intermediate records

• The transformed intermediate records do not need to be of the same type as the input records

• A given input pair may map to zero or many output pairs

• The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.30

Reducer

• Reducer reduces a set of intermediate values which share a key to a smaller set of values

• Reducer has 3 primary phases
  • Shuffle, sort, and reduce

• Shuffle
  • Input to the reducer is the sorted output of the mappers
  • The framework fetches the relevant partition of the output of all the mappers via HTTP

• Sort
  • The framework groups input to the reducer by keys

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.31

MapReduce Example 1

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.32

Example 1: WordCount [1/5]

• For text files stored under /usr/joe/wordcount/input, count the number of occurrences of each word
  • What do the files and directory look like?

$ bin/hadoop dfs -ls /usr/joe/wordcount/input/
/usr/joe/wordcount/input/file01
/usr/joe/wordcount/input/file02

$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
Hello World, Bye World!

$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
Hello Hadoop, Goodbye to hadoop.

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.33

Example 1: WordCount [2/5]

• Run the MapReduce application

$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output

$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop, 1
Hello 2
World! 1
World, 1
hadoop. 1
to 1

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.34

Example 1: WordCount [3/5]

Mappers
1. Read a line
2. Tokenize the string
3. Pass the <key,value> output to the reducer

Reducers
1. Collect <key,value> pairs sharing the same key
2. Aggregate the total number of occurrences

What do you have to pass from the Mappers?
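The mapper and reducer steps above can be traced without a cluster. The following is a plain-Java, in-memory sketch of the same data flow; it does not use the Hadoop API at all, and the class name `InMemoryWordCount` and its method names are hypothetical. Each mapper call emits a (word, 1) pair per token, a sorted map stands in for the shuffle and sort phases, and the reducer sums the values for each key.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class InMemoryWordCount {
    // Map: emit a (word, 1) pair for every whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<String, Integer>(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle and sort: group the values by key, with keys kept in sorted order.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: sum all counts that share one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static TreeMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            intermediate.addAll(map(line));
        }
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(intermediate).entrySet()) {
            counts.put(group.getKey(), reduce(group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Hello World, Bye World!",
                "Hello Hadoop, Goodbye to hadoop.");
        run(lines).forEach((word, count) -> System.out.println(word + " " + count));
    }
}
```

Because tokens are split on whitespace only, punctuation stays attached to the words, which is why "World," and "World!" are counted separately in the job output shown earlier.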

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.35

Example 1: WordCount [4/5]

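// Imports needed at the top of the enclosing WordCount source file
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Nested inside the WordCount class: emits (word, 1) for every token of a line
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}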

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.36

Example 1: WordCount [5/5]

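// Additional import needed at the top of the enclosing WordCount source file
import org.apache.hadoop.mapreduce.Reducer;

// Nested inside the WordCount class: sums the counts collected for one word
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}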

8/28/2019 CS435 Introduction to Big Data – Fall 2019 W1.B.37

Questions?