
CSE 5243 INTRO. TO DATA MINING

Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han

Data & Data Preprocessing & Classification (Basic Concepts)

Huan Sun, CSE@The Ohio State University, 09/05/2017

Chapter 3: Data Preprocessing

Data Preprocessing: An Overview

Data Cleaning

Data Integration

Data Reduction and Transformation

Dimensionality Reduction

Summary


Data Transformation

A function that maps the entire set of values of a given attribute to a new set of replacement values, such that each old value can be identified with one of the new values

Methods:

- Smoothing: remove noise from data
- Attribute/feature construction: new attributes constructed from the given ones
- Aggregation: summarization; data cube construction
- Normalization: scale values to fall within a smaller, specified range: min-max normalization, z-score normalization, normalization by decimal scaling
- Discretization: concept hierarchy climbing

Normalization

Min-max normalization: to [new_min_A, new_max_A]:

$$v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$$

Ex: Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to:

$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716$$

Z-score normalization (μ: mean, σ: standard deviation):

$$v' = \frac{v - \mu_A}{\sigma_A}$$

Z-score: the distance between the raw score and the population mean, in units of the standard deviation.

Ex: Let μ = 54,000 and σ = 16,000. Then:

$$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$$

Normalization by decimal scaling:

$$v' = \frac{v}{10^j}$$

where j is the smallest integer such that max(|v'|) < 1
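A minimal Python sketch of the three methods, assuming plain lists of numeric values (the function names are illustrative, not from the slides):

```python
import math

def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: map values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Z-score normalization: distance from the mean in units of std dev."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer making max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]

incomes = [12000, 54000, 73600, 98000]
print(min_max(incomes))  # 73600 -> 0.716..., as in the worked example above
```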

Discretization

Three types of attributes:
- Nominal: values from an unordered set, e.g., color, profession
- Ordinal: values from an ordered set, e.g., military or academic rank
- Numeric: e.g., integers or real numbers

Discretization: divide the range of a continuous attribute into intervals
- Interval labels can then be used to replace actual data values
- Reduces data size
- Supervised vs. unsupervised
- Split (top-down) vs. merge (bottom-up)
- Can be performed recursively on an attribute
- Prepares data for further analysis, e.g., classification

Data Discretization Methods

Binning: top-down split, unsupervised

Histogram analysis: top-down split, unsupervised

Clustering analysis: unsupervised; top-down split or bottom-up merge

Decision-tree analysis: supervised; top-down split

Correlation (e.g., χ²) analysis: unsupervised; bottom-up merge

Note: all of these methods can be applied recursively


Simple Discretization: Binning

Equal-width (distance) partitioning:
- Divides the range into N intervals of equal size (uniform grid)
- If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N
- The most straightforward approach, but outliers may dominate the presentation
- Skewed data is not handled well

Equal-depth (frequency) partitioning:
- Divides the range into N intervals, each containing approximately the same number of samples
- Good data scaling
- Managing categorical attributes can be tricky

Example: Binning Methods for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34

Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29

Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
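This example can be reproduced with a short sketch, assuming the data are already sorted and divide evenly into the requested number of bins (helper names are illustrative):

```python
def equal_depth_bins(sorted_values, n_bins):
    """Partition sorted data into n_bins bins of (roughly) equal frequency."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max."""
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_depth_bins(prices, 3)
print(bins)                       # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))      # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins)) # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```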

Discretization by Classification & Correlation Analysis

Classification (e.g., decision tree analysis):

- Supervised: given class labels, e.g., cancerous vs. benign
- Uses entropy to determine the split point (discretization point)
- Top-down, recursive split
- Details to be covered in the "Classification" sessions


Dimensionality Reduction

Curse of dimensionality:
- When dimensionality increases, data becomes increasingly sparse
- Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
- The number of possible combinations of subspaces grows exponentially

Dimensionality reduction: reducing the number of random variables under consideration by obtaining a set of principal variables

Advantages of dimensionality reduction:
- Avoids the curse of dimensionality
- Helps eliminate irrelevant features and reduce noise
- Reduces the time and space required in data mining
- Allows easier visualization

Dimensionality Reduction Techniques

Dimensionality reduction methodologies:

Feature selection: find a subset of the original variables (or features, attributes)

Feature extraction: transform the data in the high-dimensional space to a space of fewer dimensions

Some typical dimensionality reduction methods:

Principal Component Analysis

Supervised and nonlinear techniques

Feature subset selection

Feature creation

Principal Component Analysis (PCA)

PCA: a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components

The original data are projected onto a much smaller space, resulting in dimensionality reduction

Method: find the eigenvectors of the covariance matrix; these eigenvectors define the new space

[Figure: a ball travels in a straight line; data from three cameras recording it contain much redundancy]

Principal Components Analysis: Intuition

The goal is to find a projection that captures the largest amount of variation in the data

Find the eigenvectors of the covariance matrix; the eigenvectors define the new space

[Figure: data points in the (x1, x2) plane, with eigenvector e along the direction of largest variance]

Principal Component Analysis: Details

Let A be an n × n matrix representing the correlation or covariance of the data. λ is an eigenvalue of A if there exists a non-zero vector v such that:

$$Av = \lambda v, \quad \text{often rewritten as} \quad (A - \lambda I)v = 0$$

In this case, vector v is called an eigenvector of A corresponding to λ. For each eigenvalue λ, the set of all vectors v satisfying Av = λv is called the eigenspace of A corresponding to λ.
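A small numpy sketch of the procedure just described: center the data, take the covariance matrix, keep the top-k eigenvectors, and project. `pca` is an illustrative name, not a library routine:

```python
import numpy as np

def pca(X, k):
    """Project n samples x d features onto the top-k principal components."""
    X_centered = X - X.mean(axis=0)            # center each attribute
    cov = np.cov(X_centered, rowvar=False)     # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]          # sort by decreasing eigenvalue
    components = eigvecs[:, order[:k]]         # top-k eigenvectors = new space
    return X_centered @ components             # coordinates in the new space

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = 2 * X[:, 0] + 0.01 * rng.normal(size=100)  # a redundant attribute
print(pca(X, 2).shape)  # (100, 2): the redundancy is compressed away
```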

Attribute Subset Selection

Another way to reduce the dimensionality of data.

Redundant attributes: duplicate much or all of the information contained in one or more other attributes
- E.g., the purchase price of a product and the amount of sales tax paid

Irrelevant attributes: contain no information that is useful for the data mining task at hand
- E.g., a student's ID is often irrelevant to the task of predicting his/her GPA

Heuristic Search in Attribute Selection

There are 2^d possible attribute combinations of d attributes. Typical heuristic attribute selection methods:

- Best single attribute under the attribute independence assumption: choose by significance tests
- Best step-wise feature selection: the best single attribute is picked first; then the next best attribute conditioned on the first; and so on (see the sketch below)
- Step-wise attribute elimination: repeatedly eliminate the worst attribute
- Best combined attribute selection and elimination
- Optimal branch and bound: use attribute elimination and backtracking
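A hedged sketch of best step-wise (forward) selection; `score` is a placeholder for whatever evaluation the analyst chooses (e.g., a significance test statistic or held-out accuracy):

```python
def forward_selection(attributes, score, max_features=None):
    """Greedily add the attribute that most improves score(selected)."""
    selected, remaining = [], list(attributes)
    best_score = float("-inf")
    while remaining and (max_features is None or len(selected) < max_features):
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        cand_score = score(selected + [candidate])
        if cand_score <= best_score:   # no improvement: stop early
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = cand_score
    return selected

# Usage: forward_selection(["age", "income", "student"], my_scoring_fn)
```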

Attribute Creation (Feature Generation)

Create new attributes (features) that can capture the important information in a data set more effectively than the original ones.

Three general methodologies:
- Attribute extraction: domain-specific
- Mapping data to a new space (see data reduction): e.g., Fourier transformation, wavelet transformation, manifold approaches (not covered)
- Attribute construction: combining features (see discriminative frequent patterns in the chapter on "Advanced Classification"); data discretization

Summary

Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability

Data cleaning: e.g., missing/noisy values, outliers

Data integration from multiple sources: entity identification problem; remove redundancies; detect inconsistencies

Data reduction: dimensionality reduction; numerosity reduction; data compression

Data transformation and data discretization: normalization; concept hierarchy generation

References

D. P. Ballou and G. K. Tayi. Enhancing Data Quality in Data Warehouse Environments. Comm. of ACM, 42:73-78, 1999.

T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003.

T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining Database Structure; or, How to Build a Data Quality Browser. SIGMOD'02.

H. V. Jagadish et al. Special Issue on Data Reduction Techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), Dec. 1997.

D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.

E. Rahm and H. H. Do. Data Cleaning: Problems and Current Approaches. IEEE Bulletin of the Technical Committee on Data Engineering, Vol. 23, No. 4.

V. Raman and J. Hellerstein. Potter's Wheel: An Interactive Framework for Data Cleaning and Transformation. VLDB'2001.

T. Redman. Data Quality: Management and Technology. Bantam Books, 1992.

R. Wang, V. Storey, and C. Firth. A Framework for Analysis of Data Quality Research. IEEE Trans. Knowledge and Data Engineering, 7:623-640, 1995.

CS 412 INTRO. TO DATA MINING

Classification: Basic Concepts

Huan Sun, CSE@The Ohio State University, 09/05/2017

Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han

Classification: Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy: Ensemble Methods

Summary


Supervised vs. Unsupervised Learning

Supervised learning (classification):
- Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
- New data is classified based on the training set

Unsupervised learning (clustering):
- The class labels of the training data are unknown
- Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data


Prediction Problems: Classification vs. Numeric Prediction

Classification:
- Predicts categorical class labels (discrete or nominal)
- Classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data

Numeric prediction:
- Models continuous-valued functions, i.e., predicts unknown or missing values

Typical applications:
- Credit/loan approval
- Medical diagnosis: is a tumor cancerous or benign?
- Fraud detection: is a transaction fraudulent?
- Web page categorization: which category does a page belong to?


Classification: A Two-Step Process

(1) Model construction: describing a set of predetermined classes
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
- The set of tuples used for model construction is the training set
- The model is represented as classification rules, decision trees, or mathematical formulae

(2) Model usage: classifying future or unknown objects
- Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model; accuracy is the percentage of test set samples that are correctly classified by the model
- The test set is independent of the training set (otherwise overfitting results)
- If the accuracy is acceptable, use the model to classify new data

Note: if the test set is used to select/refine models, it is called a validation (test) set or development test set
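As a toy sketch of the two steps, using the tenure data from the slides that follow: the `model` function plays the role of the rule learned in Step (1), and its accuracy is estimated on the independent test set in Step (2):

```python
# Training and testing data from the Step (1)/(2) slides below.
test = [("Tom", "Assistant Prof", 2, "no"),
        ("Merlisa", "Associate Prof", 7, "no"),
        ("George", "Professor", 5, "yes"),
        ("Joseph", "Assistant Prof", 7, "yes")]

def model(rank, years):
    """Step (1) output: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Step (2): estimate accuracy on the independent test set.
correct = sum(model(rank, years) == label for _, rank, years, label in test)
print(correct / len(test))  # 0.75: Merlisa is misclassified by the rule
```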


Step (1): Model Construction

Training data:

NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no

Classification algorithm → Classifier (Model):

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'


Step (2): Using the Model in Prediction

Testing data:

NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes

New/unseen data: (Jeff, Professor, 4) → Tenured? The rule predicts 'yes' (rank = 'professor')



Decision Tree Induction: An Example

Training data set: buys_computer (the data set follows an example from Quinlan's ID3, Playing Tennis):

age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no

Resulting tree:

age?
- <=30: student?
  - no → buys_computer = no
  - yes → buys_computer = yes
- 31…40: buys_computer = yes
- >40: credit_rating?
  - excellent → buys_computer = no
  - fair → buys_computer = yes


Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm; a compact sketch follows this list):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner
- At the start, all the training examples are at the root
- Attributes are categorical (if continuous-valued, they are discretized in advance)
- Examples are partitioned recursively based on selected attributes
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Conditions for stopping partitioning:
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
- There are no samples left
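A compact Python sketch of this greedy loop, using information gain (defined on the next slides) as the selection measure. Records are assumed to be dicts with a "label" key; the names are illustrative:

```python
import math
from collections import Counter

def entropy(rows):
    """Info(D): expected bits needed to classify a tuple in rows."""
    counts = Counter(r["label"] for r in rows)
    total = len(rows)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def gain(rows, attr):
    """Information gained by splitting rows on attr."""
    parts = Counter(r[attr] for r in rows)
    rem = sum(n / len(rows) * entropy([r for r in rows if r[attr] == v])
              for v, n in parts.items())
    return entropy(rows) - rem

def build_tree(rows, attrs):
    labels = {r["label"] for r in rows}
    if len(labels) == 1:                  # all samples in one class: leaf
        return labels.pop()
    if not attrs:                         # no attributes left: majority vote
        return Counter(r["label"] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a))   # greedy choice
    return {best: {v: build_tree([r for r in rows if r[best] == v],
                                 [a for a in attrs if a != best])
                   for v in {r[best] for r in rows}}}
```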

Brief Review of Entropy

Entropy (information theory): a measure of the uncertainty associated with a random variable.

Calculation: for a discrete random variable Y taking m distinct values $\{y_1, y_2, \ldots, y_m\}$ with $p_i = P(Y = y_i)$:

$$H(Y) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

Interpretation: higher entropy → higher uncertainty; lower entropy → lower uncertainty.

Conditional entropy:

$$H(Y \mid X) = \sum_{x} P(X = x)\, H(Y \mid X = x)$$

[Figure: entropy of a binary variable (m = 2) as a function of p, maximal at p = 0.5]
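A quick numeric check of the binary case (m = 2), showing that entropy peaks at p = 0.5 and vanishes as the outcome becomes certain:

```python
import math

def H(p):
    """Entropy of a binary variable with P(Y=1) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.01, 0.25, 0.5, 0.75, 0.99):
    print(p, round(H(p), 3))   # 0.081, 0.811, 1.0, 0.811, 0.081
```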

Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain. Let $p_i$ be the probability that an arbitrary tuple in D belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$.

Expected information (entropy) needed to classify a tuple in D:

$$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

Information needed (after using A to split D into v partitions) to classify D:

$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$$

Information gained by branching on attribute A:

$$Gain(A) = Info(D) - Info_A(D)$$

Attribute Selection: Information Gain

Class P: buys_computer = "yes"; Class N: buys_computer = "no" (using the buys_computer training data above). How do we select the first attribute?

Expected information needed to classify a tuple in D:

$$Info(D) = I(9,5) = -\frac{9}{14}\log_2\!\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\!\left(\frac{5}{14}\right) = 0.940$$

Look at "age":

age pi ni I(pi, ni)
<=30 2 3 0.971
31…40 4 0 0
>40 3 2 0.971

$$Info_{age}(D) = \frac{5}{14}I(2,3) + \frac{4}{14}I(4,0) + \frac{5}{14}I(3,2) = 0.694$$

Here $\frac{5}{14}I(2,3)$ means "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's.

$$Gain(age) = Info(D) - Info_{age}(D) = 0.246$$

Similarly, $Gain(income) = 0.029$, $Gain(student) = 0.151$, and $Gain(credit\_rating) = 0.048$, so "age" gives the highest gain and is selected as the first splitting attribute.
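The numbers above can be verified with a few lines of Python over the same 14-row table; this is a sketch for checking the arithmetic, not part of the original slides:

```python
import math
from collections import Counter

rows = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]

def info(subset):
    """Info(D) over the class label (last column)."""
    counts = Counter(label for *_, label in subset)
    return -sum(c / len(subset) * math.log2(c / len(subset))
                for c in counts.values())

def gain(col):
    """Gain(A) = Info(D) - Info_A(D) for the attribute in column col."""
    split = Counter(r[col] for r in rows)
    info_a = sum(n / len(rows) * info([r for r in rows if r[col] == v])
                 for v, n in split.items())
    return info(rows) - info_a

for name, col in [("age", 0), ("income", 1), ("student", 2), ("credit_rating", 3)]:
    print(name, round(gain(col), 3))
# age 0.246, income 0.029, student 0.151, credit_rating 0.048
```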

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Page 2: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

2

Chapter 3 Data Preprocessing

Data Preprocessing An Overview

Data Cleaning

Data Integration

Data Reduction and Transformation

Dimensionality Reduction

Summary

3

Data Transformation

A function that maps the entire set of values of a given attribute to a new set of replacement values st each old value can be identified with one of the new values

4

Data Transformation

A function that maps the entire set of values of a given attribute to a new set of replacement values st each old value can be identified with one of the new values

Methods

Smoothing Remove noise from data

Attributefeature construction New attributes constructed from the given ones

Aggregation Summarization data cube construction

Normalization Scaled to fall within a smaller specified range min-max normalization z-score normalization normalization by decimal scaling

Discretization Concept hierarchy climbing

5

Normalization

Min-max normalization to [new_minA new_maxA]

Ex Let income range $12000 to $98000 normalized to [00 10] Then $73600 is mapped to

AAA

AA

A minnewminnewmaxnewminmax

minvv _)__( +minusminus

minus=

71600)001(00012000980001260073

=+minusminusminus

6

Normalization

Min-max normalization to [new_minA new_maxA]

Z-score normalization (μ mean σ standard deviation)

Ex Let μ = 54000 σ = 16000 Then

AAA

AA

A minnewminnewmaxnewminmax

minvv _)__( +minusminus

minus=

A

Avvσmicrominus

= Z-score The distance between the raw score and the population mean in the unit of the standard deviation

225100016

0005460073=

minus

7

Normalization

Min-max normalization to [new_minA new_maxA]

Z-score normalization (μ mean σ standard deviation)

Normalization by decimal scaling

AAA

AA

A minnewminnewmaxnewminmax

minvv _)__( +minusminus

minus=

A

Avvσmicrominus

= Z-score The distance between the raw score and the population mean in the unit of the standard deviation

Where j is the smallest integer such that Max(|νrsquo|) lt 1

8

Discretization

Three types of attributes Nominalmdashvalues from an unordered set eg color profession Ordinalmdashvalues from an ordered set eg military or academic rank Numericmdashreal numbers eg integer or real numbers

Discretization Divide the range of a continuous attribute into intervals Interval labels can then be used to replace actual data values Reduce data size by discretization Supervised vs unsupervised Split (top-down) vs merge (bottom-up) Discretization can be performed recursively on an attribute Prepare for further analysis eg classification

9

Data Discretization Methods

Binning Top-down split unsupervised

Histogram analysis Top-down split unsupervised

Clustering analysis Unsupervised top-down split or bottom-up merge

Decision-tree analysis Supervised top-down split

Correlation (eg χ2) analysis Unsupervised bottom-up merge

Note All the methods can be applied recursively

10

Simple Discretization Binning

Equal-width (distance) partitioning

Divides the range into N intervals of equal size uniform grid

if A and B are the lowest and highest values of the attribute the width of intervals will be W = (B ndashA)N

The most straightforward but outliers may dominate presentation

Skewed data is not handled well

11

Simple Discretization Binning

Equal-width (distance) partitioning

Divides the range into N intervals of equal size uniform grid

if A and B are the lowest and highest values of the attribute the width of intervals will be W = (B ndashA)N

The most straightforward but outliers may dominate presentation

Skewed data is not handled well

Equal-depth (frequency) partitioning

Divides the range into N intervals each containing approximately same number of samples

Good data scaling

Managing categorical attributes can be tricky

12

Example Binning Methods for Data Smoothing

Sorted data for price (in dollars) 4 8 9 15 21 21 24 25 26 28 29 34 Partition into equal-frequency (equi-width) bins

- Bin 1 4 8 9 15- Bin 2 21 21 24 25- Bin 3 26 28 29 34

Smoothing by bin means- Bin 1 9 9 9 9- Bin 2 23 23 23 23- Bin 3 29 29 29 29

Smoothing by bin boundaries- Bin 1 4 4 4 15- Bin 2 21 21 25 25- Bin 3 26 26 26 34

13

Discretization by Classification amp Correlation Analysis

Classification (eg decision tree analysis)

Supervised Given class labels eg cancerous vs benign

Using entropy to determine split point (discretization point)

Top-down recursive split

Details to be covered in ldquoClassificationrdquo sessions

14

Chapter 3 Data Preprocessing

Data Preprocessing An Overview

Data Cleaning

Data Integration

Data Reduction and Transformation

Dimensionality Reduction

Summary

15

Dimensionality Reduction

Curse of dimensionality When dimensionality increases data becomes increasingly sparse Density and distance between points which is critical to clustering outlier analysis

becomes less meaningful The possible combinations of subspaces will grow exponentially

16

Dimensionality Reduction

Curse of dimensionality When dimensionality increases data becomes increasingly sparse Density and distance between points which is critical to clustering outlier analysis

becomes less meaningful The possible combinations of subspaces will grow exponentially

Dimensionality reduction Reducing the number of random variables under consideration via obtaining a set

of principal variables

17

Dimensionality Reduction

Curse of dimensionality When dimensionality increases data becomes increasingly sparse Density and distance between points which is critical to clustering outlier analysis becomes less

meaningful The possible combinations of subspaces will grow exponentially

Dimensionality reduction Reducing the number of random variables under consideration via obtaining a set of principal

variables

Advantages of dimensionality reduction Avoid the curse of dimensionality Help eliminate irrelevant features and reduce noise Reduce time and space required in data mining Allow easier visualization

18

Dimensionality Reduction Techniques

Dimensionality reduction methodologies

Feature selection Find a subset of the original variables (or features attributes)

Feature extraction Transform the data in the high-dimensional space to a space of fewer dimensions

Some typical dimensionality reduction methods

Principal Component Analysis

Supervised and nonlinear techniques

Feature subset selection

Feature creation

19

PCA A statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components

The original data are projected onto a much smaller space resulting in dimensionality reduction

Method Find the eigenvectors of the covariance matrix and these eigenvectors define the new space

Ball travels in a straight line Data from three cameras contain much redundancy

Principal Component Analysis (PCA)

21

Principal Components Analysis Intuition

Goal is to find a projection that captures the largest amount of variation in data

Find the eigenvectors of the covariance matrix The eigenvectors define the new space

x2

x1

e

22

Principal Component Analysis Details

Let A be an n timesn matrix representing the correlation or covariance of the data λ is an eigenvalue of A if there exists a non-zero vector v such that

Av = λ v often rewritten as (A- λI)v=0

In this case vector v is called an eigenvector of A corresponding to λ For each eigenvalue λ the set of all vectors v satisfying Av = λ v is called the eigenspace of A corresponding to λ

23

Attribute Subset Selection

Another way to reduce dimensionality of data

Redundant attributes Duplicate much or all of the information contained in

one or more other attributes

Eg purchase price of a product and the amount of sales tax paid

Irrelevant attributes Contain no information that is useful for the data

mining task at hand

Ex A studentrsquos ID is often irrelevant to the task of predicting hisher GPA

24

Heuristic Search in Attribute Selection

There are 2d possible attribute combinations of d attributes Typical heuristic attribute selection methods

Best single attribute under the attribute independence assumption choose by significance tests

Best step-wise feature selection The best single-attribute is picked first Then next best attribute condition to the first

Step-wise attribute elimination Repeatedly eliminate the worst attribute

Best combined attribute selection and elimination Optimal branch and bound Use attribute elimination and backtracking

25

Attribute Creation (Feature Generation)

Create new attributes (features) that can capture the important information in a data set more effectively than the original ones

Three general methodologies Attribute extraction Domain-specific

Mapping data to new space (see data reduction) Eg Fourier transformation wavelet transformation manifold approaches (not covered)

Attribute construction Combining features (see discriminative frequent patterns in Chapter on ldquoAdvanced

Classificationrdquo) Data discretization

26

Summary

Data quality accuracy completeness consistency timeliness believability interpretability

Data cleaning eg missingnoisy values outliers

Data integration from multiple sources

Entity identification problem Remove redundancies Detect inconsistencies

Data reduction

Dimensionality reduction Numerosity reduction Data compression

Data transformation and data discretization

Normalization Concept hierarchy generation

27

D P Ballou and G K Tayi Enhancing data quality in data warehouse environments Comm of ACM 4273-78 1999

T Dasu and T Johnson Exploratory Data Mining and Data Cleaning John Wiley 2003 T Dasu T Johnson S Muthukrishnan V Shkapenyuk Mining Database Structure Or How to Build a Data

Quality Browser SIGMODrsquo02 H V Jagadish et al Special Issue on Data Reduction Techniques Bulletin of the Technical Committee on

Data Engineering 20(4) Dec 1997 D Pyle Data Preparation for Data Mining Morgan Kaufmann 1999 E Rahm and H H Do Data Cleaning Problems and Current Approaches IEEE Bulletin of the Technical

Committee on Data Engineering Vol23 No4 V Raman and J Hellerstein Potters Wheel An Interactive Framework for Data Cleaning and

Transformation VLDBrsquo2001 T Redman Data Quality Management and Technology Bantam Books 1992 R Wang V Storey and C Firth A framework for analysis of data quality research IEEE Trans

Knowledge and Data Engineering 7623-640 1995

References

CS 412 INTRO TO DATA MINING

Classification Basic Concepts Huan Sun CSEThe Ohio State University

09052017

28Slides adapted from UIUC CS412 Fall 2017 by Prof Jiawei Han

29

Classification Basic Concepts Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

30

Supervised vs Unsupervised Learning Supervised learning (classification)

Supervision The training data (observations measurements etc) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

31

Supervised vs Unsupervised Learning Supervised learning (classification)

Supervision The training data (observations measurements etc) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

Unsupervised learning (clustering)

The class labels of training data is unknown

Given a set of measurements observations etc with the aim of establishing the

existence of classes or clusters in the data

32

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

33

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

Typical applications

Creditloan approval

Medical diagnosis if a tumor is cancerous or benign

Fraud detection if a transaction is fraudulent

Web page categorization which category it is

34

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

35

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

36

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

Note If the test set is used to selectrefine models it is called validation (test) set or development test set

37

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

Classifier(Model)

Sheet1

38

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo

Classifier(Model)

Sheet1

39

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Sheet1

40

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

NewUnseen Data

(Jeff Professor 4)

Tenured

Sheet1

41

Classification Basic Concepts

Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

42

Decision Tree Induction An Example

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis)

Sheet1

43

Decision Tree Induction An Example

age

overcast

student credit rating

lt=30 gt40

no yes yes

yes

3140

fairexcellentyesno

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis) Resulting tree

Sheet1

44

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain)

45

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain) Conditions for stopping partitioning

All samples for a given node belong to the same class There are no remaining attributes for further partitioningmdashmajority voting is

employed for classifying the leaf There are no samples left

46

Brief Review of Entropy Entropy (Information Theory)

A measure of uncertainty associated with a random number Calculation For a discrete random variable Y taking m distinct values y1 y2 hellip ym

Interpretation Higher entropy rarr higher uncertainty Lower entropy rarr lower uncertainty

Conditional entropy

m = 2

47

Attribute Selection Measure Information Gain (ID3C45)

Select the attribute with the highest information gain Let pi be the probability that an arbitrary tuple in D belongs to class Ci

estimated by |Ci D||D| Expected information (entropy) needed to classify a tuple in D

Information needed (after using A to split D into v partitions) to classify D

Information gained by branching on attribute A

)(log)( 21

i

m

ii ppDInfo sum

=

minus=

)(||||

)(1

j

v

j

jA DInfo

DD

DInfo times=sum=

(D)InfoInfo(D)Gain(A) Aminus=

48

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

How to select the first attribute

Sheet1

49

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

Sheet1

50

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

Sheet1

51

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

Sheet1

52

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

means ldquoage lt=30rdquo has 5 out of 14 samples with 2 yesrsquoes and 3 norsquos

)32(145 I

53

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Sheet1

54

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Similarly

0480)_(1510)(0290)(

===

ratingcreditGainstudentGainincomeGain How

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
Page 4: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

4

Data Transformation

A function that maps the entire set of values of a given attribute to a new set of replacement values st each old value can be identified with one of the new values

Methods

Smoothing Remove noise from data

Attributefeature construction New attributes constructed from the given ones

Aggregation Summarization data cube construction

Normalization Scaled to fall within a smaller specified range min-max normalization z-score normalization normalization by decimal scaling

Discretization Concept hierarchy climbing

5

Normalization

Min-max normalization to [new_minA new_maxA]

Ex Let income range $12000 to $98000 normalized to [00 10] Then $73600 is mapped to

AAA

AA

A minnewminnewmaxnewminmax

minvv _)__( +minusminus

minus=

71600)001(00012000980001260073

=+minusminusminus

6

Normalization

Min-max normalization to [new_minA new_maxA]

Z-score normalization (μ mean σ standard deviation)

Ex Let μ = 54000 σ = 16000 Then

AAA

AA

A minnewminnewmaxnewminmax

minvv _)__( +minusminus

minus=

A

Avvσmicrominus

= Z-score The distance between the raw score and the population mean in the unit of the standard deviation

225100016

0005460073=

minus

7

Normalization

Min-max normalization to [new_minA new_maxA]

Z-score normalization (μ mean σ standard deviation)

Normalization by decimal scaling

AAA

AA

A minnewminnewmaxnewminmax

minvv _)__( +minusminus

minus=

A

Avvσmicrominus

= Z-score The distance between the raw score and the population mean in the unit of the standard deviation

Where j is the smallest integer such that Max(|νrsquo|) lt 1

8

Discretization

Three types of attributes Nominalmdashvalues from an unordered set eg color profession Ordinalmdashvalues from an ordered set eg military or academic rank Numericmdashreal numbers eg integer or real numbers

Discretization Divide the range of a continuous attribute into intervals Interval labels can then be used to replace actual data values Reduce data size by discretization Supervised vs unsupervised Split (top-down) vs merge (bottom-up) Discretization can be performed recursively on an attribute Prepare for further analysis eg classification

9

Data Discretization Methods

Binning Top-down split unsupervised

Histogram analysis Top-down split unsupervised

Clustering analysis Unsupervised top-down split or bottom-up merge

Decision-tree analysis Supervised top-down split

Correlation (eg χ2) analysis Unsupervised bottom-up merge

Note All the methods can be applied recursively

10

Simple Discretization Binning

Equal-width (distance) partitioning

Divides the range into N intervals of equal size uniform grid

if A and B are the lowest and highest values of the attribute the width of intervals will be W = (B ndashA)N

The most straightforward but outliers may dominate presentation

Skewed data is not handled well

11

Simple Discretization Binning

Equal-width (distance) partitioning

Divides the range into N intervals of equal size uniform grid

if A and B are the lowest and highest values of the attribute the width of intervals will be W = (B ndashA)N

The most straightforward but outliers may dominate presentation

Skewed data is not handled well

Equal-depth (frequency) partitioning

Divides the range into N intervals each containing approximately same number of samples

Good data scaling

Managing categorical attributes can be tricky

12

Example Binning Methods for Data Smoothing

Sorted data for price (in dollars) 4 8 9 15 21 21 24 25 26 28 29 34 Partition into equal-frequency (equi-width) bins

- Bin 1 4 8 9 15- Bin 2 21 21 24 25- Bin 3 26 28 29 34

Smoothing by bin means- Bin 1 9 9 9 9- Bin 2 23 23 23 23- Bin 3 29 29 29 29

Smoothing by bin boundaries- Bin 1 4 4 4 15- Bin 2 21 21 25 25- Bin 3 26 26 26 34

13

Discretization by Classification amp Correlation Analysis

Classification (eg decision tree analysis)

Supervised Given class labels eg cancerous vs benign

Using entropy to determine split point (discretization point)

Top-down recursive split

Details to be covered in ldquoClassificationrdquo sessions

14

Chapter 3 Data Preprocessing

Data Preprocessing An Overview

Data Cleaning

Data Integration

Data Reduction and Transformation

Dimensionality Reduction

Summary

15

Dimensionality Reduction

Curse of dimensionality When dimensionality increases data becomes increasingly sparse Density and distance between points which is critical to clustering outlier analysis

becomes less meaningful The possible combinations of subspaces will grow exponentially

16

Dimensionality Reduction

Curse of dimensionality When dimensionality increases data becomes increasingly sparse Density and distance between points which is critical to clustering outlier analysis

becomes less meaningful The possible combinations of subspaces will grow exponentially

Dimensionality reduction Reducing the number of random variables under consideration via obtaining a set

of principal variables

17

Dimensionality Reduction

Curse of dimensionality When dimensionality increases data becomes increasingly sparse Density and distance between points which is critical to clustering outlier analysis becomes less

meaningful The possible combinations of subspaces will grow exponentially

Dimensionality reduction Reducing the number of random variables under consideration via obtaining a set of principal

variables

Advantages of dimensionality reduction Avoid the curse of dimensionality Help eliminate irrelevant features and reduce noise Reduce time and space required in data mining Allow easier visualization

18

Dimensionality Reduction Techniques

Dimensionality reduction methodologies

Feature selection Find a subset of the original variables (or features attributes)

Feature extraction Transform the data in the high-dimensional space to a space of fewer dimensions

Some typical dimensionality reduction methods

Principal Component Analysis

Supervised and nonlinear techniques

Feature subset selection

Feature creation

19

PCA A statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components

The original data are projected onto a much smaller space resulting in dimensionality reduction

Method Find the eigenvectors of the covariance matrix and these eigenvectors define the new space

Ball travels in a straight line Data from three cameras contain much redundancy

Principal Component Analysis (PCA)

21

Principal Components Analysis Intuition

Goal is to find a projection that captures the largest amount of variation in data

Find the eigenvectors of the covariance matrix The eigenvectors define the new space

x2

x1

e

22

Principal Component Analysis Details

Let A be an n timesn matrix representing the correlation or covariance of the data λ is an eigenvalue of A if there exists a non-zero vector v such that

Av = λ v often rewritten as (A- λI)v=0

In this case vector v is called an eigenvector of A corresponding to λ For each eigenvalue λ the set of all vectors v satisfying Av = λ v is called the eigenspace of A corresponding to λ

23

Attribute Subset Selection

Another way to reduce dimensionality of data

Redundant attributes Duplicate much or all of the information contained in

one or more other attributes

Eg purchase price of a product and the amount of sales tax paid

Irrelevant attributes Contain no information that is useful for the data

mining task at hand

Ex A studentrsquos ID is often irrelevant to the task of predicting hisher GPA

24

Heuristic Search in Attribute Selection

There are 2d possible attribute combinations of d attributes Typical heuristic attribute selection methods

Best single attribute under the attribute independence assumption choose by significance tests

Best step-wise feature selection The best single-attribute is picked first Then next best attribute condition to the first

Step-wise attribute elimination Repeatedly eliminate the worst attribute

Best combined attribute selection and elimination Optimal branch and bound Use attribute elimination and backtracking

25

Attribute Creation (Feature Generation)

Create new attributes (features) that can capture the important information in a data set more effectively than the original ones

Three general methodologies Attribute extraction Domain-specific

Mapping data to new space (see data reduction) Eg Fourier transformation wavelet transformation manifold approaches (not covered)

Attribute construction Combining features (see discriminative frequent patterns in Chapter on ldquoAdvanced

Classificationrdquo) Data discretization

26

Summary

Data quality accuracy completeness consistency timeliness believability interpretability

Data cleaning eg missingnoisy values outliers

Data integration from multiple sources

Entity identification problem Remove redundancies Detect inconsistencies

Data reduction

Dimensionality reduction Numerosity reduction Data compression

Data transformation and data discretization

Normalization Concept hierarchy generation

27

D P Ballou and G K Tayi Enhancing data quality in data warehouse environments Comm of ACM 4273-78 1999

T Dasu and T Johnson Exploratory Data Mining and Data Cleaning John Wiley 2003 T Dasu T Johnson S Muthukrishnan V Shkapenyuk Mining Database Structure Or How to Build a Data

Quality Browser SIGMODrsquo02 H V Jagadish et al Special Issue on Data Reduction Techniques Bulletin of the Technical Committee on

Data Engineering 20(4) Dec 1997 D Pyle Data Preparation for Data Mining Morgan Kaufmann 1999 E Rahm and H H Do Data Cleaning Problems and Current Approaches IEEE Bulletin of the Technical

Committee on Data Engineering Vol23 No4 V Raman and J Hellerstein Potters Wheel An Interactive Framework for Data Cleaning and

Transformation VLDBrsquo2001 T Redman Data Quality Management and Technology Bantam Books 1992 R Wang V Storey and C Firth A framework for analysis of data quality research IEEE Trans

Knowledge and Data Engineering 7623-640 1995

References

CS 412 INTRO TO DATA MINING

Classification Basic Concepts Huan Sun CSEThe Ohio State University

09052017

28Slides adapted from UIUC CS412 Fall 2017 by Prof Jiawei Han

29

Classification Basic Concepts Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

30

Supervised vs Unsupervised Learning Supervised learning (classification)

Supervision The training data (observations measurements etc) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

31

Supervised vs Unsupervised Learning Supervised learning (classification)

Supervision The training data (observations measurements etc) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

Unsupervised learning (clustering)

The class labels of training data is unknown

Given a set of measurements observations etc with the aim of establishing the

existence of classes or clusters in the data

32

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

33

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

Typical applications

Creditloan approval

Medical diagnosis if a tumor is cancerous or benign

Fraud detection if a transaction is fraudulent

Web page categorization which category it is

34

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

35

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

36

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

Note If the test set is used to selectrefine models it is called validation (test) set or development test set

37

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

Classifier(Model)

Sheet1

38

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo

Classifier(Model)

Sheet1

39

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Sheet1

40

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

NewUnseen Data

(Jeff Professor 4)

Tenured

Sheet1

41

Classification Basic Concepts

Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

42

Decision Tree Induction An Example

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis)

Sheet1

43

Decision Tree Induction An Example

age

overcast

student credit rating

lt=30 gt40

no yes yes

yes

3140

fairexcellentyesno

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis) Resulting tree

Sheet1

44

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain)

45

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain) Conditions for stopping partitioning

All samples for a given node belong to the same class There are no remaining attributes for further partitioningmdashmajority voting is

employed for classifying the leaf There are no samples left

46

Brief Review of Entropy Entropy (Information Theory)

A measure of uncertainty associated with a random number Calculation For a discrete random variable Y taking m distinct values y1 y2 hellip ym

Interpretation Higher entropy rarr higher uncertainty Lower entropy rarr lower uncertainty

Conditional entropy

m = 2

47

Attribute Selection Measure Information Gain (ID3C45)

Select the attribute with the highest information gain Let pi be the probability that an arbitrary tuple in D belongs to class Ci

estimated by |Ci D||D| Expected information (entropy) needed to classify a tuple in D

Information needed (after using A to split D into v partitions) to classify D

Information gained by branching on attribute A

)(log)( 21

i

m

ii ppDInfo sum

=

minus=

)(||||

)(1

j

v

j

jA DInfo

DD

DInfo times=sum=

(D)InfoInfo(D)Gain(A) Aminus=

48

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

How to select the first attribute

Sheet1

49

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

Sheet1

50

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

Sheet1

51

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

Sheet1

52

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

means ldquoage lt=30rdquo has 5 out of 14 samples with 2 yesrsquoes and 3 norsquos

)32(145 I

53

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Sheet1

54

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Similarly

0480)_(1510)(0290)(

===

ratingcreditGainstudentGainincomeGain How

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
Page 5: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

5

Normalization

Min-max normalization to [new_minA new_maxA]

Ex Let income range $12000 to $98000 normalized to [00 10] Then $73600 is mapped to

AAA

AA

A minnewminnewmaxnewminmax

minvv _)__( +minusminus

minus=

71600)001(00012000980001260073

=+minusminusminus

6

Normalization

Min-max normalization to [new_minA new_maxA]

Z-score normalization (μ mean σ standard deviation)

Ex Let μ = 54000 σ = 16000 Then

AAA

AA

A minnewminnewmaxnewminmax

minvv _)__( +minusminus

minus=

A

Avvσmicrominus

= Z-score The distance between the raw score and the population mean in the unit of the standard deviation

225100016

0005460073=

minus

7

Normalization

Min-max normalization to [new_minA new_maxA]

Z-score normalization (μ mean σ standard deviation)

Normalization by decimal scaling

AAA

AA

A minnewminnewmaxnewminmax

minvv _)__( +minusminus

minus=

A

Avvσmicrominus

= Z-score The distance between the raw score and the population mean in the unit of the standard deviation

Where j is the smallest integer such that Max(|νrsquo|) lt 1

8

Discretization

Three types of attributes Nominalmdashvalues from an unordered set eg color profession Ordinalmdashvalues from an ordered set eg military or academic rank Numericmdashreal numbers eg integer or real numbers

Discretization Divide the range of a continuous attribute into intervals Interval labels can then be used to replace actual data values Reduce data size by discretization Supervised vs unsupervised Split (top-down) vs merge (bottom-up) Discretization can be performed recursively on an attribute Prepare for further analysis eg classification

9

Data Discretization Methods

Binning Top-down split unsupervised

Histogram analysis Top-down split unsupervised

Clustering analysis Unsupervised top-down split or bottom-up merge

Decision-tree analysis Supervised top-down split

Correlation (eg χ2) analysis Unsupervised bottom-up merge

Note All the methods can be applied recursively

10

Simple Discretization Binning

Equal-width (distance) partitioning

Divides the range into N intervals of equal size uniform grid

if A and B are the lowest and highest values of the attribute the width of intervals will be W = (B ndashA)N

The most straightforward but outliers may dominate presentation

Skewed data is not handled well

11

Simple Discretization Binning

Equal-width (distance) partitioning

Divides the range into N intervals of equal size uniform grid

if A and B are the lowest and highest values of the attribute the width of intervals will be W = (B ndashA)N

The most straightforward but outliers may dominate presentation

Skewed data is not handled well

Equal-depth (frequency) partitioning

Divides the range into N intervals each containing approximately same number of samples

Good data scaling

Managing categorical attributes can be tricky

12

Example Binning Methods for Data Smoothing

Sorted data for price (in dollars) 4 8 9 15 21 21 24 25 26 28 29 34 Partition into equal-frequency (equi-width) bins

- Bin 1 4 8 9 15- Bin 2 21 21 24 25- Bin 3 26 28 29 34

Smoothing by bin means- Bin 1 9 9 9 9- Bin 2 23 23 23 23- Bin 3 29 29 29 29

Smoothing by bin boundaries- Bin 1 4 4 4 15- Bin 2 21 21 25 25- Bin 3 26 26 26 34

13

Discretization by Classification amp Correlation Analysis

Classification (eg decision tree analysis)

Supervised Given class labels eg cancerous vs benign

Using entropy to determine split point (discretization point)

Top-down recursive split

Details to be covered in ldquoClassificationrdquo sessions

14

Chapter 3 Data Preprocessing

Data Preprocessing An Overview

Data Cleaning

Data Integration

Data Reduction and Transformation

Dimensionality Reduction

Summary

15

Dimensionality Reduction

Curse of dimensionality When dimensionality increases data becomes increasingly sparse Density and distance between points which is critical to clustering outlier analysis

becomes less meaningful The possible combinations of subspaces will grow exponentially

16

Dimensionality Reduction

Curse of dimensionality When dimensionality increases data becomes increasingly sparse Density and distance between points which is critical to clustering outlier analysis

becomes less meaningful The possible combinations of subspaces will grow exponentially

Dimensionality reduction Reducing the number of random variables under consideration via obtaining a set

of principal variables

17

Dimensionality Reduction

Curse of dimensionality When dimensionality increases data becomes increasingly sparse Density and distance between points which is critical to clustering outlier analysis becomes less

meaningful The possible combinations of subspaces will grow exponentially

Dimensionality reduction Reducing the number of random variables under consideration via obtaining a set of principal

variables

Advantages of dimensionality reduction Avoid the curse of dimensionality Help eliminate irrelevant features and reduce noise Reduce time and space required in data mining Allow easier visualization

18

Dimensionality Reduction Techniques

Dimensionality reduction methodologies

Feature selection Find a subset of the original variables (or features attributes)

Feature extraction Transform the data in the high-dimensional space to a space of fewer dimensions

Some typical dimensionality reduction methods

Principal Component Analysis

Supervised and nonlinear techniques

Feature subset selection

Feature creation

19

PCA A statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components

The original data are projected onto a much smaller space resulting in dimensionality reduction

Method Find the eigenvectors of the covariance matrix and these eigenvectors define the new space

Ball travels in a straight line Data from three cameras contain much redundancy

Principal Component Analysis (PCA)

21

Principal Components Analysis Intuition

Goal is to find a projection that captures the largest amount of variation in data

Find the eigenvectors of the covariance matrix The eigenvectors define the new space

x2

x1

e

22

Principal Component Analysis Details

Let A be an n timesn matrix representing the correlation or covariance of the data λ is an eigenvalue of A if there exists a non-zero vector v such that

Av = λ v often rewritten as (A- λI)v=0

In this case vector v is called an eigenvector of A corresponding to λ For each eigenvalue λ the set of all vectors v satisfying Av = λ v is called the eigenspace of A corresponding to λ

23

Attribute Subset Selection

Another way to reduce dimensionality of data

Redundant attributes Duplicate much or all of the information contained in

one or more other attributes

Eg purchase price of a product and the amount of sales tax paid

Irrelevant attributes Contain no information that is useful for the data

mining task at hand

Ex A studentrsquos ID is often irrelevant to the task of predicting hisher GPA

24

Heuristic Search in Attribute Selection

There are 2d possible attribute combinations of d attributes Typical heuristic attribute selection methods

Best single attribute under the attribute independence assumption choose by significance tests

Best step-wise feature selection The best single-attribute is picked first Then next best attribute condition to the first

Step-wise attribute elimination Repeatedly eliminate the worst attribute

Best combined attribute selection and elimination Optimal branch and bound Use attribute elimination and backtracking

25

Attribute Creation (Feature Generation)

Create new attributes (features) that can capture the important information in a data set more effectively than the original ones

Three general methodologies Attribute extraction Domain-specific

Mapping data to new space (see data reduction) Eg Fourier transformation wavelet transformation manifold approaches (not covered)

Attribute construction Combining features (see discriminative frequent patterns in Chapter on ldquoAdvanced

Classificationrdquo) Data discretization

26

Summary

Data quality accuracy completeness consistency timeliness believability interpretability

Data cleaning eg missingnoisy values outliers

Data integration from multiple sources

Entity identification problem Remove redundancies Detect inconsistencies

Data reduction

Dimensionality reduction Numerosity reduction Data compression

Data transformation and data discretization

Normalization Concept hierarchy generation

27

D P Ballou and G K Tayi Enhancing data quality in data warehouse environments Comm of ACM 4273-78 1999

T Dasu and T Johnson Exploratory Data Mining and Data Cleaning John Wiley 2003 T Dasu T Johnson S Muthukrishnan V Shkapenyuk Mining Database Structure Or How to Build a Data

Quality Browser SIGMODrsquo02 H V Jagadish et al Special Issue on Data Reduction Techniques Bulletin of the Technical Committee on

Data Engineering 20(4) Dec 1997 D Pyle Data Preparation for Data Mining Morgan Kaufmann 1999 E Rahm and H H Do Data Cleaning Problems and Current Approaches IEEE Bulletin of the Technical

Committee on Data Engineering Vol23 No4 V Raman and J Hellerstein Potters Wheel An Interactive Framework for Data Cleaning and

Transformation VLDBrsquo2001 T Redman Data Quality Management and Technology Bantam Books 1992 R Wang V Storey and C Firth A framework for analysis of data quality research IEEE Trans

Knowledge and Data Engineering 7623-640 1995

References

CS 412 INTRO TO DATA MINING

Classification Basic Concepts Huan Sun CSEThe Ohio State University

09052017

28Slides adapted from UIUC CS412 Fall 2017 by Prof Jiawei Han

29

Classification Basic Concepts Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

30

Supervised vs Unsupervised Learning Supervised learning (classification)

Supervision The training data (observations measurements etc) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

31

Supervised vs Unsupervised Learning Supervised learning (classification)

Supervision The training data (observations measurements etc) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

Unsupervised learning (clustering)

The class labels of training data is unknown

Given a set of measurements observations etc with the aim of establishing the

existence of classes or clusters in the data

32

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

33

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

Typical applications

Creditloan approval

Medical diagnosis if a tumor is cancerous or benign

Fraud detection if a transaction is fraudulent

Web page categorization which category it is

34

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

35

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

36

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

Note If the test set is used to selectrefine models it is called validation (test) set or development test set

37

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

Classifier(Model)

Sheet1

38

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo

Classifier(Model)

Sheet1

39

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Sheet1

40

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

NewUnseen Data

(Jeff Professor 4)

Tenured

Sheet1

41

Classification Basic Concepts

Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

42

Decision Tree Induction An Example

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis)

Sheet1

43

Decision Tree Induction An Example

age

overcast

student credit rating

lt=30 gt40

no yes yes

yes

3140

fairexcellentyesno

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis) Resulting tree

Sheet1

44

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain)

45

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain) Conditions for stopping partitioning

All samples for a given node belong to the same class There are no remaining attributes for further partitioningmdashmajority voting is

employed for classifying the leaf There are no samples left

46

Brief Review of Entropy Entropy (Information Theory)

A measure of uncertainty associated with a random number Calculation For a discrete random variable Y taking m distinct values y1 y2 hellip ym

Interpretation Higher entropy rarr higher uncertainty Lower entropy rarr lower uncertainty

Conditional entropy

m = 2

47

Attribute Selection Measure Information Gain (ID3C45)

Select the attribute with the highest information gain Let pi be the probability that an arbitrary tuple in D belongs to class Ci

estimated by |Ci D||D| Expected information (entropy) needed to classify a tuple in D

Information needed (after using A to split D into v partitions) to classify D

Information gained by branching on attribute A

)(log)( 21

i

m

ii ppDInfo sum

=

minus=

)(||||

)(1

j

v

j

jA DInfo

DD

DInfo times=sum=

(D)InfoInfo(D)Gain(A) Aminus=

48

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

How to select the first attribute

Sheet1

49

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

Sheet1

50

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

Sheet1

51

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

Sheet1

52

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

means ldquoage lt=30rdquo has 5 out of 14 samples with 2 yesrsquoes and 3 norsquos

)32(145 I

53

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Sheet1

54

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Similarly

0480)_(1510)(0290)(

===

ratingcreditGainstudentGainincomeGain How

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Page 6: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

6

Normalization

Min-max normalization to [new_minA new_maxA]

Z-score normalization (μ mean σ standard deviation)

Ex Let μ = 54000 σ = 16000 Then

AAA

AA

A minnewminnewmaxnewminmax

minvv _)__( +minusminus

minus=

A

Avvσmicrominus

= Z-score The distance between the raw score and the population mean in the unit of the standard deviation

225100016

0005460073=

minus

7

Normalization

Min-max normalization to [new_minA new_maxA]

Z-score normalization (μ mean σ standard deviation)

Normalization by decimal scaling

AAA

AA

A minnewminnewmaxnewminmax

minvv _)__( +minusminus

minus=

A

Avvσmicrominus

= Z-score The distance between the raw score and the population mean in the unit of the standard deviation

Where j is the smallest integer such that Max(|νrsquo|) lt 1

8

Discretization

Three types of attributes Nominalmdashvalues from an unordered set eg color profession Ordinalmdashvalues from an ordered set eg military or academic rank Numericmdashreal numbers eg integer or real numbers

Discretization Divide the range of a continuous attribute into intervals Interval labels can then be used to replace actual data values Reduce data size by discretization Supervised vs unsupervised Split (top-down) vs merge (bottom-up) Discretization can be performed recursively on an attribute Prepare for further analysis eg classification

9

Data Discretization Methods

Binning Top-down split unsupervised

Histogram analysis Top-down split unsupervised

Clustering analysis Unsupervised top-down split or bottom-up merge

Decision-tree analysis Supervised top-down split

Correlation (eg χ2) analysis Unsupervised bottom-up merge

Note All the methods can be applied recursively

10

Simple Discretization Binning

Equal-width (distance) partitioning

Divides the range into N intervals of equal size uniform grid

if A and B are the lowest and highest values of the attribute the width of intervals will be W = (B ndashA)N

The most straightforward but outliers may dominate presentation

Skewed data is not handled well

11

Simple Discretization Binning

Equal-width (distance) partitioning

Divides the range into N intervals of equal size uniform grid

if A and B are the lowest and highest values of the attribute the width of intervals will be W = (B ndashA)N

The most straightforward but outliers may dominate presentation

Skewed data is not handled well

Equal-depth (frequency) partitioning

Divides the range into N intervals each containing approximately same number of samples

Good data scaling

Managing categorical attributes can be tricky

12

Example Binning Methods for Data Smoothing

Sorted data for price (in dollars) 4 8 9 15 21 21 24 25 26 28 29 34 Partition into equal-frequency (equi-width) bins

- Bin 1 4 8 9 15- Bin 2 21 21 24 25- Bin 3 26 28 29 34

Smoothing by bin means- Bin 1 9 9 9 9- Bin 2 23 23 23 23- Bin 3 29 29 29 29

Smoothing by bin boundaries- Bin 1 4 4 4 15- Bin 2 21 21 25 25- Bin 3 26 26 26 34

13

Discretization by Classification amp Correlation Analysis

Classification (eg decision tree analysis)

Supervised Given class labels eg cancerous vs benign

Using entropy to determine split point (discretization point)

Top-down recursive split

Details to be covered in ldquoClassificationrdquo sessions

14

Chapter 3 Data Preprocessing

Data Preprocessing An Overview

Data Cleaning

Data Integration

Data Reduction and Transformation

Dimensionality Reduction

Summary

15

Dimensionality Reduction

Curse of dimensionality When dimensionality increases data becomes increasingly sparse Density and distance between points which is critical to clustering outlier analysis

becomes less meaningful The possible combinations of subspaces will grow exponentially

16

Dimensionality Reduction

Curse of dimensionality When dimensionality increases data becomes increasingly sparse Density and distance between points which is critical to clustering outlier analysis

becomes less meaningful The possible combinations of subspaces will grow exponentially

Dimensionality reduction Reducing the number of random variables under consideration via obtaining a set

of principal variables

17

Dimensionality Reduction

Curse of dimensionality When dimensionality increases data becomes increasingly sparse Density and distance between points which is critical to clustering outlier analysis becomes less

meaningful The possible combinations of subspaces will grow exponentially

Dimensionality reduction Reducing the number of random variables under consideration via obtaining a set of principal

variables

Advantages of dimensionality reduction Avoid the curse of dimensionality Help eliminate irrelevant features and reduce noise Reduce time and space required in data mining Allow easier visualization

18

Dimensionality Reduction Techniques

Dimensionality reduction methodologies

Feature selection Find a subset of the original variables (or features attributes)

Feature extraction Transform the data in the high-dimensional space to a space of fewer dimensions

Some typical dimensionality reduction methods

Principal Component Analysis

Supervised and nonlinear techniques

Feature subset selection

Feature creation

19

PCA A statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components

The original data are projected onto a much smaller space resulting in dimensionality reduction

Method Find the eigenvectors of the covariance matrix and these eigenvectors define the new space

Ball travels in a straight line Data from three cameras contain much redundancy

Principal Component Analysis (PCA)

21

Principal Components Analysis Intuition

Goal is to find a projection that captures the largest amount of variation in data

Find the eigenvectors of the covariance matrix The eigenvectors define the new space

x2

x1

e

22

Principal Component Analysis Details

Let A be an n timesn matrix representing the correlation or covariance of the data λ is an eigenvalue of A if there exists a non-zero vector v such that

Av = λ v often rewritten as (A- λI)v=0

In this case vector v is called an eigenvector of A corresponding to λ For each eigenvalue λ the set of all vectors v satisfying Av = λ v is called the eigenspace of A corresponding to λ

23

Attribute Subset Selection

Another way to reduce dimensionality of data

Redundant attributes Duplicate much or all of the information contained in

one or more other attributes

Eg purchase price of a product and the amount of sales tax paid

Irrelevant attributes Contain no information that is useful for the data

mining task at hand

Ex A studentrsquos ID is often irrelevant to the task of predicting hisher GPA

24

Heuristic Search in Attribute Selection

There are 2d possible attribute combinations of d attributes Typical heuristic attribute selection methods

Best single attribute under the attribute independence assumption choose by significance tests

Best step-wise feature selection The best single-attribute is picked first Then next best attribute condition to the first

Step-wise attribute elimination Repeatedly eliminate the worst attribute

Best combined attribute selection and elimination Optimal branch and bound Use attribute elimination and backtracking

25

Attribute Creation (Feature Generation)

Create new attributes (features) that can capture the important information in a data set more effectively than the original ones

Three general methodologies Attribute extraction Domain-specific

Mapping data to new space (see data reduction) Eg Fourier transformation wavelet transformation manifold approaches (not covered)

Attribute construction Combining features (see discriminative frequent patterns in Chapter on ldquoAdvanced

Classificationrdquo) Data discretization

26

Summary

Data quality accuracy completeness consistency timeliness believability interpretability

Data cleaning eg missingnoisy values outliers

Data integration from multiple sources

Entity identification problem Remove redundancies Detect inconsistencies

Data reduction

Dimensionality reduction Numerosity reduction Data compression

Data transformation and data discretization

Normalization Concept hierarchy generation

27

D P Ballou and G K Tayi Enhancing data quality in data warehouse environments Comm of ACM 4273-78 1999

T Dasu and T Johnson Exploratory Data Mining and Data Cleaning John Wiley 2003 T Dasu T Johnson S Muthukrishnan V Shkapenyuk Mining Database Structure Or How to Build a Data

Quality Browser SIGMODrsquo02 H V Jagadish et al Special Issue on Data Reduction Techniques Bulletin of the Technical Committee on

Data Engineering 20(4) Dec 1997 D Pyle Data Preparation for Data Mining Morgan Kaufmann 1999 E Rahm and H H Do Data Cleaning Problems and Current Approaches IEEE Bulletin of the Technical

Committee on Data Engineering Vol23 No4 V Raman and J Hellerstein Potters Wheel An Interactive Framework for Data Cleaning and

Transformation VLDBrsquo2001 T Redman Data Quality Management and Technology Bantam Books 1992 R Wang V Storey and C Firth A framework for analysis of data quality research IEEE Trans

Knowledge and Data Engineering 7623-640 1995

References

CS 412 INTRO TO DATA MINING

Classification Basic Concepts Huan Sun CSEThe Ohio State University

09052017

28Slides adapted from UIUC CS412 Fall 2017 by Prof Jiawei Han

29

Classification Basic Concepts Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

30

Supervised vs Unsupervised Learning Supervised learning (classification)

Supervision The training data (observations measurements etc) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

31

Supervised vs Unsupervised Learning Supervised learning (classification)

Supervision The training data (observations measurements etc) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

Unsupervised learning (clustering)

The class labels of training data is unknown

Given a set of measurements observations etc with the aim of establishing the

existence of classes or clusters in the data

32

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

33

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

Typical applications

Creditloan approval

Medical diagnosis if a tumor is cancerous or benign

Fraud detection if a transaction is fraudulent

Web page categorization which category it is

34

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

35

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

36

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

Note If the test set is used to selectrefine models it is called validation (test) set or development test set

37

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

Classifier(Model)

Sheet1

38

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo

Classifier(Model)

Sheet1

39

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Sheet1

40

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

NewUnseen Data

(Jeff Professor 4)

Tenured

Sheet1

41

Classification Basic Concepts

Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

42

Decision Tree Induction An Example

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis)

Sheet1

43

Decision Tree Induction An Example

age

overcast

student credit rating

lt=30 gt40

no yes yes

yes

3140

fairexcellentyesno

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis) Resulting tree

Sheet1

44

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain)

45

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain) Conditions for stopping partitioning

All samples for a given node belong to the same class There are no remaining attributes for further partitioningmdashmajority voting is

employed for classifying the leaf There are no samples left

46

Brief Review of Entropy Entropy (Information Theory)

A measure of uncertainty associated with a random number Calculation For a discrete random variable Y taking m distinct values y1 y2 hellip ym

Interpretation Higher entropy rarr higher uncertainty Lower entropy rarr lower uncertainty

Conditional entropy

m = 2

47

Attribute Selection Measure Information Gain (ID3C45)

Select the attribute with the highest information gain Let pi be the probability that an arbitrary tuple in D belongs to class Ci

estimated by |Ci D||D| Expected information (entropy) needed to classify a tuple in D

Information needed (after using A to split D into v partitions) to classify D

Information gained by branching on attribute A

)(log)( 21

i

m

ii ppDInfo sum

=

minus=

)(||||

)(1

j

v

j

jA DInfo

DD

DInfo times=sum=

(D)InfoInfo(D)Gain(A) Aminus=

48

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

How to select the first attribute

Sheet1

49

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

Sheet1

50

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

Sheet1

51

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

Sheet1

52

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

means ldquoage lt=30rdquo has 5 out of 14 samples with 2 yesrsquoes and 3 norsquos

)32(145 I

53

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Sheet1

54

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Similarly

0480)_(1510)(0290)(

===

ratingcreditGainstudentGainincomeGain How

Sheet1

Page 7: CSE 5243 INTRO. TO DATA MINING

7

Normalization

Min-max normalization: to [new_min_A, new_max_A]:

    v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

Z-score normalization (μ: mean, σ: standard deviation):

    v' = (v - μ_A) / σ_A

    Z-score: the distance between the raw score and the population mean, in units of the standard deviation

Normalization by decimal scaling:

    v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
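Each rule translates line-for-line into code; a minimal Python sketch (the function names are ours), with the deck's income figures as a spot check:

    def min_max(v, mn, mx, new_mn=0.0, new_mx=1.0):
        """Min-max normalization of v from [mn, mx] to [new_mn, new_mx]."""
        return (v - mn) / (mx - mn) * (new_mx - new_mn) + new_mn

    def z_score(v, mu, sigma):
        """Z-score: distance from the mean in units of the standard deviation."""
        return (v - mu) / sigma

    def decimal_scaling(values):
        """Divide by 10^j, with j the smallest integer making every |v'| < 1."""
        j = 0
        while max(abs(v) for v in values) / 10 ** j >= 1:
            j += 1
        return [v / 10 ** j for v in values]

    print(round(min_max(73600, 12000, 98000), 3))  # 0.716
    print(round(z_score(73600, 54000, 16000), 3))  # 1.225
    print(decimal_scaling([-986, 917]))            # [-0.986, 0.917]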

8

Discretization

Three types of attributes:
- Nominal: values from an unordered set, e.g., color, profession
- Ordinal: values from an ordered set, e.g., military or academic rank
- Numeric: real numbers, e.g., integer or real numbers

Discretization: divide the range of a continuous attribute into intervals
- Interval labels can then be used to replace actual data values
- Reduce data size by discretization
- Supervised vs. unsupervised
- Split (top-down) vs. merge (bottom-up)
- Discretization can be performed recursively on an attribute
- Prepare for further analysis, e.g., classification

9

Data Discretization Methods

Binning: top-down split, unsupervised

Histogram analysis: top-down split, unsupervised

Clustering analysis: unsupervised, top-down split or bottom-up merge

Decision-tree analysis: supervised, top-down split

Correlation (e.g., χ2) analysis: unsupervised, bottom-up merge

Note: all the methods can be applied recursively

10

Simple Discretization: Binning

Equal-width (distance) partitioning:
- Divides the range into N intervals of equal size: uniform grid
- If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A) / N
- The most straightforward, but outliers may dominate the presentation
- Skewed data is not handled well

Equal-depth (frequency) partitioning:
- Divides the range into N intervals, each containing approximately the same number of samples
- Good data scaling
- Managing categorical attributes can be tricky

12

Example: Binning Methods for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34

Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29

Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
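A rough Python sketch of the same example (ours, not the slides' code):

    prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

    # equal-depth (frequency) partitioning into N bins
    N = 3
    depth = len(prices) // N
    bins = [prices[i * depth:(i + 1) * depth] for i in range(N)]
    print(bins)  # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

    # smoothing by bin means: every value becomes its bin's (rounded) mean
    print([[round(sum(b) / len(b))] * len(b) for b in bins])
    # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]

    # smoothing by bin boundaries: every value snaps to the nearer boundary
    print([[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins])
    # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]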

13

Discretization by Classification & Correlation Analysis

Classification (e.g., decision tree analysis):
- Supervised: given class labels, e.g., cancerous vs. benign
- Uses entropy to determine the split point (discretization point)
- Top-down, recursive split
- Details to be covered in the "Classification" sessions

14

Chapter 3: Data Preprocessing

Data Preprocessing: An Overview

Data Cleaning

Data Integration

Data Reduction and Transformation

Dimensionality Reduction

Summary

15

Dimensionality Reduction

Curse of dimensionality:
- When dimensionality increases, data becomes increasingly sparse
- Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
- The number of possible combinations of subspaces grows exponentially

Dimensionality reduction: reducing the number of random variables under consideration by obtaining a set of principal variables

Advantages of dimensionality reduction:
- Avoid the curse of dimensionality
- Help eliminate irrelevant features and reduce noise
- Reduce the time and space required in data mining
- Allow easier visualization

18

Dimensionality Reduction Techniques

Dimensionality reduction methodologies:
- Feature selection: find a subset of the original variables (or features, attributes)
- Feature extraction: transform the data in the high-dimensional space to a space of fewer dimensions

Some typical dimensionality reduction methods:
- Principal Component Analysis
- Supervised and nonlinear techniques
- Feature subset selection
- Feature creation

19

Principal Component Analysis (PCA)

PCA: a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components
- The original data are projected onto a much smaller space, resulting in dimensionality reduction
- Method: find the eigenvectors of the covariance matrix; these eigenvectors define the new space
- Motivating example: a ball travels in a straight line, yet data from three cameras recording it contain much redundancy

21

Principal Components Analysis: Intuition

- The goal is to find a projection that captures the largest amount of variation in the data
- Find the eigenvectors of the covariance matrix; the eigenvectors define the new space

(Figure: data points in the (x1, x2) plane, with the principal eigenvector e along the direction of greatest variance)

22

Principal Component Analysis: Details

- Let A be an n × n matrix representing the correlation or covariance of the data. λ is an eigenvalue of A if there exists a non-zero vector v such that Av = λv, often rewritten as (A - λI)v = 0
- In this case, vector v is called an eigenvector of A corresponding to λ. For each eigenvalue λ, the set of all vectors v satisfying Av = λv is called the eigenspace of A corresponding to λ
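Concretely, the components fall out of an eigen-decomposition of the covariance matrix; an illustrative numpy sketch (ours), using synthetic correlated 2-D data:

    import numpy as np

    rng = np.random.default_rng(0)
    t = rng.normal(size=200)
    # points near a line (cf. the ball-on-a-line example): strongly correlated coordinates
    X = np.column_stack([t, 0.5 * t + 0.1 * rng.normal(size=200)])

    Xc = X - X.mean(axis=0)               # center the data
    C = np.cov(Xc, rowvar=False)          # covariance matrix A (2 x 2)
    eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues ascending, eigenvectors as columns
    order = np.argsort(eigvals)[::-1]     # reorder by decreasing variance

    Z = Xc @ eigvecs[:, order[:1]]        # project onto the first principal component
    print(eigvals[order])                 # the first eigenvalue carries almost all the variance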

23

Attribute Subset Selection

Another way to reduce the dimensionality of data.

Redundant attributes: duplicate much or all of the information contained in one or more other attributes
- E.g., the purchase price of a product and the amount of sales tax paid

Irrelevant attributes: contain no information that is useful for the data mining task at hand
- E.g., a student's ID is often irrelevant to the task of predicting his/her GPA

24

Heuristic Search in Attribute Selection

There are 2^d possible attribute combinations of d attributes. Typical heuristic attribute selection methods (a greedy sketch of the step-wise variant follows this list):
- Best single attribute under the attribute independence assumption: choose by significance tests
- Best step-wise feature selection: the best single attribute is picked first, then the next best attribute conditioned on the first, and so on
- Step-wise attribute elimination: repeatedly eliminate the worst attribute
- Best combined attribute selection and elimination
- Optimal branch and bound: use attribute elimination and backtracking
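A hedged sketch of the step-wise (forward) variant; `score` stands in for whatever merit measure is used (significance test, information gain, ...) and is our assumption, not something the slide fixes:

    def forward_select(attributes, score, k):
        """Greedy step-wise selection: repeatedly add the attribute that most
        improves score(selected + [a]), until k attributes are chosen."""
        selected, remaining = [], list(attributes)
        while remaining and len(selected) < k:
            best = max(remaining, key=lambda a: score(selected + [a]))
            selected.append(best)
            remaining.remove(best)
        return selected

    # e.g. forward_select(["age", "income", "student", "credit_rating"], score, k=2)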

25

Attribute Creation (Feature Generation)

Create new attributes (features) that can capture the important information in a data set more effectively than the original ones.

Three general methodologies:
- Attribute extraction: domain-specific
- Mapping data to new space (see data reduction): e.g., Fourier transformation, wavelet transformation, manifold approaches (not covered)
- Attribute construction: combining features (see discriminative frequent patterns in the chapter on "Advanced Classification"); data discretization

26

Summary

Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability

Data cleaning: e.g., missing/noisy values, outliers

Data integration from multiple sources:
- Entity identification problem
- Remove redundancies
- Detect inconsistencies

Data reduction:
- Dimensionality reduction
- Numerosity reduction
- Data compression

Data transformation and data discretization:
- Normalization
- Concept hierarchy generation

27

References

- D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Comm. of ACM, 42:73-78, 1999
- T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003
- T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining Database Structure; Or, How to Build a Data Quality Browser. SIGMOD'02
- H. V. Jagadish et al. Special Issue on Data Reduction Techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), Dec. 1997
- D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999
- E. Rahm and H. H. Do. Data Cleaning: Problems and Current Approaches. IEEE Bulletin of the Technical Committee on Data Engineering, Vol. 23, No. 4
- V. Raman and J. Hellerstein. Potter's Wheel: An Interactive Framework for Data Cleaning and Transformation. VLDB'2001
- T. Redman. Data Quality: Management and Technology. Bantam Books, 1992
- R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans. Knowledge and Data Engineering, 7:623-640, 1995

CS 412 INTRO. TO DATA MINING

Classification: Basic Concepts
Huan Sun, CSE@The Ohio State University

09/05/2017

28
Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han

29

Classification: Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy: Ensemble Methods

Summary

30

Supervised vs. Unsupervised Learning

Supervised learning (classification):
- Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
- New data is classified based on the training set

Unsupervised learning (clustering):
- The class labels of the training data are unknown
- Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

32

Prediction Problems: Classification vs. Numeric Prediction

Classification:
- Predicts categorical class labels (discrete or nominal)
- Classifies data: constructs a model based on the training set and the values (class labels) of a classifying attribute, then uses it to classify new data

Numeric prediction:
- Models continuous-valued functions, i.e., predicts unknown or missing values

Typical applications:
- Credit/loan approval
- Medical diagnosis: is a tumor cancerous or benign?
- Fraud detection: is a transaction fraudulent?
- Web page categorization: which category does a page belong to?

34

Classification: A Two-Step Process

(1) Model construction: describing a set of predetermined classes
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
- The set of tuples used for model construction is the training set
- The model is represented as classification rules, decision trees, or mathematical formulae

(2) Model usage: classifying future or unknown objects
- Estimate the accuracy of the model: the known label of each test sample is compared with the model's classified result; accuracy is the percentage of test-set samples that are correctly classified by the model
- The test set is independent of the training set (otherwise, overfitting)
- If the accuracy is acceptable, use the model to classify new data

Note: if the test set is used to select or refine models, it is called a validation (test) set or development test set

37

Step (1): Model Construction

Training data:

    NAME  RANK            YEARS  TENURED
    Mike  Assistant Prof  3      no
    Mary  Assistant Prof  7      yes
    Bill  Professor       2      yes
    Jim   Associate Prof  7      yes
    Dave  Assistant Prof  6      no
    Anne  Associate Prof  3      no

Running a classification algorithm over the training data yields the classifier (model):

    IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

39

Step (2): Using the Model in Prediction

The classifier is first run against testing data to estimate its accuracy:

    NAME     RANK            YEARS  TENURED
    Tom      Assistant Prof  2      no
    Merlisa  Associate Prof  7      no
    George   Professor       5      yes
    Joseph   Assistant Prof  7      yes

If the accuracy is acceptable, the classifier is applied to new, unseen data:

    (Jeff, Professor, 4)  ->  Tenured?
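Both steps can be made concrete with the rule learned above; a toy Python sketch (ours, mirroring the slide's rule):

    def classifier(rank, years):
        """Model from step (1): IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
        return "yes" if rank == "Professor" or years > 6 else "no"

    test = [("Tom", "Assistant Prof", 2, "no"), ("Merlisa", "Associate Prof", 7, "no"),
            ("George", "Professor", 5, "yes"), ("Joseph", "Assistant Prof", 7, "yes")]
    correct = sum(classifier(rank, years) == label for _, rank, years, label in test)
    print(correct / len(test))         # 0.75: Merlisa is misclassified by the rule

    print(classifier("Professor", 4))  # step (2), new data (Jeff, Professor, 4) -> 'yes'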

41

Classification: Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy: Ensemble Methods

Summary

42

Decision Tree Induction: An Example

Training data set: Buys_computer (the data set follows the example of Quinlan's ID3, "Playing Tennis"):

    age    income  student  credit_rating  buys_computer
    <=30   high    no       fair           no
    <=30   high    no       excellent      no
    31…40  high    no       fair           yes
    >40    medium  no       fair           yes
    >40    low     yes      fair           yes
    >40    low     yes      excellent      no
    31…40  low     yes      excellent      yes
    <=30   medium  no       fair           no
    <=30   low     yes      fair           yes
    >40    medium  yes      fair           yes
    <=30   medium  yes      excellent      yes
    31…40  medium  no       excellent      yes
    31…40  high    yes      fair           yes
    >40    medium  no       excellent      no

Resulting tree: the root tests age; the branch age = 31…40 predicts yes; the branch age <= 30 goes on to test student (no -> no, yes -> yes); and the branch age > 40 goes on to test credit_rating (excellent -> no, fair -> yes).
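The resulting tree is small enough to encode directly; a short Python sketch (the nested-tuple encoding is ours):

    # internal nodes are (attribute, branches); leaves are class labels
    TREE = ("age", {
        "31…40": "yes",
        "<=30": ("student", {"no": "no", "yes": "yes"}),
        ">40": ("credit_rating", {"fair": "yes", "excellent": "no"}),
    })

    def predict(tree, x):
        """Walk down the tree until a leaf (a plain label) is reached."""
        while isinstance(tree, tuple):
            attr, branches = tree
            tree = branches[x[attr]]
        return tree

    x = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}
    print(predict(TREE, x))  # 'yes'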

44

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner
- At the start, all the training examples are at the root
- Attributes are categorical (if continuous-valued, they are discretized in advance)
- Examples are partitioned recursively based on selected attributes
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Conditions for stopping partitioning:
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning; majority voting is then employed for classifying the leaf
- There are no samples left

46

Brief Review of Entropy

Entropy (information theory):
- A measure of the uncertainty associated with a random variable
- Calculation: for a discrete random variable Y taking m distinct values {y1, y2, …, ym} with pi = P(Y = yi), H(Y) = - sum_{i=1..m} pi log2(pi)
- Interpretation: higher entropy -> higher uncertainty; lower entropy -> lower uncertainty
- Conditional entropy: H(Y | X) = sum_x P(X = x) H(Y | X = x)

(Figure: the binary case m = 2, where H peaks at 1 bit when the two values are equally likely)
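For that binary case, entropy as a function of p = P(Y = y1) is easy to check numerically; a quick Python sketch (ours):

    import math

    def H(p):
        """Binary entropy -p log2 p - (1 - p) log2 (1 - p), in bits."""
        return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    print(H(0.5))            # 1.0: a fair coin is maximally uncertain
    print(round(H(0.9), 3))  # 0.469: a nearly pure distribution has low entropy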

47

Attribute Selection Measure: Information Gain (ID3/C4.5)

- Select the attribute with the highest information gain
- Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |C_{i,D}| / |D|
- Expected information (entropy) needed to classify a tuple in D:

    Info(D) = - sum_{i=1..m} pi log2(pi)

- Information needed (after using A to split D into v partitions) to classify D:

    Info_A(D) = sum_{j=1..v} (|Dj| / |D|) * Info(Dj)

- Information gained by branching on attribute A:

    Gain(A) = Info(D) - Info_A(D)

48

Attribute Selection: Information Gain

Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples). How to select the first attribute?

Info(D) = I(9, 5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Look at "age":

    age    pi  ni  I(pi, ni)
    <=30   2   3   0.971
    31…40  4   0   0
    >40    3   2   0.971

Info_age(D) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.694

where (5/14) I(2, 3) reflects that "age <= 30" has 5 out of the 14 samples, with 2 yes'es and 3 no's. Hence:

Gain(age) = Info(D) - Info_age(D) = 0.246

Similarly: Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048

Age gives the largest gain and becomes the root split; a recursive sketch follows.
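Recursing with the same criterion reproduces the tree shown earlier; a compact ID3 sketch (ours), reusing DATA, ATTRS and gain from the earlier gain-computation snippet:

    def id3(rows, attrs):
        """Recursive ID3: stop on a pure node, or with a majority vote when no
        attributes remain; otherwise split on the highest-gain attribute."""
        labels = [r[-1] for r in rows]
        if len(set(labels)) == 1:      # all samples in one class
            return labels[0]
        if not attrs:                  # no attributes left: majority voting
            return max(set(labels), key=labels.count)
        best = max(attrs, key=lambda a: gain(rows, ATTRS[a]))
        branches = {}
        for value in {r[ATTRS[best]] for r in rows}:
            subset = [r for r in rows if r[ATTRS[best]] == value]
            branches[value] = id3(subset, [a for a in attrs if a != best])
        return (best, branches)

    print(id3(DATA, list(ATTRS)))
    # roughly: ('age', {'31…40': 'yes', '<=30': ('student', ...), '>40': ('credit_rating', ...)})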


9

Data Discretization Methods

Binning: top-down split, unsupervised

Histogram analysis: top-down split, unsupervised

Clustering analysis: unsupervised, top-down split or bottom-up merge

Decision-tree analysis: supervised, top-down split

Correlation (e.g., χ²) analysis: unsupervised, bottom-up merge

Note: All of these methods can be applied recursively

10

Simple Discretization Binning

Equal-width (distance) partitioning

Divides the range into N intervals of equal size (uniform grid)

If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N

The most straightforward approach, but outliers may dominate the presentation

Skewed data is not handled well

11

Simple Discretization Binning

Equal-width (distance) partitioning

Divides the range into N intervals of equal size (uniform grid)

If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N

The most straightforward approach, but outliers may dominate the presentation

Skewed data is not handled well

Equal-depth (frequency) partitioning

Divides the range into N intervals, each containing approximately the same number of samples

Good data scaling

Managing categorical attributes can be tricky
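To make the two partitioning schemes concrete, here is a minimal Python/NumPy sketch; the price list and N = 3 are illustrative assumptions rather than course data:

import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])  # already sorted
N = 3

# Equal-width: N intervals of width W = (B - A) / N over [A, B].
A, B = prices.min(), prices.max()
cuts = A + (B - A) / N * np.arange(1, N)   # interior boundaries: [14., 24.]
width_bins = np.digitize(prices, cuts)     # bin index per value: [0 0 0 1 1 1 2 2 2 2 2 2]

# Equal-depth: N bins holding ~len(prices) / N values each.
depth_bins = np.array_split(prices, N)     # [4 8 9 15], [21 21 24 25], [26 28 29 34]

Note how the equal-width bins put nine of the twelve values into the top interval, while the equal-depth bins hold four values each; this is the skew sensitivity mentioned above.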

12

Example Binning Methods for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34

Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29

Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
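The smoothing step can be reproduced with a short sketch over the three equi-depth bins above (NumPy is assumed purely for brevity):

import numpy as np

bins = [np.array([4, 8, 9, 15]), np.array([21, 21, 24, 25]), np.array([26, 28, 29, 34])]

# Smoothing by bin means: every value becomes its bin's (rounded) mean.
means = [np.full_like(b, round(b.mean())) for b in bins]
# -> [9 9 9 9], [23 23 23 23], [29 29 29 29]

# Smoothing by bin boundaries: each value snaps to the nearer of its bin's min or max.
bounds = [np.where(b - b.min() <= b.max() - b, b.min(), b.max()) for b in bins]
# -> [4 4 4 15], [21 21 25 25], [26 26 26 34]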

13

Discretization by Classification & Correlation Analysis

Classification (e.g., decision tree analysis)

Supervised: given class labels, e.g., cancerous vs. benign

Using entropy to determine the split point (discretization point)

Top-down, recursive split

Details to be covered in the "Classification" sessions

14

Chapter 3 Data Preprocessing

Data Preprocessing An Overview

Data Cleaning

Data Integration

Data Reduction and Transformation

Dimensionality Reduction

Summary

15

Dimensionality Reduction

Curse of dimensionality: When dimensionality increases, data becomes increasingly sparse. Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful, and the number of possible subspace combinations grows exponentially.

16

Dimensionality Reduction

Curse of dimensionality: When dimensionality increases, data becomes increasingly sparse. Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful, and the number of possible subspace combinations grows exponentially.

Dimensionality reduction: Reducing the number of random variables under consideration by obtaining a set of principal variables

17

Dimensionality Reduction

Curse of dimensionality: When dimensionality increases, data becomes increasingly sparse. Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful, and the number of possible subspace combinations grows exponentially.

Dimensionality reduction: Reducing the number of random variables under consideration by obtaining a set of principal variables

Advantages of dimensionality reduction:
- Avoid the curse of dimensionality
- Help eliminate irrelevant features and reduce noise
- Reduce the time and space required in data mining
- Allow easier visualization

18

Dimensionality Reduction Techniques

Dimensionality reduction methodologies

Feature selection: Find a subset of the original variables (features, attributes)

Feature extraction: Transform the data in the high-dimensional space to a space of fewer dimensions

Some typical dimensionality reduction methods

Principal Component Analysis

Supervised and nonlinear techniques

Feature subset selection

Feature creation

19

Principal Component Analysis (PCA)

PCA: A statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components

The original data are projected onto a much smaller space, resulting in dimensionality reduction

Method: Find the eigenvectors of the covariance matrix; these eigenvectors define the new space

(Figure: a ball travels in a straight line, yet data from three cameras contain much redundancy.)

21

Principal Components Analysis Intuition

Goal is to find a projection that captures the largest amount of variation in data

Find the eigenvectors of the covariance matrix The eigenvectors define the new space

(Figure: data points in the x1-x2 plane; the principal eigenvector e points along the direction of greatest variance.)

22

Principal Component Analysis Details

Let A be an n × n matrix representing the correlation or covariance of the data. λ is an eigenvalue of A if there exists a non-zero vector v such that

Av = λv, often rewritten as (A - λI)v = 0

In this case, vector v is called an eigenvector of A corresponding to λ. For each eigenvalue λ, the set of all vectors v satisfying Av = λv is called the eigenspace of A corresponding to λ.
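The whole procedure fits in a few lines of NumPy; the synthetic 3-D data and the choice of keeping two components below are illustrative assumptions, not part of the slides:

import numpy as np

# Illustrative synthetic data: 100 points in 3-D with correlated coordinates.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.array([[3.0, 0, 0], [1.0, 1.0, 0], [0, 0, 0.1]])

Xc = X - X.mean(axis=0)               # center the data
A = np.cov(Xc, rowvar=False)          # 3x3 covariance matrix
lam, V = np.linalg.eigh(A)            # solves A v = lambda v (A is symmetric)
order = np.argsort(lam)[::-1]         # eigenvalues in decreasing order
W = V[:, order[:2]]                   # top-2 eigenvectors define the new space
Z = Xc @ W                            # projected data: 100 x 2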

23

Attribute Subset Selection

Another way to reduce dimensionality of data

Redundant attributes: duplicate much or all of the information contained in one or more other attributes

E.g., the purchase price of a product and the amount of sales tax paid

Irrelevant attributes: contain no information that is useful for the data mining task at hand

E.g., a student's ID is often irrelevant to the task of predicting his/her GPA

24

Heuristic Search in Attribute Selection

There are 2^d possible attribute combinations of d attributes. Typical heuristic attribute selection methods (a greedy forward-selection sketch follows this list):

Best single attribute under the attribute independence assumption: choose by significance tests

Best step-wise feature selection: the best single attribute is picked first; then the next best attribute conditioned on the first; and so on

Step-wise attribute elimination: repeatedly eliminate the worst attribute

Best combined attribute selection and elimination

Optimal branch and bound: use attribute elimination and backtracking
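As promised above, a greedy forward selector might look like the following; `score` is a hypothetical caller-supplied evaluation function (e.g., cross-validated accuracy), so this is an illustration rather than the course's algorithm:

def forward_select(attributes, score):
    """Best step-wise selection: repeatedly add whichever attribute improves
    the score most; stop when no remaining attribute helps."""
    selected, best = [], float("-inf")
    while len(selected) < len(attributes):
        candidates = {a: score(selected + [a]) for a in attributes if a not in selected}
        attr, s = max(candidates.items(), key=lambda kv: kv[1])
        if s <= best:   # no attribute improves the current subset: stop
            break
        selected.append(attr)
        best = s
    return selected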

25

Attribute Creation (Feature Generation)

Create new attributes (features) that can capture the important information in a data set more effectively than the original ones

Three general methodologies:

Attribute extraction: domain-specific

Mapping data to new space (see data reduction): e.g., Fourier transformation, wavelet transformation, manifold approaches (not covered)

Attribute construction: combining features (see discriminative frequent patterns in the chapter on "Advanced Classification"); data discretization

26

Summary

Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability

Data cleaning: e.g., missing/noisy values, outliers

Data integration from multiple sources:

Entity identification problem; remove redundancies; detect inconsistencies

Data reduction:

Dimensionality reduction; numerosity reduction; data compression

Data transformation and data discretization:

Normalization; concept hierarchy generation

27

References

D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Comm. of ACM, 42:73-78, 1999

T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003

T. Dasu, T. Johnson, S. Muthukrishnan, V. Shkapenyuk. Mining Database Structure; Or, How to Build a Data Quality Browser. SIGMOD'02

H. V. Jagadish et al. Special Issue on Data Reduction Techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), Dec. 1997

D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999

E. Rahm and H. H. Do. Data Cleaning: Problems and Current Approaches. IEEE Bulletin of the Technical Committee on Data Engineering, Vol. 23, No. 4

V. Raman and J. Hellerstein. Potter's Wheel: An Interactive Framework for Data Cleaning and Transformation. VLDB'2001

T. Redman. Data Quality: Management and Technology. Bantam Books, 1992

R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans. Knowledge and Data Engineering, 7:623-640, 1995

CS 412 INTRO TO DATA MINING

Classification: Basic Concepts. Huan Sun, CSE@The Ohio State University

09/05/2017

28

Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han

29

Classification: Basic Concepts

Classification: Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

30

Supervised vs. Unsupervised Learning

Supervised learning (classification)

Supervision: The training data (observations, measurements, etc.) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

31

Supervised vs. Unsupervised Learning

Supervised learning (classification)

Supervision: The training data (observations, measurements, etc.) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

Unsupervised learning (clustering)

The class labels of the training data are unknown

Given a set of measurements, observations, etc., with the aim of establishing the

existence of classes or clusters in the data

32

Prediction Problems: Classification vs. Numeric Prediction

Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions, i.e., predicts unknown or missing values

33

Prediction Problems: Classification vs. Numeric Prediction

Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions, i.e., predicts unknown or missing values

Typical applications

Credit/loan approval

Medical diagnosis: if a tumor is cancerous or benign

Fraud detection: if a transaction is fraudulent

Web page categorization: which category it is

34

Classification: A Two-Step Process

(1) Model construction: describing a set of predetermined classes

Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute

The set of tuples used for model construction is the training set; the model is represented as classification rules, decision trees, or mathematical formulae

35

Classification: A Two-Step Process

(1) Model construction: describing a set of predetermined classes

Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute

The set of tuples used for model construction is the training set; the model is represented as classification rules, decision trees, or mathematical formulae

(2) Model usage: for classifying future or unknown objects

Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model; accuracy is the % of test-set samples correctly classified by the model; the test set is independent of the training set (otherwise, overfitting)

If the accuracy is acceptable, use the model to classify new data

36

Classification: A Two-Step Process

(1) Model construction: describing a set of predetermined classes

Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute

The set of tuples used for model construction is the training set; the model is represented as classification rules, decision trees, or mathematical formulae

(2) Model usage: for classifying future or unknown objects

Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model; accuracy is the % of test-set samples correctly classified by the model; the test set is independent of the training set (otherwise, overfitting)

If the accuracy is acceptable, use the model to classify new data

Note: If the test set is used to select/refine models, it is called a validation (test) set or development test set

37

Step (1) Model Construction

Training Data:

NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no

(Diagram: Training Data -> Classification Algorithms -> Classifier (Model))

38

Step (1) Model Construction

Training Data: (as above)

(Diagram: Training Data -> Classification Algorithms -> Classifier (Model))

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'

39

Step (2) Using the Model in Prediction

Classifier

Testing Data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

40

Step (2) Using the Model in Prediction

Classifier

Testing Data: (as above)

New / Unseen Data:

(Jeff, Professor, 4)

Tenured?
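To make the two steps concrete, here is a small Python sketch (not from the slides) that applies the rule learned in step (1) to the testing data and estimates accuracy as in step (2):

# Rule learned in step (1): IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def predict(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data from the slide: (name, rank, years, true label)
test = [("Tom", "Assistant Prof", 2, "no"),
        ("Merlisa", "Associate Prof", 7, "no"),
        ("George", "Professor", 5, "yes"),
        ("Joseph", "Assistant Prof", 7, "yes")]

correct = sum(predict(rank, years) == label for _, rank, years, label in test)
print(correct / len(test))      # 0.75: Merlisa (7 years, not tenured) is misclassified

# If that accuracy is acceptable, classify the new, unseen tuple (Jeff, Professor, 4):
print(predict("Professor", 4))  # -> yes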

41

Classification Basic Concepts

Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

42

Decision Tree Induction: An Example

age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31…40    high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31…40    low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31…40    medium  no       excellent      yes
31…40    high    yes      fair           yes
>40      medium  no       excellent      no

Training data set: Buys_computer. The data set follows an example of Quinlan's ID3 (Playing Tennis).

43

Decision Tree Induction: An Example

(buys_computer training data as above)

Training data set: Buys_computer. The data set follows an example of Quinlan's ID3 (Playing Tennis). Resulting tree:

age?
├─ <=30  -> student?
│           ├─ no  -> no
│           └─ yes -> yes
├─ 31…40 -> yes
└─ >40   -> credit_rating?
            ├─ excellent -> no
            └─ fair      -> yes
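As an illustrative aside (not part of the original slides), a tree can be fit on this table with scikit-learn's entropy criterion; note that sklearn builds binary CART-style splits rather than ID3's multiway splits, so the printed tree differs in shape from the one above:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rows = [("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
        ("31…40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
        (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
        ("31…40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
        ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
        ("<=30", "medium", "yes", "excellent", "yes"), ("31…40", "medium", "no", "excellent", "yes"),
        ("31…40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no")]
df = pd.DataFrame(rows, columns=["age", "income", "student", "credit_rating", "buys_computer"])

X = pd.get_dummies(df.drop(columns="buys_computer"))   # one-hot encode: trees need numbers
y = df["buys_computer"]
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))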

44

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm):

Tree is constructed in a top-down, recursive, divide-and-conquer manner

At start, all the training examples are at the root

Attributes are categorical (if continuous-valued, they are discretized in advance)

Examples are partitioned recursively based on selected attributes

Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

45

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm):

Tree is constructed in a top-down, recursive, divide-and-conquer manner

At start, all the training examples are at the root

Attributes are categorical (if continuous-valued, they are discretized in advance)

Examples are partitioned recursively based on selected attributes

Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Conditions for stopping partitioning:

All samples for a given node belong to the same class

There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf

There are no samples left

46

Brief Review of Entropy

Entropy (Information Theory):

A measure of uncertainty associated with a random variable

Calculation: For a discrete random variable Y taking m distinct values {y1, y2, …, ym}, with pi = P(Y = yi):

H(Y) = - Σ_{i=1..m} pi log2(pi)

Interpretation: higher entropy -> higher uncertainty; lower entropy -> lower uncertainty

Conditional entropy: H(Y | X) = Σ_x P(X = x) · H(Y | X = x)

(Figure: binary entropy curve for m = 2, maximized at p = 0.5.)
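A tiny self-contained Python sketch (illustrative, not from the slides) of the entropy calculation, applied to the 9-yes / 5-no class distribution used in the upcoming example:

import math
from collections import Counter

def entropy(values):
    """H(Y) = -sum_i p_i * log2(p_i) over the distinct values y_i of Y."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

labels = ["yes"] * 9 + ["no"] * 5   # the buys_computer class column used below
print(round(entropy(labels), 3))    # 0.94, i.e., the Info(D) computed on the next slides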

47

Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain. Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|

Expected information (entropy) needed to classify a tuple in D:

Info(D) = - Σ_{i=1..m} pi log2(pi)

Information needed (after using A to split D into v partitions) to classify D:

Info_A(D) = Σ_{j=1..v} (|Dj| / |D|) × Info(Dj)

Information gained by branching on attribute A:

Gain(A) = Info(D) - Info_A(D)
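These three formulas translate directly into code; the following minimal sketch is an illustration, not the course's implementation:

import math
from collections import Counter

def info(labels):
    """Info(D) = -sum_i p_i * log2(p_i), the expected information (entropy)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(labels, partitions):
    """Gain(A) = Info(D) - Info_A(D); partitions = the label lists D_j after splitting on A."""
    n = len(labels)
    info_a = sum(len(part) / n * info(part) for part in partitions)
    return info(labels) - info_a

# Splitting the 14 labels by age: <=30 -> (2 yes, 3 no), 31…40 -> (4 yes, 0 no), >40 -> (3 yes, 2 no)
d = ["yes"] * 9 + ["no"] * 5
by_age = [["yes"] * 2 + ["no"] * 3, ["yes"] * 4, ["yes"] * 3 + ["no"] * 2]
print(round(gain(d, by_age), 3))   # 0.246, matching Gain(age) on the coming slides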

48

Attribute Selection: Information Gain

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

(buys_computer training data as above)

How to select the first attribute?

49

Attribute Selection: Information Gain

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

(buys_computer training data as above)

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

50

Attribute Selection: Information Gain

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

(buys_computer training data as above)

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Look at "age":

age     pi  ni  I(pi, ni)
<=30    2   3   0.971
31…40   4   0   0
>40     3   2   0.971

51

Attribute Selection: Information Gain

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

(buys_computer training data as above)

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Look at "age":

age     pi  ni  I(pi, ni)
<=30    2   3   0.971
31…40   4   0   0
>40     3   2   0.971

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

52

Attribute Selection: Information Gain

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

Look at "age":

age     pi  ni  I(pi, ni)
<=30    2   3   0.971
31…40   4   0   0
>40     3   2   0.971

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Here (5/14) I(2,3) means "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's.

53

Attribute Selection: Information Gain

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

(buys_computer training data as above)

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Gain(age) = Info(D) - Info_age(D) = 0.246

54

Attribute Selection: Information Gain

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

(buys_computer training data as above)

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Gain(age) = Info(D) - Info_age(D) = 0.246

Similarly:

Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

Since Gain(age) = 0.246 is the largest, age is selected as the first splitting attribute.
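As a closing check, the following sketch (illustrative, not course code) computes all four gains from the training table and reproduces the numbers above:

import math
from collections import Counter

# (age, income, student, credit_rating, buys_computer)
rows = [("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
        ("31…40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
        (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
        ("31…40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
        ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
        ("<=30", "medium", "yes", "excellent", "yes"), ("31…40", "medium", "no", "excellent", "yes"),
        ("31…40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no")]

def info(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(col):
    labels = [r[-1] for r in rows]
    groups = {}                       # partition the class labels by the values of column `col`
    for r in rows:
        groups.setdefault(r[col], []).append(r[-1])
    info_a = sum(len(g) / len(rows) * info(g) for g in groups.values())
    return info(labels) - info_a

for col, name in enumerate(["age", "income", "student", "credit_rating"]):
    print(name, round(gain(col), 3))
# -> age 0.246, income 0.029, student 0.151, credit_rating 0.048: split on age first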

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Page 10: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

10

Simple Discretization Binning

Equal-width (distance) partitioning

Divides the range into N intervals of equal size uniform grid

if A and B are the lowest and highest values of the attribute the width of intervals will be W = (B ndashA)N

The most straightforward but outliers may dominate presentation

Skewed data is not handled well

11

Simple Discretization Binning

Equal-width (distance) partitioning

Divides the range into N intervals of equal size uniform grid

if A and B are the lowest and highest values of the attribute the width of intervals will be W = (B ndashA)N

The most straightforward but outliers may dominate presentation

Skewed data is not handled well

Equal-depth (frequency) partitioning

Divides the range into N intervals each containing approximately same number of samples

Good data scaling

Managing categorical attributes can be tricky

12

Example Binning Methods for Data Smoothing

Sorted data for price (in dollars) 4 8 9 15 21 21 24 25 26 28 29 34 Partition into equal-frequency (equi-width) bins

- Bin 1 4 8 9 15- Bin 2 21 21 24 25- Bin 3 26 28 29 34

Smoothing by bin means- Bin 1 9 9 9 9- Bin 2 23 23 23 23- Bin 3 29 29 29 29

Smoothing by bin boundaries- Bin 1 4 4 4 15- Bin 2 21 21 25 25- Bin 3 26 26 26 34

13

Discretization by Classification amp Correlation Analysis

Classification (eg decision tree analysis)

Supervised Given class labels eg cancerous vs benign

Using entropy to determine split point (discretization point)

Top-down recursive split

Details to be covered in ldquoClassificationrdquo sessions

14

Chapter 3 Data Preprocessing

Data Preprocessing An Overview

Data Cleaning

Data Integration

Data Reduction and Transformation

Dimensionality Reduction

Summary

15

Dimensionality Reduction

Curse of dimensionality When dimensionality increases data becomes increasingly sparse Density and distance between points which is critical to clustering outlier analysis

becomes less meaningful The possible combinations of subspaces will grow exponentially

16

Dimensionality Reduction

Curse of dimensionality When dimensionality increases data becomes increasingly sparse Density and distance between points which is critical to clustering outlier analysis

becomes less meaningful The possible combinations of subspaces will grow exponentially

Dimensionality reduction Reducing the number of random variables under consideration via obtaining a set

of principal variables

17

Dimensionality Reduction

Curse of dimensionality When dimensionality increases data becomes increasingly sparse Density and distance between points which is critical to clustering outlier analysis becomes less

meaningful The possible combinations of subspaces will grow exponentially

Dimensionality reduction Reducing the number of random variables under consideration via obtaining a set of principal

variables

Advantages of dimensionality reduction Avoid the curse of dimensionality Help eliminate irrelevant features and reduce noise Reduce time and space required in data mining Allow easier visualization

18

Dimensionality Reduction Techniques

Dimensionality reduction methodologies

Feature selection Find a subset of the original variables (or features attributes)

Feature extraction Transform the data in the high-dimensional space to a space of fewer dimensions

Some typical dimensionality reduction methods

Principal Component Analysis

Supervised and nonlinear techniques

Feature subset selection

Feature creation

19

PCA A statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components

The original data are projected onto a much smaller space resulting in dimensionality reduction

Method Find the eigenvectors of the covariance matrix and these eigenvectors define the new space

Ball travels in a straight line Data from three cameras contain much redundancy

Principal Component Analysis (PCA)

21

Principal Components Analysis Intuition

Goal is to find a projection that captures the largest amount of variation in data

Find the eigenvectors of the covariance matrix The eigenvectors define the new space

x2

x1

e

22

Principal Component Analysis Details

Let A be an n timesn matrix representing the correlation or covariance of the data λ is an eigenvalue of A if there exists a non-zero vector v such that

Av = λ v often rewritten as (A- λI)v=0

In this case vector v is called an eigenvector of A corresponding to λ For each eigenvalue λ the set of all vectors v satisfying Av = λ v is called the eigenspace of A corresponding to λ

23

Attribute Subset Selection

Another way to reduce dimensionality of data

Redundant attributes Duplicate much or all of the information contained in

one or more other attributes

Eg purchase price of a product and the amount of sales tax paid

Irrelevant attributes Contain no information that is useful for the data

mining task at hand

Ex A studentrsquos ID is often irrelevant to the task of predicting hisher GPA

24

Heuristic Search in Attribute Selection

There are 2d possible attribute combinations of d attributes Typical heuristic attribute selection methods

Best single attribute under the attribute independence assumption choose by significance tests

Best step-wise feature selection The best single-attribute is picked first Then next best attribute condition to the first

Step-wise attribute elimination Repeatedly eliminate the worst attribute

Best combined attribute selection and elimination Optimal branch and bound Use attribute elimination and backtracking

25

Attribute Creation (Feature Generation)

Create new attributes (features) that can capture the important information in a data set more effectively than the original ones

Three general methodologies Attribute extraction Domain-specific

Mapping data to new space (see data reduction) Eg Fourier transformation wavelet transformation manifold approaches (not covered)

Attribute construction Combining features (see discriminative frequent patterns in Chapter on ldquoAdvanced

Classificationrdquo) Data discretization

26

Summary

Data quality accuracy completeness consistency timeliness believability interpretability

Data cleaning eg missingnoisy values outliers

Data integration from multiple sources

Entity identification problem Remove redundancies Detect inconsistencies

Data reduction

Dimensionality reduction Numerosity reduction Data compression

Data transformation and data discretization

Normalization Concept hierarchy generation

27

D P Ballou and G K Tayi Enhancing data quality in data warehouse environments Comm of ACM 4273-78 1999

T Dasu and T Johnson Exploratory Data Mining and Data Cleaning John Wiley 2003 T Dasu T Johnson S Muthukrishnan V Shkapenyuk Mining Database Structure Or How to Build a Data

Quality Browser SIGMODrsquo02 H V Jagadish et al Special Issue on Data Reduction Techniques Bulletin of the Technical Committee on

Data Engineering 20(4) Dec 1997 D Pyle Data Preparation for Data Mining Morgan Kaufmann 1999 E Rahm and H H Do Data Cleaning Problems and Current Approaches IEEE Bulletin of the Technical

Committee on Data Engineering Vol23 No4 V Raman and J Hellerstein Potters Wheel An Interactive Framework for Data Cleaning and

Transformation VLDBrsquo2001 T Redman Data Quality Management and Technology Bantam Books 1992 R Wang V Storey and C Firth A framework for analysis of data quality research IEEE Trans

Knowledge and Data Engineering 7623-640 1995

References

CS 412 INTRO TO DATA MINING

Classification Basic Concepts Huan Sun CSEThe Ohio State University

09052017

28Slides adapted from UIUC CS412 Fall 2017 by Prof Jiawei Han

29

Classification Basic Concepts Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

30

Supervised vs Unsupervised Learning Supervised learning (classification)

Supervision The training data (observations measurements etc) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

31

Supervised vs Unsupervised Learning Supervised learning (classification)

Supervision The training data (observations measurements etc) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

Unsupervised learning (clustering)

The class labels of training data is unknown

Given a set of measurements observations etc with the aim of establishing the

existence of classes or clusters in the data

32

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

33

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

Typical applications

Creditloan approval

Medical diagnosis if a tumor is cancerous or benign

Fraud detection if a transaction is fraudulent

Web page categorization which category it is

34

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

35

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

36

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

Note If the test set is used to selectrefine models it is called validation (test) set or development test set

37

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

Classifier(Model)

Sheet1

38

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo

Classifier(Model)

Sheet1

39

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Sheet1

40

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

NewUnseen Data

(Jeff Professor 4)

Tenured

Sheet1

41

Classification Basic Concepts

Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

42

Decision Tree Induction An Example

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis)

Sheet1

43

Decision Tree Induction An Example

age

overcast

student credit rating

lt=30 gt40

no yes yes

yes

3140

fairexcellentyesno

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis) Resulting tree

Sheet1

44

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain)

45

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain) Conditions for stopping partitioning

All samples for a given node belong to the same class There are no remaining attributes for further partitioningmdashmajority voting is

employed for classifying the leaf There are no samples left

46

Brief Review of Entropy Entropy (Information Theory)

A measure of uncertainty associated with a random number Calculation For a discrete random variable Y taking m distinct values y1 y2 hellip ym

Interpretation Higher entropy rarr higher uncertainty Lower entropy rarr lower uncertainty

Conditional entropy

m = 2

47

Attribute Selection Measure Information Gain (ID3C45)

Select the attribute with the highest information gain Let pi be the probability that an arbitrary tuple in D belongs to class Ci

estimated by |Ci D||D| Expected information (entropy) needed to classify a tuple in D

Information needed (after using A to split D into v partitions) to classify D

Information gained by branching on attribute A

)(log)( 21

i

m

ii ppDInfo sum

=

minus=

)(||||

)(1

j

v

j

jA DInfo

DD

DInfo times=sum=

(D)InfoInfo(D)Gain(A) Aminus=

48

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

How to select the first attribute

Sheet1

49

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

Sheet1

50

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

Sheet1

51

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

Sheet1

52

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

means ldquoage lt=30rdquo has 5 out of 14 samples with 2 yesrsquoes and 3 norsquos

)32(145 I

53

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Sheet1

54

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Similarly

0480)_(1510)(0290)(

===

ratingcreditGainstudentGainincomeGain How

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no

11

Simple Discretization: Binning

Equal-width (distance) partitioning:

Divides the range into N intervals of equal size (uniform grid)

If A and B are the lowest and highest values of the attribute, the width of each interval will be W = (B − A) / N

The most straightforward approach, but outliers may dominate the presentation

Skewed data is not handled well

Equal-depth (frequency) partitioning:

Divides the range into N intervals, each containing approximately the same number of samples

Good data scaling

Managing categorical attributes can be tricky
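To make the two schemes concrete, here is a minimal sketch in Python (NumPy assumed available); the twelve price values are borrowed from the example on the next slide.

import numpy as np

data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
N = 3  # number of bins

# Equal-width: each bin spans (B - A) / N units of the value range
width = (data.max() - data.min()) / N          # (34 - 4) / 3 = 10
edges = data.min() + width * np.arange(1, N)   # interior cut points: 14, 24
equal_width_ids = np.searchsorted(edges, data, side='right')

# Equal-depth: each bin gets ~len(data) / N of the sorted samples
equal_depth_bins = np.array_split(np.sort(data), N)

print(equal_width_ids)                   # bin index for every value
print([b.tolist() for b in equal_depth_bins])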

12

Example Binning Methods for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34

Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29

Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
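A small pure-Python sketch of both smoothing rules, applied to the equi-depth bins above (bin means rounded to integers, as on the slide):

bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

# Smoothing by bin means: every value is replaced by its bin's mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the nearer boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]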

13

Discretization by Classification & Correlation Analysis

Classification (e.g., decision tree analysis):

Supervised: given class labels, e.g., cancerous vs. benign

Uses entropy to determine the split point (discretization point)

Top-down, recursive split

Details to be covered in the "Classification" sessions

14

Chapter 3 Data Preprocessing

Data Preprocessing An Overview

Data Cleaning

Data Integration

Data Reduction and Transformation

Dimensionality Reduction

Summary

15

Dimensionality Reduction

Curse of dimensionality:
- When dimensionality increases, data becomes increasingly sparse
- Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
- The possible combinations of subspaces grow exponentially

Dimensionality reduction: reducing the number of random variables under consideration by obtaining a set of principal variables

Advantages of dimensionality reduction:
- Avoid the curse of dimensionality
- Help eliminate irrelevant features and reduce noise
- Reduce the time and space required in data mining
- Allow easier visualization

18

Dimensionality Reduction Techniques

Dimensionality reduction methodologies:

Feature selection: find a subset of the original variables (or features, attributes)

Feature extraction: transform the data in the high-dimensional space to a space of fewer dimensions

Some typical dimensionality reduction methods:

Principal Component Analysis

Supervised and nonlinear techniques

Feature subset selection

Feature creation

19

Principal Component Analysis (PCA)

PCA: a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components

The original data are projected onto a much smaller space, resulting in dimensionality reduction

Method: find the eigenvectors of the covariance matrix; these eigenvectors define the new space

(Figure: a ball travels in a straight line; data from three cameras contain much redundancy)

21

Principal Components Analysis Intuition

Goal is to find a projection that captures the largest amount of variation in the data

Find the eigenvectors of the covariance matrix; the eigenvectors define the new space

(Figure: data points in the x1-x2 plane, with the first eigenvector e along the direction of greatest variance)

22

Principal Component Analysis Details

Let A be an n × n matrix representing the correlation or covariance of the data. λ is an eigenvalue of A if there exists a non-zero vector v such that

Av = λv, often rewritten as (A − λI)v = 0

In this case, vector v is called an eigenvector of A corresponding to λ. For each eigenvalue λ, the set of all vectors v satisfying Av = λv is called the eigenspace of A corresponding to λ.
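A minimal NumPy sketch of this recipe on illustrative random data (np.linalg.eigh is used because a covariance matrix is symmetric):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 observations, 5 variables

Xc = X - X.mean(axis=0)                # center the data
C = np.cov(Xc, rowvar=False)           # 5 x 5 covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]      # re-sort by decreasing variance
W = eigvecs[:, order[:2]]              # keep the top-2 principal components

Z = Xc @ W                             # project onto the smaller space
print(Z.shape)                         # (100, 2)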

23

Attribute Subset Selection

Another way to reduce dimensionality of data

Redundant attributes: duplicate much or all of the information contained in one or more other attributes

E.g., the purchase price of a product and the amount of sales tax paid

Irrelevant attributes: contain no information that is useful for the data mining task at hand

E.g., a student's ID is often irrelevant to the task of predicting his/her GPA

24

Heuristic Search in Attribute Selection

There are 2^d possible attribute combinations of d attributes

Typical heuristic attribute selection methods (a greedy forward-selection sketch follows below):

Best single attribute under the attribute independence assumption: choose by significance tests

Best step-wise feature selection: the best single attribute is picked first; then the next best attribute conditioned on the first; ...

Step-wise attribute elimination: repeatedly eliminate the worst attribute

Best combined attribute selection and elimination

Optimal branch and bound: use attribute elimination and backtracking
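In the sketch, score is a placeholder you would supply (e.g., cross-validated accuracy of a classifier trained on the subset), and the toy weights simply reuse the information-gain values computed later in this deck:

def forward_select(features, score, k):
    """Greedily add whichever attribute most improves the score."""
    selected = []
    while len(selected) < k:
        best_f, best_s = None, float('-inf')
        for f in features:
            if f not in selected:
                s = score(selected + [f])    # evaluate the candidate subset
                if s > best_s:
                    best_f, best_s = f, s
        selected.append(best_f)
    return selected

gains = {'age': 0.246, 'student': 0.151, 'credit_rating': 0.048, 'income': 0.029}
score = lambda subset: sum(gains[f] for f in subset)   # toy additive score
print(forward_select(list(gains), score, 2))           # ['age', 'student']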

25

Attribute Creation (Feature Generation)

Create new attributes (features) that can capture the important information in a data set more effectively than the original ones

Three general methodologies:

Attribute extraction: domain-specific

Mapping data to a new space (see data reduction), e.g., Fourier transformation, wavelet transformation, manifold approaches (not covered)

Attribute construction: combining features (see discriminative frequent patterns in the chapter on "Advanced Classification"); data discretization

26

Summary

Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability

Data cleaning: e.g., missing/noisy values, outliers

Data integration from multiple sources:

Entity identification problem; remove redundancies; detect inconsistencies

Data reduction:

Dimensionality reduction; numerosity reduction; data compression

Data transformation and data discretization:

Normalization; concept hierarchy generation

27

References

D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Comm. of ACM, 42:73-78, 1999

T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003

T. Dasu, T. Johnson, S. Muthukrishnan, V. Shkapenyuk. Mining Database Structure; Or, How to Build a Data Quality Browser. SIGMOD'02

H. V. Jagadish et al. Special Issue on Data Reduction Techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), Dec. 1997

D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999

E. Rahm and H. H. Do. Data Cleaning: Problems and Current Approaches. IEEE Bulletin of the Technical Committee on Data Engineering, Vol. 23, No. 4

V. Raman and J. Hellerstein. Potter's Wheel: An Interactive Framework for Data Cleaning and Transformation. VLDB'2001

T. Redman. Data Quality: Management and Technology. Bantam Books, 1992

R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans. Knowledge and Data Engineering, 7:623-640, 1995

CS 412 INTRO. TO DATA MINING

Classification: Basic Concepts

Huan Sun, CSE@The Ohio State University

09/05/2017

28

Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han

29

Classification: Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

30


Supervised vs. Unsupervised Learning

Supervised learning (classification):

Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations

New data is classified based on the training set

Unsupervised learning (clustering):

The class labels of the training data are unknown

Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

32


Prediction Problems: Classification vs. Numeric Prediction

Classification:

Predicts categorical class labels (discrete or nominal)

Classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data

Numeric prediction:

Models continuous-valued functions, i.e., predicts unknown or missing values

Typical applications:

Credit/loan approval

Medical diagnosis: if a tumor is cancerous or benign

Fraud detection: if a transaction is fraudulent

Web page categorization: which category it is

34


Classification—A Two-Step Process

(1) Model construction: describing a set of predetermined classes

Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute

The set of tuples used for model construction is the training set

The model is represented as classification rules, decision trees, or mathematical formulae

(2) Model usage: for classifying future or unknown objects

Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model; accuracy is the percentage of test set samples that are correctly classified by the model

The test set is independent of the training set (otherwise overfitting)

If the accuracy is acceptable, use the model to classify new data

Note: if the test set is used to select/refine models, it is called the validation (test) set or development test set
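A tiny sketch of the accuracy estimate in step (2), with hypothetical test labels; accuracy is simply the fraction of test samples the model classifies correctly:

def accuracy(y_true, y_pred):
    """Percentage of test set samples correctly classified by the model."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = ['no', 'no', 'yes', 'yes']    # known labels of the test samples
y_pred = ['no', 'yes', 'yes', 'yes']   # model output on the same samples
print(accuracy(y_true, y_pred))        # 0.75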

37


Step (1): Model Construction

Training Data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classification Algorithms

Classifier (Model):

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'

39


Step (2): Using the Model in Prediction

Classifier

Testing Data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

New / Unseen Data: (Jeff, Professor, 4)

Tenured?

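Applying the rule learned in step (1), first to the testing data and then to the new tuple; this is a direct transcription of the slide's rule, not a general learner:

def tenured(rank, years):
    # IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return 'yes' if rank == 'Professor' or years > 6 else 'no'

testing = [('Tom', 'Assistant Prof', 2), ('Merlisa', 'Associate Prof', 7),
           ('George', 'Professor', 5), ('Joseph', 'Assistant Prof', 7)]
for name, rank, years in testing:
    print(name, tenured(rank, years))
# Tom: no, Merlisa: yes, George: yes, Joseph: yes
# Merlisa's true label is 'no', so accuracy on this test set is 3/4

print(tenured('Professor', 4))   # (Jeff, Professor, 4) -> 'yes'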

41

Classification: Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

42

Decision Tree Induction: An Example

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

Training data set: Buys_computer

The data set follows an example of Quinlan's ID3 (Playing Tennis)

43

Decision Tree Induction: An Example

Training data set: Buys_computer (the table above). Resulting tree:

age?
  <=30   -> student?
              no  -> no
              yes -> yes
  31…40  -> yes
  >40    -> credit_rating?
              excellent -> no
              fair      -> yes
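As a cross-check, a sketch that fits an entropy-based tree to this table with scikit-learn (assumed installed); sklearn grows binary trees over one-hot features rather than ID3's multi-way splits, so the printed tree may differ in shape from the slide's:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rows = [('<=30', 'high', 'no', 'fair', 'no'), ('<=30', 'high', 'no', 'excellent', 'no'),
        ('31…40', 'high', 'no', 'fair', 'yes'), ('>40', 'medium', 'no', 'fair', 'yes'),
        ('>40', 'low', 'yes', 'fair', 'yes'), ('>40', 'low', 'yes', 'excellent', 'no'),
        ('31…40', 'low', 'yes', 'excellent', 'yes'), ('<=30', 'medium', 'no', 'fair', 'no'),
        ('<=30', 'low', 'yes', 'fair', 'yes'), ('>40', 'medium', 'yes', 'fair', 'yes'),
        ('<=30', 'medium', 'yes', 'excellent', 'yes'), ('31…40', 'medium', 'no', 'excellent', 'yes'),
        ('31…40', 'high', 'yes', 'fair', 'yes'), ('>40', 'medium', 'no', 'excellent', 'no')]
df = pd.DataFrame(rows, columns=['age', 'income', 'student', 'credit_rating', 'buys_computer'])

X = pd.get_dummies(df.drop(columns='buys_computer'))   # one-hot encode categoricals
y = df['buys_computer']
tree = DecisionTreeClassifier(criterion='entropy', random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))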

44


Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm):

The tree is constructed in a top-down, recursive, divide-and-conquer manner

At the start, all the training examples are at the root

Attributes are categorical (if continuous-valued, they are discretized in advance)

Examples are partitioned recursively based on selected attributes

Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Conditions for stopping partitioning:

All samples for a given node belong to the same class

There are no remaining attributes for further partitioning—majority voting is employed for classifying the leaf

There are no samples left

46

Brief Review of Entropy

Entropy (information theory): a measure of the uncertainty associated with a random variable

Calculation: for a discrete random variable Y taking m distinct values {y1, y2, …, ym} with probabilities p1, …, pm:

H(Y) = −Σ_{i=1..m} p_i log2(p_i)

Interpretation: higher entropy → higher uncertainty; lower entropy → lower uncertainty

Conditional entropy: H(Y | X) = Σ_x p(x) H(Y | X = x)

(Figure: the binary-entropy curve for m = 2, maximized at p = 0.5)
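A two-line check of the formula in Python, including the 9-yes/5-no class distribution used on the next slides:

from math import log2

def entropy(probs):
    """H(Y) = -sum_i p_i log2(p_i); zero-probability terms contribute 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))              # 1.0: m = 2, maximum uncertainty
print(round(entropy([9/14, 5/14]), 3))  # 0.94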

47

Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain

Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|

Expected information (entropy) needed to classify a tuple in D:

Info(D) = −Σ_{i=1..m} p_i log2(p_i)

Information needed (after using A to split D into v partitions) to classify D:

Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j)

Information gained by branching on attribute A:

Gain(A) = Info(D) − Info_A(D)
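The three formulas transcribed into a small Python sketch (here partitions is a list of per-branch class-count lists):

from math import log2

def info(counts):
    """Info(D), computed from the class counts in D."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

def info_a(partitions):
    """Info_A(D): weighted information after splitting D on attribute A."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * info(p) for p in partitions)

def gain(counts, partitions):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(counts) - info_a(partitions)

# Splitting the 14 tuples on age: <=30 -> (2, 3), 31…40 -> (4, 0), >40 -> (3, 2)
print(gain([9, 5], [[2, 3], [4, 0], [3, 2]]))
# 0.2467... (the slide's 0.246 rounds Info(D) = 0.940 and Info_age(D) = 0.694 first)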

48

Attribute Selection: Information Gain

Class P: buys_computer = "yes"; Class N: buys_computer = "no" (the training data is the 14-tuple Buys_computer table above, with 9 P and 5 N tuples)

How to select the first attribute?

Info(D) = I(9, 5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

Look at "age":

age     p_i  n_i  I(p_i, n_i)
<=30    2    3    0.971
31…40   4    0    0
>40     3    2    0.971

Info_age(D) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.694

where (5/14) I(2, 3) means "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's

Gain(age) = Info(D) − Info_age(D) = 0.246

Similarly:

Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

So age has the highest information gain and is selected as the first splitting attribute (a verification sketch follows below)

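Finally, a pure-Python sketch that recomputes all four gains directly from the 14-tuple table, confirming that age is the best first split (small differences from the slide come from rounding intermediate values):

from math import log2
from collections import Counter, defaultdict

data = [('<=30', 'high', 'no', 'fair', 'no'), ('<=30', 'high', 'no', 'excellent', 'no'),
        ('31…40', 'high', 'no', 'fair', 'yes'), ('>40', 'medium', 'no', 'fair', 'yes'),
        ('>40', 'low', 'yes', 'fair', 'yes'), ('>40', 'low', 'yes', 'excellent', 'no'),
        ('31…40', 'low', 'yes', 'excellent', 'yes'), ('<=30', 'medium', 'no', 'fair', 'no'),
        ('<=30', 'low', 'yes', 'fair', 'yes'), ('>40', 'medium', 'yes', 'fair', 'yes'),
        ('<=30', 'medium', 'yes', 'excellent', 'yes'), ('31…40', 'medium', 'no', 'excellent', 'yes'),
        ('31…40', 'high', 'yes', 'fair', 'yes'), ('>40', 'medium', 'no', 'excellent', 'no')]

def H(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

labels = [row[-1] for row in data]
for i, attr in enumerate(['age', 'income', 'student', 'credit_rating']):
    groups = defaultdict(list)
    for row in data:
        groups[row[i]].append(row[-1])
    info_attr = sum(len(g) / len(data) * H(g) for g in groups.values())
    print(attr, round(H(labels) - info_attr, 3))
# age 0.247, income 0.029, student 0.152, credit_rating 0.048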

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Page 12: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

12

Example Binning Methods for Data Smoothing

Sorted data for price (in dollars) 4 8 9 15 21 21 24 25 26 28 29 34 Partition into equal-frequency (equi-width) bins

- Bin 1 4 8 9 15- Bin 2 21 21 24 25- Bin 3 26 28 29 34

Smoothing by bin means- Bin 1 9 9 9 9- Bin 2 23 23 23 23- Bin 3 29 29 29 29

Smoothing by bin boundaries- Bin 1 4 4 4 15- Bin 2 21 21 25 25- Bin 3 26 26 26 34

13

Discretization by Classification amp Correlation Analysis

Classification (eg decision tree analysis)

Supervised Given class labels eg cancerous vs benign

Using entropy to determine split point (discretization point)

Top-down recursive split

Details to be covered in ldquoClassificationrdquo sessions

14

Chapter 3 Data Preprocessing

Data Preprocessing An Overview

Data Cleaning

Data Integration

Data Reduction and Transformation

Dimensionality Reduction

Summary

15

Dimensionality Reduction

Curse of dimensionality When dimensionality increases data becomes increasingly sparse Density and distance between points which is critical to clustering outlier analysis

becomes less meaningful The possible combinations of subspaces will grow exponentially

16

Dimensionality Reduction

Curse of dimensionality When dimensionality increases data becomes increasingly sparse Density and distance between points which is critical to clustering outlier analysis

becomes less meaningful The possible combinations of subspaces will grow exponentially

Dimensionality reduction Reducing the number of random variables under consideration via obtaining a set

of principal variables

17

Dimensionality Reduction

Curse of dimensionality When dimensionality increases data becomes increasingly sparse Density and distance between points which is critical to clustering outlier analysis becomes less

meaningful The possible combinations of subspaces will grow exponentially

Dimensionality reduction Reducing the number of random variables under consideration via obtaining a set of principal

variables

Advantages of dimensionality reduction Avoid the curse of dimensionality Help eliminate irrelevant features and reduce noise Reduce time and space required in data mining Allow easier visualization

18

Dimensionality Reduction Techniques

Dimensionality reduction methodologies

Feature selection Find a subset of the original variables (or features attributes)

Feature extraction Transform the data in the high-dimensional space to a space of fewer dimensions

Some typical dimensionality reduction methods

Principal Component Analysis

Supervised and nonlinear techniques

Feature subset selection

Feature creation

19

PCA A statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components

The original data are projected onto a much smaller space resulting in dimensionality reduction

Method Find the eigenvectors of the covariance matrix and these eigenvectors define the new space

Ball travels in a straight line Data from three cameras contain much redundancy

Principal Component Analysis (PCA)

21

Principal Components Analysis Intuition

Goal is to find a projection that captures the largest amount of variation in data

Find the eigenvectors of the covariance matrix The eigenvectors define the new space

x2

x1

e

22

Principal Component Analysis Details

Let A be an n timesn matrix representing the correlation or covariance of the data λ is an eigenvalue of A if there exists a non-zero vector v such that

Av = λ v often rewritten as (A- λI)v=0

In this case vector v is called an eigenvector of A corresponding to λ For each eigenvalue λ the set of all vectors v satisfying Av = λ v is called the eigenspace of A corresponding to λ

23

Attribute Subset Selection

Another way to reduce dimensionality of data

Redundant attributes Duplicate much or all of the information contained in

one or more other attributes

Eg purchase price of a product and the amount of sales tax paid

Irrelevant attributes Contain no information that is useful for the data

mining task at hand

Ex A studentrsquos ID is often irrelevant to the task of predicting hisher GPA

24

Heuristic Search in Attribute Selection

There are 2d possible attribute combinations of d attributes Typical heuristic attribute selection methods

Best single attribute under the attribute independence assumption choose by significance tests

Best step-wise feature selection The best single-attribute is picked first Then next best attribute condition to the first

Step-wise attribute elimination Repeatedly eliminate the worst attribute

Best combined attribute selection and elimination Optimal branch and bound Use attribute elimination and backtracking

25

Attribute Creation (Feature Generation)

Create new attributes (features) that can capture the important information in a data set more effectively than the original ones

Three general methodologies Attribute extraction Domain-specific

Mapping data to new space (see data reduction) Eg Fourier transformation wavelet transformation manifold approaches (not covered)

Attribute construction Combining features (see discriminative frequent patterns in Chapter on ldquoAdvanced

Classificationrdquo) Data discretization

26

Summary

Data quality accuracy completeness consistency timeliness believability interpretability

Data cleaning eg missingnoisy values outliers

Data integration from multiple sources

Entity identification problem Remove redundancies Detect inconsistencies

Data reduction

Dimensionality reduction Numerosity reduction Data compression

Data transformation and data discretization

Normalization Concept hierarchy generation

27

D P Ballou and G K Tayi Enhancing data quality in data warehouse environments Comm of ACM 4273-78 1999

T Dasu and T Johnson Exploratory Data Mining and Data Cleaning John Wiley 2003 T Dasu T Johnson S Muthukrishnan V Shkapenyuk Mining Database Structure Or How to Build a Data

Quality Browser SIGMODrsquo02 H V Jagadish et al Special Issue on Data Reduction Techniques Bulletin of the Technical Committee on

Data Engineering 20(4) Dec 1997 D Pyle Data Preparation for Data Mining Morgan Kaufmann 1999 E Rahm and H H Do Data Cleaning Problems and Current Approaches IEEE Bulletin of the Technical

Committee on Data Engineering Vol23 No4 V Raman and J Hellerstein Potters Wheel An Interactive Framework for Data Cleaning and

Transformation VLDBrsquo2001 T Redman Data Quality Management and Technology Bantam Books 1992 R Wang V Storey and C Firth A framework for analysis of data quality research IEEE Trans

Knowledge and Data Engineering 7623-640 1995

References

CS 412 INTRO TO DATA MINING

Classification Basic Concepts Huan Sun CSEThe Ohio State University

09052017

28Slides adapted from UIUC CS412 Fall 2017 by Prof Jiawei Han

29

Classification Basic Concepts Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

30

Supervised vs Unsupervised Learning Supervised learning (classification)

Supervision The training data (observations measurements etc) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

31

Supervised vs Unsupervised Learning Supervised learning (classification)

Supervision The training data (observations measurements etc) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

Unsupervised learning (clustering)

The class labels of training data is unknown

Given a set of measurements observations etc with the aim of establishing the

existence of classes or clusters in the data

32

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

33

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

Typical applications

Creditloan approval

Medical diagnosis if a tumor is cancerous or benign

Fraud detection if a transaction is fraudulent

Web page categorization which category it is

34

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

35

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

36

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

Note If the test set is used to selectrefine models it is called validation (test) set or development test set

37

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

Classifier(Model)

Sheet1

38

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo

Classifier(Model)

Sheet1

39

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Sheet1

40

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

NewUnseen Data

(Jeff Professor 4)

Tenured

Sheet1

41

Classification Basic Concepts

Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

42

Decision Tree Induction An Example

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis)

Sheet1

43

Decision Tree Induction An Example

age

overcast

student credit rating

lt=30 gt40

no yes yes

yes

3140

fairexcellentyesno

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis) Resulting tree

Sheet1

44

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain)

45

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain) Conditions for stopping partitioning

All samples for a given node belong to the same class There are no remaining attributes for further partitioningmdashmajority voting is

employed for classifying the leaf There are no samples left

46

Brief Review of Entropy Entropy (Information Theory)

A measure of uncertainty associated with a random number Calculation For a discrete random variable Y taking m distinct values y1 y2 hellip ym

Interpretation Higher entropy rarr higher uncertainty Lower entropy rarr lower uncertainty

Conditional entropy

m = 2

47

Attribute Selection Measure Information Gain (ID3C45)

Select the attribute with the highest information gain Let pi be the probability that an arbitrary tuple in D belongs to class Ci

estimated by |Ci D||D| Expected information (entropy) needed to classify a tuple in D

Information needed (after using A to split D into v partitions) to classify D

Information gained by branching on attribute A

)(log)( 21

i

m

ii ppDInfo sum

=

minus=

)(||||

)(1

j

v

j

jA DInfo

DD

DInfo times=sum=

(D)InfoInfo(D)Gain(A) Aminus=

48

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

How to select the first attribute

Sheet1

49

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

Sheet1

50

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

Sheet1

51

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

Sheet1

52

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

means ldquoage lt=30rdquo has 5 out of 14 samples with 2 yesrsquoes and 3 norsquos

)32(145 I

53

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Sheet1

54

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Similarly

0480)_(1510)(0290)(

===

ratingcreditGainstudentGainincomeGain How

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Page 13: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

13

Discretization by Classification amp Correlation Analysis

Classification (eg decision tree analysis)

Supervised Given class labels eg cancerous vs benign

Using entropy to determine split point (discretization point)

Top-down recursive split

Details to be covered in ldquoClassificationrdquo sessions

14

Chapter 3 Data Preprocessing

Data Preprocessing An Overview

Data Cleaning

Data Integration

Data Reduction and Transformation

Dimensionality Reduction

Summary

15

Dimensionality Reduction

Curse of dimensionality When dimensionality increases data becomes increasingly sparse Density and distance between points which is critical to clustering outlier analysis

becomes less meaningful The possible combinations of subspaces will grow exponentially

16

Dimensionality Reduction

Curse of dimensionality When dimensionality increases data becomes increasingly sparse Density and distance between points which is critical to clustering outlier analysis

becomes less meaningful The possible combinations of subspaces will grow exponentially

Dimensionality reduction Reducing the number of random variables under consideration via obtaining a set

of principal variables

17

Dimensionality Reduction

Curse of dimensionality When dimensionality increases data becomes increasingly sparse Density and distance between points which is critical to clustering outlier analysis becomes less

meaningful The possible combinations of subspaces will grow exponentially

Dimensionality reduction Reducing the number of random variables under consideration via obtaining a set of principal

variables

Advantages of dimensionality reduction Avoid the curse of dimensionality Help eliminate irrelevant features and reduce noise Reduce time and space required in data mining Allow easier visualization

18

Dimensionality Reduction Techniques

Dimensionality reduction methodologies

Feature selection Find a subset of the original variables (or features attributes)

Feature extraction Transform the data in the high-dimensional space to a space of fewer dimensions

Some typical dimensionality reduction methods

Principal Component Analysis

Supervised and nonlinear techniques

Feature subset selection

Feature creation

19

PCA A statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components

The original data are projected onto a much smaller space resulting in dimensionality reduction

Method Find the eigenvectors of the covariance matrix and these eigenvectors define the new space

Ball travels in a straight line Data from three cameras contain much redundancy

Principal Component Analysis (PCA)

21

Principal Components Analysis Intuition

Goal is to find a projection that captures the largest amount of variation in data

Find the eigenvectors of the covariance matrix The eigenvectors define the new space

x2

x1

e

22

Principal Component Analysis Details

Let A be an n timesn matrix representing the correlation or covariance of the data λ is an eigenvalue of A if there exists a non-zero vector v such that

Av = λ v often rewritten as (A- λI)v=0

In this case vector v is called an eigenvector of A corresponding to λ For each eigenvalue λ the set of all vectors v satisfying Av = λ v is called the eigenspace of A corresponding to λ

23

Attribute Subset Selection

Another way to reduce dimensionality of data

Redundant attributes Duplicate much or all of the information contained in

one or more other attributes

Eg purchase price of a product and the amount of sales tax paid

Irrelevant attributes Contain no information that is useful for the data

mining task at hand

Ex A studentrsquos ID is often irrelevant to the task of predicting hisher GPA

24

Heuristic Search in Attribute Selection

There are 2^d possible attribute combinations of d attributes. Typical heuristic attribute selection methods:

Best single attribute under the attribute independence assumption: choose by significance tests.

Best step-wise feature selection: The best single attribute is picked first; then the next best attribute conditioned on the first, and so on (see the sketch after this list).

Step-wise attribute elimination: Repeatedly eliminate the worst attribute.

Best combined attribute selection and elimination. Optimal branch and bound: use attribute elimination and backtracking.
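A minimal sketch of best step-wise (forward) selection, assuming a caller-supplied score(attributes) function (e.g., validation accuracy of a model trained on that subset); the function name is hypothetical.

```python
def forward_selection(attributes, score):
    """Greedily add the single attribute that most improves the score,
    avoiding a scan over all 2^d attribute combinations."""
    selected = []
    remaining = list(attributes)
    best = float("-inf")
    while remaining:
        cand, cand_score = max(((a, score(selected + [a])) for a in remaining),
                               key=lambda pair: pair[1])
        if cand_score <= best:      # no single attribute helps any more
            break
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
    return selected
```

Step-wise elimination runs the same greedy loop in reverse: start from all attributes and repeatedly drop the one whose removal hurts the score least.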

25

Attribute Creation (Feature Generation)

Create new attributes (features) that can capture the important information in a data set more effectively than the original ones.

Three general methodologies:

Attribute extraction: Domain-specific.

Mapping data to a new space (see data reduction): e.g., Fourier transformation, wavelet transformation, manifold approaches (not covered).

Attribute construction: Combining features (see discriminative frequent patterns in the chapter on "Advanced Classification"); data discretization.

26

Summary

Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability

Data cleaning: e.g., missing/noisy values, outliers

Data integration from multiple sources:

Entity identification problem; remove redundancies; detect inconsistencies

Data reduction:

Dimensionality reduction; numerosity reduction; data compression

Data transformation and data discretization:

Normalization; concept hierarchy generation

27

D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Comm. of ACM, 42:73-78, 1999.

T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003.

T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining Database Structure; Or, How to Build a Data Quality Browser. SIGMOD'02.

H. V. Jagadish et al. Special Issue on Data Reduction Techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), Dec. 1997.

D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.

E. Rahm and H. H. Do. Data Cleaning: Problems and Current Approaches. IEEE Bulletin of the Technical Committee on Data Engineering, Vol. 23, No. 4.

V. Raman and J. Hellerstein. Potter's Wheel: An Interactive Framework for Data Cleaning and Transformation. VLDB'2001.

T. Redman. Data Quality: Management and Technology. Bantam Books, 1992.

R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans. Knowledge and Data Engineering, 7:623-640, 1995.

References

CS 412 INTRO. TO DATA MINING

Classification: Basic Concepts
Huan Sun, CSE@The Ohio State University

09/05/2017

28
Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han

29

Classification: Basic Concepts

Classification: Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy: Ensemble Methods

Summary

30

Supervised vs. Unsupervised Learning

Supervised learning (classification):

Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.

New data is classified based on the training set.

31

Supervised vs. Unsupervised Learning

Supervised learning (classification):

Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.

New data is classified based on the training set.

Unsupervised learning (clustering):

The class labels of training data are unknown.

Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.

32

Prediction Problems: Classification vs. Numeric Prediction

Classification:

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data

Numeric prediction:

models continuous-valued functions, i.e., predicts unknown or missing values

33

Prediction Problems: Classification vs. Numeric Prediction

Classification:

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data

Numeric prediction:

models continuous-valued functions, i.e., predicts unknown or missing values

Typical applications:

Credit/loan approval

Medical diagnosis: if a tumor is cancerous or benign

Fraud detection: if a transaction is fraudulent

Web page categorization: which category it is

34

Classification: A Two-Step Process

(1) Model construction: describing a set of predetermined classes.

Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.

The set of tuples used for model construction is the training set. The model is represented as classification rules, decision trees, or mathematical formulae.

35

Classification: A Two-Step Process

(1) Model construction: describing a set of predetermined classes.

Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.

The set of tuples used for model construction is the training set. The model is represented as classification rules, decision trees, or mathematical formulae.

(2) Model usage: classifying future or unknown objects.

Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model; accuracy is the percentage of test set samples that are correctly classified by the model. The test set is independent of the training set (otherwise overfitting occurs).

If the accuracy is acceptable, use the model to classify new data.

36

Classification: A Two-Step Process

(1) Model construction: describing a set of predetermined classes.

Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.

The set of tuples used for model construction is the training set. The model is represented as classification rules, decision trees, or mathematical formulae.

(2) Model usage: classifying future or unknown objects.

Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model; accuracy is the percentage of test set samples that are correctly classified by the model. The test set is independent of the training set (otherwise overfitting occurs).

If the accuracy is acceptable, use the model to classify new data.

Note: If the test set is used to select/refine models, it is called the validation (test) set or development test set.
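A small sketch of the two-step process (illustrative only; the model below is a placeholder rule, not the slides' method): build a model on the training set, then estimate its accuracy on an independent test set.

```python
def accuracy(model, test_set):
    """Step 2: fraction of test samples whose known label
    matches the model's classification."""
    correct = sum(1 for x, label in test_set if model(x) == label)
    return correct / len(test_set)

# Step 1: model construction (here, a toy classification rule).
model = lambda x: "yes" if x["rank"] == "professor" or x["years"] > 6 else "no"

test_set = [({"rank": "professor", "years": 5}, "yes"),
            ({"rank": "assistant prof", "years": 2}, "no")]
print(accuracy(model, test_set))   # 1.0; if acceptable, classify new data
```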

37

Step (1) Model Construction

Training Data:

NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no

Classification Algorithms

Classifier (Model)

38

Step (1) Model Construction

Training Data: (table as above)

Classification Algorithms

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classifier (Model)

39

Step (2) Using the Model in Prediction

Classifier

Testing Data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

40

Step (2) Using the Model in Prediction

Classifier

Testing Data: (table as above)

New/Unseen Data:

(Jeff, Professor, 4)

Tenured? (By the rule above, the prediction is 'yes'.)
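A sketch of these two slides in code (values taken from the tables above): the learned rule is applied to the testing data to estimate accuracy, then to the new, unseen instance.

```python
def classify(rank, years):
    # IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank == "Professor" or years > 6 else "no"

testing = [("Tom", "Assistant Prof", 2, "no"),
           ("Merlisa", "Associate Prof", 7, "no"),
           ("George", "Professor", 5, "yes"),
           ("Joseph", "Assistant Prof", 7, "yes")]

correct = sum(classify(rank, years) == label
              for _, rank, years, label in testing)
print(correct / len(testing))    # 0.75: the rule misclassifies Merlisa

print(classify("Professor", 4))  # 'yes' -> Jeff is predicted tenured
```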

41

Classification Basic Concepts

Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy: Ensemble Methods

Summary

42

Decision Tree Induction: An Example

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

Training data set: buys_computer. The data set follows an example of Quinlan's ID3 (Playing Tennis).

43

Decision Tree Induction: An Example

Resulting tree:

age?
  <=30:  student?
           no  -> no
           yes -> yes
  31…40: yes
  >40:   credit rating?
           excellent -> no
           fair      -> yes

Training data set: buys_computer (table as above). The data set follows an example of Quinlan's ID3 (Playing Tennis).

44

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm):

The tree is constructed in a top-down, recursive, divide-and-conquer manner.

At the start, all the training examples are at the root. Attributes are categorical (if continuous-valued, they are discretized in advance). Examples are partitioned recursively based on selected attributes. Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).

45

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm):

The tree is constructed in a top-down, recursive, divide-and-conquer manner.

At the start, all the training examples are at the root. Attributes are categorical (if continuous-valued, they are discretized in advance). Examples are partitioned recursively based on selected attributes. Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).

Conditions for stopping partitioning (a sketch follows):

All samples for a given node belong to the same class.

There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf.

There are no samples left.
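A compact sketch of this basic algorithm (illustrative; best_attribute stands for any selection heuristic, such as the information gain defined on the following slides):

```python
from collections import Counter

def build_tree(examples, attributes, best_attribute, default=None):
    """examples: list of (attribute_dict, label) pairs."""
    if not examples:                          # no samples left
        return default
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:                 # all samples in one class
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes:                        # no attributes left: majority vote
        return majority
    a = best_attribute(examples, attributes)  # greedy heuristic choice
    tree = {"split_on": a}
    for value in {x[a] for x, _ in examples}: # partition recursively
        subset = [(x, l) for x, l in examples if x[a] == value]
        rest = [b for b in attributes if b != a]
        tree[value] = build_tree(subset, rest, best_attribute, majority)
    return tree
```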

46

Brief Review of Entropy

Entropy (information theory): A measure of uncertainty associated with a random variable.

Calculation: For a discrete random variable Y taking m distinct values {y1, y2, …, ym}, H(Y) = -Σ_{i=1}^{m} p_i log2(p_i), where p_i = P(Y = y_i).

Interpretation: Higher entropy → higher uncertainty; lower entropy → lower uncertainty.

Conditional entropy: H(Y|X) = Σ_x P(X = x) H(Y | X = x).

(Figure: the binary entropy function, the case m = 2.)
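A minimal sketch of the calculation:

```python
from math import log2

def entropy(probs):
    """H(Y) = -sum_i p_i * log2(p_i); zero-probability terms contribute 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit: maximum uncertainty for m = 2
print(entropy([1.0]))         # 0.0: no uncertainty
print(entropy([9/14, 5/14]))  # ~0.940, reused as Info(D) below
```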

47

Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain. Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|.

Expected information (entropy) needed to classify a tuple in D:

Information needed (after using A to split D into v partitions) to classify D:

Information gained by branching on attribute A:

Info(D) = -Σ_{i=1}^{m} p_i log2(p_i)

Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)

Gain(A) = Info(D) - Info_A(D)
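These formulas in a short sketch (the age column and class labels are taken from the buys_computer table on the following slides; the code reproduces the numbers worked out there):

```python
from collections import Counter
from math import log2

def info(labels):
    """Info(D): expected information needed to classify a tuple in D."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(values, labels):
    """Gain(A) = Info(D) - Info_A(D), where `values` is attribute A's column."""
    parts = {}
    for v, label in zip(values, labels):
        parts.setdefault(v, []).append(label)
    info_a = sum(len(p) / len(labels) * info(p) for p in parts.values())
    return info(labels) - info_a

age    = ["<=30", "<=30", "31…40", ">40", ">40", ">40", "31…40",
          "<=30", "<=30", ">40", "<=30", "31…40", "31…40", ">40"]
labels = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]

print(info(labels))       # 0.940... = Info(D)
print(gain(age, labels))  # 0.2467... (0.940 - 0.694 = 0.246 with slide rounding)
```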

48

Attribute Selection: Information Gain

Class P: buys_computer = "yes"; Class N: buys_computer = "no"

(Training data: the buys_computer table above.)

How to select the first attribute?

49

Attribute Selection: Information Gain

Class P: buys_computer = "yes"; Class N: buys_computer = "no"

(Training data: the buys_computer table above.)

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

50

Attribute Selection: Information Gain

Class P: buys_computer = "yes"; Class N: buys_computer = "no"

(Training data: the buys_computer table above.)

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Look at "age":

age     p_i  n_i  I(p_i, n_i)
<=30    2    3    0.971
31…40   4    0    0
>40     3    2    0.971

51

Attribute Selection: Information Gain

Class P: buys_computer = "yes"; Class N: buys_computer = "no"

(Training data: the buys_computer table above.)

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Look at "age":

age     p_i  n_i  I(p_i, n_i)
<=30    2    3    0.971
31…40   4    0    0
>40     3    2    0.971

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

52

Attribute Selection: Information Gain

Class P: buys_computer = "yes"; Class N: buys_computer = "no"

Look at "age":

age     p_i  n_i  I(p_i, n_i)
<=30    2    3    0.971
31…40   4    0    0
>40     3    2    0.971

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

The term (5/14) I(2,3) means that "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's.

53

Attribute Selection: Information Gain

Class P: buys_computer = "yes"; Class N: buys_computer = "no"

(Training data: the buys_computer table above.)

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Gain(age) = Info(D) - Info_age(D) = 0.246

54

Attribute Selection: Information Gain

Class P: buys_computer = "yes"; Class N: buys_computer = "no"

(Training data: the buys_computer table above.)

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Gain(age) = Info(D) - Info_age(D) = 0.246

Similarly:

Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

Since age has the highest information gain, it is selected as the first splitting attribute.

Page 14: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

14

Chapter 3 Data Preprocessing

Data Preprocessing An Overview

Data Cleaning

Data Integration

Data Reduction and Transformation

Dimensionality Reduction

Summary

15

Dimensionality Reduction

Curse of dimensionality When dimensionality increases data becomes increasingly sparse Density and distance between points which is critical to clustering outlier analysis

becomes less meaningful The possible combinations of subspaces will grow exponentially

16

Dimensionality Reduction

Curse of dimensionality When dimensionality increases data becomes increasingly sparse Density and distance between points which is critical to clustering outlier analysis

becomes less meaningful The possible combinations of subspaces will grow exponentially

Dimensionality reduction Reducing the number of random variables under consideration via obtaining a set

of principal variables

17

Dimensionality Reduction

Curse of dimensionality When dimensionality increases data becomes increasingly sparse Density and distance between points which is critical to clustering outlier analysis becomes less

meaningful The possible combinations of subspaces will grow exponentially

Dimensionality reduction Reducing the number of random variables under consideration via obtaining a set of principal

variables

Advantages of dimensionality reduction Avoid the curse of dimensionality Help eliminate irrelevant features and reduce noise Reduce time and space required in data mining Allow easier visualization

18

Dimensionality Reduction Techniques

Dimensionality reduction methodologies

Feature selection Find a subset of the original variables (or features attributes)

Feature extraction Transform the data in the high-dimensional space to a space of fewer dimensions

Some typical dimensionality reduction methods

Principal Component Analysis

Supervised and nonlinear techniques

Feature subset selection

Feature creation

19

PCA A statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components

The original data are projected onto a much smaller space resulting in dimensionality reduction

Method Find the eigenvectors of the covariance matrix and these eigenvectors define the new space

Ball travels in a straight line Data from three cameras contain much redundancy

Principal Component Analysis (PCA)

21

Principal Components Analysis Intuition

Goal is to find a projection that captures the largest amount of variation in data

Find the eigenvectors of the covariance matrix The eigenvectors define the new space

x2

x1

e

22

Principal Component Analysis Details

Let A be an n timesn matrix representing the correlation or covariance of the data λ is an eigenvalue of A if there exists a non-zero vector v such that

Av = λ v often rewritten as (A- λI)v=0

In this case vector v is called an eigenvector of A corresponding to λ For each eigenvalue λ the set of all vectors v satisfying Av = λ v is called the eigenspace of A corresponding to λ

23

Attribute Subset Selection

Another way to reduce dimensionality of data

Redundant attributes Duplicate much or all of the information contained in

one or more other attributes

Eg purchase price of a product and the amount of sales tax paid

Irrelevant attributes Contain no information that is useful for the data

mining task at hand

Ex A studentrsquos ID is often irrelevant to the task of predicting hisher GPA

24

Heuristic Search in Attribute Selection

There are 2d possible attribute combinations of d attributes Typical heuristic attribute selection methods

Best single attribute under the attribute independence assumption choose by significance tests

Best step-wise feature selection The best single-attribute is picked first Then next best attribute condition to the first

Step-wise attribute elimination Repeatedly eliminate the worst attribute

Best combined attribute selection and elimination Optimal branch and bound Use attribute elimination and backtracking

25

Attribute Creation (Feature Generation)

Create new attributes (features) that can capture the important information in a data set more effectively than the original ones

Three general methodologies Attribute extraction Domain-specific

Mapping data to new space (see data reduction) Eg Fourier transformation wavelet transformation manifold approaches (not covered)

Attribute construction Combining features (see discriminative frequent patterns in Chapter on ldquoAdvanced

Classificationrdquo) Data discretization

26

Summary

Data quality accuracy completeness consistency timeliness believability interpretability

Data cleaning eg missingnoisy values outliers

Data integration from multiple sources

Entity identification problem Remove redundancies Detect inconsistencies

Data reduction

Dimensionality reduction Numerosity reduction Data compression

Data transformation and data discretization

Normalization Concept hierarchy generation

27

D P Ballou and G K Tayi Enhancing data quality in data warehouse environments Comm of ACM 4273-78 1999

T Dasu and T Johnson Exploratory Data Mining and Data Cleaning John Wiley 2003 T Dasu T Johnson S Muthukrishnan V Shkapenyuk Mining Database Structure Or How to Build a Data

Quality Browser SIGMODrsquo02 H V Jagadish et al Special Issue on Data Reduction Techniques Bulletin of the Technical Committee on

Data Engineering 20(4) Dec 1997 D Pyle Data Preparation for Data Mining Morgan Kaufmann 1999 E Rahm and H H Do Data Cleaning Problems and Current Approaches IEEE Bulletin of the Technical

Committee on Data Engineering Vol23 No4 V Raman and J Hellerstein Potters Wheel An Interactive Framework for Data Cleaning and

Transformation VLDBrsquo2001 T Redman Data Quality Management and Technology Bantam Books 1992 R Wang V Storey and C Firth A framework for analysis of data quality research IEEE Trans

Knowledge and Data Engineering 7623-640 1995

References

CS 412 INTRO TO DATA MINING

Classification Basic Concepts Huan Sun CSEThe Ohio State University

09052017

28Slides adapted from UIUC CS412 Fall 2017 by Prof Jiawei Han

29

Classification Basic Concepts Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

30

Supervised vs Unsupervised Learning Supervised learning (classification)

Supervision The training data (observations measurements etc) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

31

Supervised vs Unsupervised Learning Supervised learning (classification)

Supervision The training data (observations measurements etc) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

Unsupervised learning (clustering)

The class labels of training data is unknown

Given a set of measurements observations etc with the aim of establishing the

existence of classes or clusters in the data

32

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

33

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

Typical applications

Creditloan approval

Medical diagnosis if a tumor is cancerous or benign

Fraud detection if a transaction is fraudulent

Web page categorization which category it is

34

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

35

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

36

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

Note If the test set is used to selectrefine models it is called validation (test) set or development test set

37

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

Classifier(Model)

Sheet1

38

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo

Classifier(Model)

Sheet1

39

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Sheet1

40

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

NewUnseen Data

(Jeff Professor 4)

Tenured

Sheet1

41

Classification Basic Concepts

Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

42

Decision Tree Induction An Example

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis)

Sheet1

43

Decision Tree Induction An Example

age

overcast

student credit rating

lt=30 gt40

no yes yes

yes

3140

fairexcellentyesno

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis) Resulting tree

Sheet1

44

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain)

45

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain) Conditions for stopping partitioning

All samples for a given node belong to the same class There are no remaining attributes for further partitioningmdashmajority voting is

employed for classifying the leaf There are no samples left

46

Brief Review of Entropy Entropy (Information Theory)

A measure of uncertainty associated with a random number Calculation For a discrete random variable Y taking m distinct values y1 y2 hellip ym

Interpretation Higher entropy rarr higher uncertainty Lower entropy rarr lower uncertainty

Conditional entropy

m = 2

47

Attribute Selection Measure Information Gain (ID3C45)

Select the attribute with the highest information gain Let pi be the probability that an arbitrary tuple in D belongs to class Ci

estimated by |Ci D||D| Expected information (entropy) needed to classify a tuple in D

Information needed (after using A to split D into v partitions) to classify D

Information gained by branching on attribute A

)(log)( 21

i

m

ii ppDInfo sum

=

minus=

)(||||

)(1

j

v

j

jA DInfo

DD

DInfo times=sum=

(D)InfoInfo(D)Gain(A) Aminus=

48

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

How to select the first attribute

Sheet1

49

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

Sheet1

50

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

Sheet1

51

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

Sheet1

52

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

means ldquoage lt=30rdquo has 5 out of 14 samples with 2 yesrsquoes and 3 norsquos

)32(145 I

53

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Sheet1

54

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Similarly

0480)_(1510)(0290)(

===

ratingcreditGainstudentGainincomeGain How

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Page 15: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

15

Dimensionality Reduction

Curse of dimensionality When dimensionality increases data becomes increasingly sparse Density and distance between points which is critical to clustering outlier analysis

becomes less meaningful The possible combinations of subspaces will grow exponentially

16

Dimensionality Reduction

Curse of dimensionality When dimensionality increases data becomes increasingly sparse Density and distance between points which is critical to clustering outlier analysis

becomes less meaningful The possible combinations of subspaces will grow exponentially

Dimensionality reduction Reducing the number of random variables under consideration via obtaining a set

of principal variables

17

Dimensionality Reduction

Curse of dimensionality When dimensionality increases data becomes increasingly sparse Density and distance between points which is critical to clustering outlier analysis becomes less

meaningful The possible combinations of subspaces will grow exponentially

Dimensionality reduction Reducing the number of random variables under consideration via obtaining a set of principal

variables

Advantages of dimensionality reduction Avoid the curse of dimensionality Help eliminate irrelevant features and reduce noise Reduce time and space required in data mining Allow easier visualization

18

Dimensionality Reduction Techniques

Dimensionality reduction methodologies

Feature selection Find a subset of the original variables (or features attributes)

Feature extraction Transform the data in the high-dimensional space to a space of fewer dimensions

Some typical dimensionality reduction methods

Principal Component Analysis

Supervised and nonlinear techniques

Feature subset selection

Feature creation

19

PCA A statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components

The original data are projected onto a much smaller space resulting in dimensionality reduction

Method Find the eigenvectors of the covariance matrix and these eigenvectors define the new space

Ball travels in a straight line Data from three cameras contain much redundancy

Principal Component Analysis (PCA)

21

Principal Components Analysis Intuition

Goal is to find a projection that captures the largest amount of variation in data

Find the eigenvectors of the covariance matrix The eigenvectors define the new space

x2

x1

e

22

Principal Component Analysis Details

Let A be an n timesn matrix representing the correlation or covariance of the data λ is an eigenvalue of A if there exists a non-zero vector v such that

Av = λ v often rewritten as (A- λI)v=0

In this case vector v is called an eigenvector of A corresponding to λ For each eigenvalue λ the set of all vectors v satisfying Av = λ v is called the eigenspace of A corresponding to λ

23

Attribute Subset Selection

Another way to reduce dimensionality of data

Redundant attributes Duplicate much or all of the information contained in

one or more other attributes

Eg purchase price of a product and the amount of sales tax paid

Irrelevant attributes Contain no information that is useful for the data

mining task at hand

Ex A studentrsquos ID is often irrelevant to the task of predicting hisher GPA

24

Heuristic Search in Attribute Selection

There are 2d possible attribute combinations of d attributes Typical heuristic attribute selection methods

Best single attribute under the attribute independence assumption choose by significance tests

Best step-wise feature selection The best single-attribute is picked first Then next best attribute condition to the first

Step-wise attribute elimination Repeatedly eliminate the worst attribute

Best combined attribute selection and elimination Optimal branch and bound Use attribute elimination and backtracking

25

Attribute Creation (Feature Generation)

Create new attributes (features) that can capture the important information in a data set more effectively than the original ones

Three general methodologies Attribute extraction Domain-specific

Mapping data to new space (see data reduction) Eg Fourier transformation wavelet transformation manifold approaches (not covered)

Attribute construction Combining features (see discriminative frequent patterns in Chapter on ldquoAdvanced

Classificationrdquo) Data discretization

26

Summary

Data quality accuracy completeness consistency timeliness believability interpretability

Data cleaning eg missingnoisy values outliers

Data integration from multiple sources

Entity identification problem Remove redundancies Detect inconsistencies

Data reduction

Dimensionality reduction Numerosity reduction Data compression

Data transformation and data discretization

Normalization Concept hierarchy generation

27

D P Ballou and G K Tayi Enhancing data quality in data warehouse environments Comm of ACM 4273-78 1999

T Dasu and T Johnson Exploratory Data Mining and Data Cleaning John Wiley 2003 T Dasu T Johnson S Muthukrishnan V Shkapenyuk Mining Database Structure Or How to Build a Data

Quality Browser SIGMODrsquo02 H V Jagadish et al Special Issue on Data Reduction Techniques Bulletin of the Technical Committee on

Data Engineering 20(4) Dec 1997 D Pyle Data Preparation for Data Mining Morgan Kaufmann 1999 E Rahm and H H Do Data Cleaning Problems and Current Approaches IEEE Bulletin of the Technical

Committee on Data Engineering Vol23 No4 V Raman and J Hellerstein Potters Wheel An Interactive Framework for Data Cleaning and

Transformation VLDBrsquo2001 T Redman Data Quality Management and Technology Bantam Books 1992 R Wang V Storey and C Firth A framework for analysis of data quality research IEEE Trans

Knowledge and Data Engineering 7623-640 1995

References

CS 412 INTRO TO DATA MINING

Classification Basic Concepts Huan Sun CSEThe Ohio State University

09052017

28Slides adapted from UIUC CS412 Fall 2017 by Prof Jiawei Han

29

Classification Basic Concepts Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

30

Supervised vs Unsupervised Learning Supervised learning (classification)

Supervision The training data (observations measurements etc) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

31

Supervised vs Unsupervised Learning Supervised learning (classification)

Supervision The training data (observations measurements etc) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

Unsupervised learning (clustering)

The class labels of training data is unknown

Given a set of measurements observations etc with the aim of establishing the

existence of classes or clusters in the data

32

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

33

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

Typical applications

Creditloan approval

Medical diagnosis if a tumor is cancerous or benign

Fraud detection if a transaction is fraudulent

Web page categorization which category it is

34

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

35

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

36

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

Note If the test set is used to selectrefine models it is called validation (test) set or development test set

37

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

Classifier(Model)

Sheet1

38

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo

Classifier(Model)

Sheet1

40

Step (2): Using the Model in Prediction

Testing Data:

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

New/Unseen Data:

(Jeff, Professor, 4)

Tenured? The rule predicts 'yes', since Jeff's rank is 'professor'.
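As an illustration, the rule learned in Step (1) can be applied to the testing data and then to Jeff's record (plain Python; the function name is a hypothetical stand-in for the classifier):

```python
def predict_tenured(rank, years):
    """The classifier from Step (1): IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

testing = [("Tom", "Assistant Prof", 2, "no"),
           ("Merlisa", "Associate Prof", 7, "no"),
           ("George", "Professor", 5, "yes"),
           ("Joseph", "Assistant Prof", 7, "yes")]

# Step (2a): estimate accuracy on the independent test set.
hits = sum(predict_tenured(rank, yrs) == label for _, rank, yrs, label in testing)
print(f"accuracy: {hits}/{len(testing)}")   # 3/4 -- Merlisa (7 years) is misclassified

# Step (2b): if the accuracy is acceptable, classify new/unseen data.
print(predict_tenured("Professor", 4))      # Jeff -> 'yes'
```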

41

Classification: Basic Concepts

Classification: Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy: Ensemble Methods

Summary

42

Decision Tree Induction: An Example

Training data set: Buys_computer. The data set follows an example of Quinlan's ID3 (Playing Tennis).

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

Resulting tree:

age?
  <=30  -> student?
             no  -> no
             yes -> yes
  31…40 -> yes
  >40   -> credit_rating?
             excellent -> no
             fair      -> yes
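For illustration, the resulting tree can be encoded directly as a nested structure and used to classify a tuple; a minimal sketch in plain Python (names are illustrative):

```python
# The resulting tree from the slide, as a nested dict: an internal node maps
# an attribute name to its branches; a leaf is a class label ("yes"/"no").
tree = {"age": {
    "<=30":  {"student": {"no": "no", "yes": "yes"}},
    "31…40": "yes",
    ">40":   {"credit_rating": {"excellent": "no", "fair": "yes"}},
}}

def classify(node, tuple_):
    """Walk the tree from the root until a leaf label is reached."""
    while isinstance(node, dict):
        attr = next(iter(node))          # the attribute tested at this node
        node = node[attr][tuple_[attr]]  # follow the matching branch
    return node

print(classify(tree, {"age": "<=30", "income": "medium",
                      "student": "yes", "credit_rating": "fair"}))  # -> yes
```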

44

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down, recursive, divide-and-conquer manner

At start, all the training examples are at the root

Attributes are categorical (if continuous-valued, they are discretized in advance)

Examples are partitioned recursively based on selected attributes

Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Conditions for stopping partitioning

All samples for a given node belong to the same class

There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf

There are no samples left
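The basic algorithm above maps naturally to a short recursive routine. Here is a sketch in plain Python; the attribute-selection measure is passed in as a function, and all names are illustrative, not from the slides:

```python
from collections import Counter

def build_tree(examples, attributes, select_attribute, default="no"):
    """Top-down, recursive, divide-and-conquer tree construction.
    examples: list of (features_dict, label) pairs.
    select_attribute: heuristic measure (e.g., information gain) that
    picks the next test attribute."""
    if not examples:                       # no samples left
        return default
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:              # all samples in the same class
        return labels[0]
    if not attributes:                     # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]

    best = select_attribute(examples, attributes)
    majority = Counter(labels).most_common(1)[0][0]
    node = {best: {}}
    for value in {feats[best] for feats, _ in examples}:
        subset = [(f, l) for f, l in examples if f[best] == value]
        rest = [a for a in attributes if a != best]
        node[best][value] = build_tree(subset, rest, select_attribute, majority)
    return node
```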

46

Brief Review of Entropy

Entropy (Information Theory)

A measure of uncertainty associated with a random variable

Calculation: for a discrete random variable Y taking m distinct values {y_1, y_2, …, y_m} with p_i = P(Y = y_i):

H(Y) = -\sum_{i=1}^{m} p_i \log_2 p_i

Interpretation: Higher entropy → higher uncertainty; lower entropy → lower uncertainty

Conditional entropy: H(Y|X) = \sum_{x} P(X = x) \, H(Y|X = x)

(Figure: the binary entropy curve for m = 2, maximal at p = 0.5)
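A two-line Python version of this calculation, for illustration:

```python
import math

def entropy(probs):
    """H(Y) = -sum_i p_i log2 p_i, with the convention 0 * log 0 = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit: maximum uncertainty for m = 2
print(entropy([0.9, 0.1]))  # ~0.469: lower uncertainty
```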

47

Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain

Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|

Expected information (entropy) needed to classify a tuple in D:

Info(D) = -\sum_{i=1}^{m} p_i \log_2 p_i

Information needed (after using A to split D into v partitions) to classify D:

Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)

Information gained by branching on attribute A:

Gain(A) = Info(D) - Info_A(D)

48

Attribute Selection: Information Gain

Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples). Training data: the Buys_computer table above. How to select the first attribute?

Info(D) = I(9, 5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940

Look at "age":

age     p_i   n_i   I(p_i, n_i)
<=30    2     3     0.971
31…40   4     0     0
>40     3     2     0.971

Info_age(D) = \frac{5}{14} I(2, 3) + \frac{4}{14} I(4, 0) + \frac{5}{14} I(3, 2) = 0.694

Here \frac{5}{14} I(2, 3) means "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's.

Gain(age) = Info(D) - Info_age(D) = 0.246

Similarly,

Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

age has the highest information gain, so it is selected as the first splitting attribute.
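These numbers can be checked with a short script; a sketch in plain Python over the table above (function names are illustrative):

```python
import math
from collections import Counter

def info(labels):
    """Expected information: Info(D) = -sum_i p_i log2 p_i."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

# Rows: (age, income, student, credit_rating, buys_computer), from the table above.
data = [("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
        ("31…40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
        (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
        ("31…40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
        ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
        ("<=30", "medium", "yes", "excellent", "yes"), ("31…40", "medium", "no", "excellent", "yes"),
        ("31…40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no")]

attrs = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}
labels = [row[-1] for row in data]
print(f"Info(D) = {info(labels):.3f}")   # 0.940

def gain(attr):
    """Gain(A) = Info(D) - Info_A(D)."""
    idx = attrs[attr]
    info_a = sum(len(part) / len(data) * info(part)
                 for part in ([row[-1] for row in data if row[idx] == v]
                              for v in {row[idx] for row in data}))
    return info(labels) - info_a

for a in attrs:
    print(f"Gain({a}) = {gain(a):.3f}")
# Gain(age) = 0.247, Gain(income) = 0.029, Gain(student) = 0.152,
# Gain(credit_rating) = 0.048 (the slides truncate 0.2467 to 0.246 and 0.1518 to 0.151)
```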

gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Page 18: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

18

Dimensionality Reduction Techniques

Dimensionality reduction methodologies

Feature selection Find a subset of the original variables (or features attributes)

Feature extraction Transform the data in the high-dimensional space to a space of fewer dimensions

Some typical dimensionality reduction methods

Principal Component Analysis

Supervised and nonlinear techniques

Feature subset selection

Feature creation

19

Principal Component Analysis (PCA)

PCA: a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

The original data are projected onto a much smaller space, resulting in dimensionality reduction.

Method: find the eigenvectors of the covariance matrix; these eigenvectors define the new space.

(Figure: a ball travels in a straight line; data from three cameras recording it contain much redundancy.)

21

Principal Components Analysis Intuition

Goal is to find a projection that captures the largest amount of variation in data

Find the eigenvectors of the covariance matrix The eigenvectors define the new space

(Figure: data points in the x1-x2 plane, with the first eigenvector e pointing along the direction of greatest variation.)

22

Principal Component Analysis Details

Let A be an n × n matrix representing the correlation or covariance of the data. λ is an eigenvalue of A if there exists a non-zero vector v such that

Av = λv, often rewritten as (A − λI)v = 0

In this case, vector v is called an eigenvector of A corresponding to λ. For each eigenvalue λ, the set of all vectors v satisfying Av = λv is called the eigenspace of A corresponding to λ.
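
To make the procedure concrete, here is a minimal NumPy sketch (not part of the original slides): center the data, eigendecompose the covariance matrix, and project onto the top-k eigenvectors. The function name pca and the toy data are illustrative assumptions.

import numpy as np

def pca(X, k):
    """Project the rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)                 # center each attribute
    cov = np.cov(Xc, rowvar=False)          # covariance matrix A of the data
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: cov is symmetric
    order = np.argsort(eigvals)[::-1]       # largest eigenvalues first
    W = eigvecs[:, order[:k]]               # eigenvectors defining the new space
    return Xc @ W                           # projected (reduced) data

# Toy data: three correlated attributes, reduced to two dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1)) @ np.array([[1.0, 0.8, 0.5]]) \
    + 0.1 * rng.normal(size=(100, 3))
print(pca(X, k=2).shape)  # (100, 2)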

23

Attribute Subset Selection

Another way to reduce dimensionality of data

Redundant attributes: duplicate much or all of the information contained in one or more other attributes. E.g., the purchase price of a product and the amount of sales tax paid.

Irrelevant attributes: contain no information that is useful for the data mining task at hand. E.g., a student's ID is often irrelevant to the task of predicting his/her GPA.

24

Heuristic Search in Attribute Selection

There are 2^d possible attribute combinations of d attributes. Typical heuristic attribute selection methods:

Best single attribute under the attribute independence assumption: choose by significance tests

Best step-wise feature selection: the best single attribute is picked first; then the next best attribute conditioned on the first, and so on (see the greedy sketch below)

Step-wise attribute elimination: repeatedly eliminate the worst attribute

Best combined attribute selection and elimination; optimal branch and bound: use attribute elimination and backtracking
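
A minimal sketch of best step-wise (greedy forward) selection. The function name and signature are illustrative, not from any particular library; score(attrs) is any caller-supplied evaluation of an attribute subset (higher is better), e.g. cross-validated accuracy of a classifier trained on those attributes.

def forward_select(all_attrs, score, max_attrs=None):
    """Greedy best step-wise feature selection (a sketch)."""
    selected, remaining = [], list(all_attrs)
    best = float("-inf")
    while remaining and (max_attrs is None or len(selected) < max_attrs):
        # Try adding each remaining attribute to the current subset.
        attr, new_score = max(((a, score(selected + [a])) for a in remaining),
                              key=lambda pair: pair[1])
        if new_score <= best:      # no improvement: stop early
            break
        selected.append(attr)
        remaining.remove(attr)
        best = new_score
    return selected

This evaluates on the order of d^2 subsets instead of all 2^d combinations; step-wise elimination is the mirror image, starting from all attributes and repeatedly dropping the worst one.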

25

Attribute Creation (Feature Generation)

Create new attributes (features) that can capture the important information in a data set more effectively than the original ones

Three general methodologies:

Attribute extraction: domain-specific

Mapping data to new space (see data reduction): e.g., Fourier transformation, wavelet transformation, manifold approaches (not covered)

Attribute construction: combining features (see discriminative frequent patterns in the chapter on "Advanced Classification"); data discretization

26

Summary

Data quality accuracy completeness consistency timeliness believability interpretability

Data cleaning eg missingnoisy values outliers

Data integration from multiple sources

Entity identification problem; remove redundancies; detect inconsistencies

Data reduction

Dimensionality reduction; numerosity reduction; data compression

Data transformation and data discretization

Normalization; concept hierarchy generation

27

References

D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Comm. of ACM, 42:73-78, 1999.
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003.
T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining database structure; or, how to build a data quality browser. SIGMOD'02.
H. V. Jagadish et al. Special issue on data reduction techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), Dec. 1997.
D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.
E. Rahm and H. H. Do. Data cleaning: problems and current approaches. IEEE Bulletin of the Technical Committee on Data Engineering, 23(4).
V. Raman and J. Hellerstein. Potter's Wheel: an interactive framework for data cleaning and transformation. VLDB'2001.
T. Redman. Data Quality: Management and Technology. Bantam Books, 1992.
R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans. Knowledge and Data Engineering, 7:623-640, 1995.

CS 412 INTRO TO DATA MINING

Classification: Basic Concepts
Huan Sun, CSE@The Ohio State University
09/05/2017

28

Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han

29

Classification: Basic Concepts

Classification: Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

30

Supervised vs. Unsupervised Learning

Supervised learning (classification)

Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.

New data is classified based on the training set.

31

Supervised vs. Unsupervised Learning

Supervised learning (classification)

Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.

New data is classified based on the training set.

Unsupervised learning (clustering)

The class labels of the training data are unknown.

Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.

32

Prediction Problems: Classification vs. Numeric Prediction

Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

33

Prediction Problems: Classification vs. Numeric Prediction

Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

Typical applications

Creditloan approval

Medical diagnosis if a tumor is cancerous or benign

Fraud detection if a transaction is fraudulent

Web page categorization which category it is

34

Classification—A Two-Step Process

(1) Model construction: describing a set of predetermined classes

Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute

The set of tuples used for model construction is the training set. The model is represented as classification rules, decision trees, or mathematical formulae

35

Classification—A Two-Step Process

(1) Model construction: describing a set of predetermined classes

Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute

The set of tuples used for model construction is the training set. The model is represented as classification rules, decision trees, or mathematical formulae

(2) Model usage: for classifying future or unknown objects

Estimate the accuracy of the model: the known label of each test sample is compared with the model's prediction
Accuracy: % of test-set samples that are correctly classified by the model
The test set is independent of the training set (otherwise overfitting)

If the accuracy is acceptable, use the model to classify new data

36

Classification—A Two-Step Process

(1) Model construction: describing a set of predetermined classes

Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute

The set of tuples used for model construction is the training set. The model is represented as classification rules, decision trees, or mathematical formulae

(2) Model usage: for classifying future or unknown objects

Estimate the accuracy of the model: the known label of each test sample is compared with the model's prediction
Accuracy: % of test-set samples that are correctly classified by the model
The test set is independent of the training set (otherwise overfitting)

If the accuracy is acceptable, use the model to classify new data

Note: If the test set is used to select/refine models, it is called a validation (test) set or development test set

37

Step (1) Model Construction

Training Data:

NAME    RANK             YEARS   TENURED
Mike    Assistant Prof   3       no
Mary    Assistant Prof   7       yes
Bill    Professor        2       yes
Jim     Associate Prof   7       yes
Dave    Assistant Prof   6       no
Anne    Associate Prof   3       no

Classification Algorithm → Classifier (Model)

38

Step (1) Model Construction

Training Data:

NAME    RANK             YEARS   TENURED
Mike    Assistant Prof   3       no
Mary    Assistant Prof   7       yes
Bill    Professor        2       yes
Jim     Associate Prof   7       yes
Dave    Assistant Prof   6       no
Anne    Associate Prof   3       no

Classification Algorithm → Classifier (Model):

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'

39

Step (2) Using the Model in Prediction

Classifier

Testing Data:

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

40

Step (2) Using the Model in Prediction

Classifier

Testing Data:

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

New / Unseen Data: (Jeff, Professor, 4) → Tenured?
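
To make the two steps concrete, here is a minimal sketch using the tenure tables above, assuming scikit-learn is available. Its DecisionTreeClassifier stands in for the classification algorithm, and encoding ranks as integers is our own assumption, not part of the slides.

from sklearn.tree import DecisionTreeClassifier

RANKS = {"Assistant Prof": 0, "Associate Prof": 1, "Professor": 2}

train = [("Mike", "Assistant Prof", 3, "no"),
         ("Mary", "Assistant Prof", 7, "yes"),
         ("Bill", "Professor", 2, "yes"),
         ("Jim", "Associate Prof", 7, "yes"),
         ("Dave", "Assistant Prof", 6, "no"),
         ("Anne", "Associate Prof", 3, "no")]
test = [("Tom", "Assistant Prof", 2, "no"),
        ("Merlisa", "Associate Prof", 7, "no"),
        ("George", "Professor", 5, "yes"),
        ("Joseph", "Assistant Prof", 7, "yes")]

def encode(rows):
    X = [[RANKS[rank], years] for _, rank, years, _ in rows]
    y = [label for *_, label in rows]
    return X, y

X_train, y_train = encode(train)
X_test, y_test = encode(test)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # step (1)
print("test accuracy:", model.score(X_test, y_test))                  # step (2)
print("Jeff:", model.predict([[RANKS["Professor"], 4]])[0])           # new data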


41

Classification: Basic Concepts

Classification: Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

42

Decision Tree Induction: An Example

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no

Training data set: Buys_computer. The data set follows an example of Quinlan's ID3 (Playing Tennis).

43

Decision Tree Induction: An Example

Training data set: Buys_computer (as above). Resulting tree:

age?
|-- <=30   -> student?
|             |-- no  -> no
|             `-- yes -> yes
|-- 31…40  -> yes
`-- >40    -> credit_rating?
              |-- excellent -> no
              `-- fair      -> yes

44

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm):
Tree is constructed in a top-down, recursive, divide-and-conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized in advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

45

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm; a runnable sketch follows this list):
Tree is constructed in a top-down, recursive, divide-and-conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized in advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Conditions for stopping partitioning:
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning—majority voting is employed for classifying the leaf
There are no samples left
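
A compact sketch of this basic algorithm, under our own representation assumptions: tuples are dicts over categorical attributes, and select stands in for the attribute-selection measure introduced on the following slides. The returned tree is a nested dict.

from collections import Counter

def induce_tree(rows, attrs, label, select):
    """Top-down, recursive, divide-and-conquer induction (a sketch).

    rows:   list of dicts (categorical attribute -> value, plus the label)
    attrs:  attributes still available for testing
    select: heuristic picking the test attribute, e.g. information gain
    Returns a class label (leaf) or a nested dict {attr: {value: subtree}}.
    """
    labels = [r[label] for r in rows]
    if len(set(labels)) == 1:                         # all in the same class
        return labels[0]
    if not attrs:                                     # no attributes remain:
        return Counter(labels).most_common(1)[0][0]   # majority voting
    best = select(rows, attrs)
    tree = {best: {}}
    for value in {r[best] for r in rows}:             # partition on test attr
        subset = [r for r in rows if r[best] == value]
        rest = [a for a in attrs if a != best]
        # Branches for unseen values (the "no samples left" case) are omitted.
        tree[best][value] = induce_tree(subset, rest, label, select)
    return tree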

46

Brief Review of Entropy

Entropy (Information Theory):
A measure of uncertainty associated with a random variable
Calculation: for a discrete random variable Y taking m distinct values {y1, y2, …, ym}, with p_i = P(Y = y_i),

H(Y) = -\sum_{i=1}^{m} p_i \log_2(p_i)

Interpretation: higher entropy → higher uncertainty; lower entropy → lower uncertainty

Conditional entropy: H(Y|X) = \sum_{x} p(x) H(Y|X = x)

(Figure: entropy of a binary variable, m = 2, as a function of p; it peaks at 1 bit when both outcomes are equally likely.)
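
A quick numeric illustration of these interpretations (a sketch; the entropy helper is our own):

from collections import Counter
from math import log2

def entropy(values):
    """H(Y) = -sum_i p_i * log2(p_i), over the observed value frequencies."""
    n = len(values)
    return 0.0 - sum((c / n) * log2(c / n) for c in Counter(values).values())

print(entropy(["yes"] * 5 + ["no"] * 5))   # 1.0    -- maximum uncertainty (m = 2)
print(entropy(["yes"] * 9 + ["no"] * 5))   # ~0.940 -- the Info(D) value used below
print(entropy(["yes"] * 14))               # 0.0    -- no uncertainty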

47

Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain. Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|.

Expected information (entropy) needed to classify a tuple in D:

Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

Information needed (after using A to split D into v partitions) to classify D:

Info_A(D) = \sum_{j=1}^{v} (|D_j| / |D|) × Info(D_j)

Information gained by branching on attribute A:

Gain(A) = Info(D) - Info_A(D)
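
These three formulas translate directly into code. Below is a minimal sketch, assuming each tuple is represented as a Python dict; the helper names info, info_after_split, and gain are illustrative.

from collections import Counter
from math import log2

def info(labels):
    """Info(D): expected information (entropy) of a list of class labels."""
    n = len(labels)
    return 0.0 - sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_after_split(rows, attr, label):
    """Info_A(D): weighted entropy after splitting `rows` on `attr`."""
    n = len(rows)
    sizes = Counter(r[attr] for r in rows)            # |D_j| for each value of A
    return sum((c / n) * info([r[label] for r in rows if r[attr] == v])
               for v, c in sizes.items())

def gain(rows, attr, label):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info([r[label] for r in rows]) - info_after_split(rows, attr, label)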

48

Attribute Selection: Information Gain

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

(buys_computer training data as above)

How to select the first attribute?

49

Attribute Selection: Information Gain

Class P: buys_computer = "yes" (9 tuples)
Class N: buys_computer = "no" (5 tuples)

Info(D) = I(9,5) = -(9/14) \log_2(9/14) - (5/14) \log_2(5/14) = 0.940

50

Attribute Selection: Information Gain

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

Info(D) = I(9,5) = -(9/14) \log_2(9/14) - (5/14) \log_2(5/14) = 0.940

Look at "age":

age     p_i   n_i   I(p_i, n_i)
<=30    2     3     0.971
31…40   4     0     0
>40     3     2     0.971

51

Attribute Selection: Information Gain

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

Info(D) = I(9,5) = 0.940

Look at "age":

age     p_i   n_i   I(p_i, n_i)
<=30    2     3     0.971
31…40   4     0     0
>40     3     2     0.971

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

52

Attribute Selection: Information Gain

age     p_i   n_i   I(p_i, n_i)
<=30    2     3     0.971
31…40   4     0     0
>40     3     2     0.971

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Here (5/14) I(2,3) means that "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's.

53

Attribute Selection: Information Gain

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

Info(D) = I(9,5) = 0.940
Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Gain(age) = Info(D) - Info_age(D) = 0.246

54

Attribute Selection: Information Gain

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

Info(D) = 0.940
Info_age(D) = 0.694
Gain(age) = Info(D) - Info_age(D) = 0.246

Similarly,

Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

Since age yields the highest information gain, it is selected as the first (root) splitting attribute.
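
As a check, the sketch below recomputes all four gains from the buys_computer table (a self-contained illustration; the helper names are our own, and values are rounded to three decimals):

from collections import Counter
from math import log2

DATA = [  # (age, income, student, credit_rating, buys_computer), 14 tuples
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
COLS = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def info(labels):
    n = len(labels)
    return 0.0 - sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(attr):
    i, n = COLS[attr], len(DATA)
    info_a = sum((c / n) * info([row[-1] for row in DATA if row[i] == v])
                 for v, c in Counter(row[i] for row in DATA).items())
    return info([row[-1] for row in DATA]) - info_a

for attr in COLS:
    print(f"Gain({attr}) = {gain(attr):.3f}")
# Gain(age) = 0.246, Gain(income) = 0.029,
# Gain(student) = 0.151, Gain(credit_rating) = 0.048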

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Page 19: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

19

PCA A statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components

The original data are projected onto a much smaller space resulting in dimensionality reduction

Method Find the eigenvectors of the covariance matrix and these eigenvectors define the new space

Ball travels in a straight line Data from three cameras contain much redundancy

Principal Component Analysis (PCA)

21

Principal Components Analysis Intuition

Goal is to find a projection that captures the largest amount of variation in data

Find the eigenvectors of the covariance matrix The eigenvectors define the new space

x2

x1

e

22

Principal Component Analysis Details

Let A be an n timesn matrix representing the correlation or covariance of the data λ is an eigenvalue of A if there exists a non-zero vector v such that

Av = λ v often rewritten as (A- λI)v=0

In this case vector v is called an eigenvector of A corresponding to λ For each eigenvalue λ the set of all vectors v satisfying Av = λ v is called the eigenspace of A corresponding to λ

23

Attribute Subset Selection

Another way to reduce dimensionality of data

Redundant attributes Duplicate much or all of the information contained in

one or more other attributes

Eg purchase price of a product and the amount of sales tax paid

Irrelevant attributes Contain no information that is useful for the data

mining task at hand

Ex A studentrsquos ID is often irrelevant to the task of predicting hisher GPA

24

Heuristic Search in Attribute Selection

There are 2d possible attribute combinations of d attributes Typical heuristic attribute selection methods

Best single attribute under the attribute independence assumption choose by significance tests

Best step-wise feature selection The best single-attribute is picked first Then next best attribute condition to the first

Step-wise attribute elimination Repeatedly eliminate the worst attribute

Best combined attribute selection and elimination Optimal branch and bound Use attribute elimination and backtracking

25

Attribute Creation (Feature Generation)

Create new attributes (features) that can capture the important information in a data set more effectively than the original ones

Three general methodologies Attribute extraction Domain-specific

Mapping data to new space (see data reduction) Eg Fourier transformation wavelet transformation manifold approaches (not covered)

Attribute construction Combining features (see discriminative frequent patterns in Chapter on ldquoAdvanced

Classificationrdquo) Data discretization

26

Summary

Data quality accuracy completeness consistency timeliness believability interpretability

Data cleaning eg missingnoisy values outliers

Data integration from multiple sources

Entity identification problem Remove redundancies Detect inconsistencies

Data reduction

Dimensionality reduction Numerosity reduction Data compression

Data transformation and data discretization

Normalization Concept hierarchy generation

27

D P Ballou and G K Tayi Enhancing data quality in data warehouse environments Comm of ACM 4273-78 1999

T Dasu and T Johnson Exploratory Data Mining and Data Cleaning John Wiley 2003 T Dasu T Johnson S Muthukrishnan V Shkapenyuk Mining Database Structure Or How to Build a Data

Quality Browser SIGMODrsquo02 H V Jagadish et al Special Issue on Data Reduction Techniques Bulletin of the Technical Committee on

Data Engineering 20(4) Dec 1997 D Pyle Data Preparation for Data Mining Morgan Kaufmann 1999 E Rahm and H H Do Data Cleaning Problems and Current Approaches IEEE Bulletin of the Technical

Committee on Data Engineering Vol23 No4 V Raman and J Hellerstein Potters Wheel An Interactive Framework for Data Cleaning and

Transformation VLDBrsquo2001 T Redman Data Quality Management and Technology Bantam Books 1992 R Wang V Storey and C Firth A framework for analysis of data quality research IEEE Trans

Knowledge and Data Engineering 7623-640 1995

References

CS 412 INTRO TO DATA MINING

Classification Basic Concepts Huan Sun CSEThe Ohio State University

09052017

28Slides adapted from UIUC CS412 Fall 2017 by Prof Jiawei Han

29

Classification Basic Concepts Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

30

Supervised vs Unsupervised Learning Supervised learning (classification)

Supervision The training data (observations measurements etc) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

31

Supervised vs Unsupervised Learning Supervised learning (classification)

Supervision The training data (observations measurements etc) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

Unsupervised learning (clustering)

The class labels of training data is unknown

Given a set of measurements observations etc with the aim of establishing the

existence of classes or clusters in the data

32

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

33

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

Typical applications

Creditloan approval

Medical diagnosis if a tumor is cancerous or benign

Fraud detection if a transaction is fraudulent

Web page categorization which category it is

34

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

35

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

36

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

Note If the test set is used to selectrefine models it is called validation (test) set or development test set

37

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

Classifier(Model)

Sheet1

38

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo

Classifier(Model)

Sheet1

39

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Sheet1

40

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

NewUnseen Data

(Jeff Professor 4)

Tenured

Sheet1

41

Classification Basic Concepts

Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

42

Decision Tree Induction An Example

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis)

Sheet1

43

Decision Tree Induction An Example

age

overcast

student credit rating

lt=30 gt40

no yes yes

yes

3140

fairexcellentyesno

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis) Resulting tree

Sheet1

44

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain)

45

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain) Conditions for stopping partitioning

All samples for a given node belong to the same class There are no remaining attributes for further partitioningmdashmajority voting is

employed for classifying the leaf There are no samples left

46

Brief Review of Entropy Entropy (Information Theory)

A measure of uncertainty associated with a random number Calculation For a discrete random variable Y taking m distinct values y1 y2 hellip ym

Interpretation Higher entropy rarr higher uncertainty Lower entropy rarr lower uncertainty

Conditional entropy

m = 2

47

Attribute Selection Measure Information Gain (ID3C45)

Select the attribute with the highest information gain Let pi be the probability that an arbitrary tuple in D belongs to class Ci

estimated by |Ci D||D| Expected information (entropy) needed to classify a tuple in D

Information needed (after using A to split D into v partitions) to classify D

Information gained by branching on attribute A

)(log)( 21

i

m

ii ppDInfo sum

=

minus=

)(||||

)(1

j

v

j

jA DInfo

DD

DInfo times=sum=

(D)InfoInfo(D)Gain(A) Aminus=

48

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

How to select the first attribute

Sheet1

49

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

Sheet1

50

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

Sheet1

51

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

Sheet1

52

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

means ldquoage lt=30rdquo has 5 out of 14 samples with 2 yesrsquoes and 3 norsquos

)32(145 I

53

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Sheet1

54

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Similarly

0480)_(1510)(0290)(

===

ratingcreditGainstudentGainincomeGain How

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Page 20: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

21

Principal Components Analysis Intuition

Goal is to find a projection that captures the largest amount of variation in data

Find the eigenvectors of the covariance matrix The eigenvectors define the new space

x2

x1

e

22

Principal Component Analysis Details

Let A be an n timesn matrix representing the correlation or covariance of the data λ is an eigenvalue of A if there exists a non-zero vector v such that

Av = λ v often rewritten as (A- λI)v=0

In this case vector v is called an eigenvector of A corresponding to λ For each eigenvalue λ the set of all vectors v satisfying Av = λ v is called the eigenspace of A corresponding to λ

23

Attribute Subset Selection

Another way to reduce dimensionality of data

Redundant attributes Duplicate much or all of the information contained in

one or more other attributes

Eg purchase price of a product and the amount of sales tax paid

Irrelevant attributes Contain no information that is useful for the data

mining task at hand

Ex A studentrsquos ID is often irrelevant to the task of predicting hisher GPA

24

Heuristic Search in Attribute Selection

There are 2d possible attribute combinations of d attributes Typical heuristic attribute selection methods

Best single attribute under the attribute independence assumption choose by significance tests

Best step-wise feature selection The best single-attribute is picked first Then next best attribute condition to the first

Step-wise attribute elimination Repeatedly eliminate the worst attribute

Best combined attribute selection and elimination Optimal branch and bound Use attribute elimination and backtracking

25

Attribute Creation (Feature Generation)

Create new attributes (features) that can capture the important information in a data set more effectively than the original ones

Three general methodologies Attribute extraction Domain-specific

Mapping data to new space (see data reduction) Eg Fourier transformation wavelet transformation manifold approaches (not covered)

Attribute construction Combining features (see discriminative frequent patterns in Chapter on ldquoAdvanced

Classificationrdquo) Data discretization

26

Summary

Data quality accuracy completeness consistency timeliness believability interpretability

Data cleaning eg missingnoisy values outliers

Data integration from multiple sources

Entity identification problem Remove redundancies Detect inconsistencies

Data reduction

Dimensionality reduction Numerosity reduction Data compression

Data transformation and data discretization

Normalization Concept hierarchy generation

27

D P Ballou and G K Tayi Enhancing data quality in data warehouse environments Comm of ACM 4273-78 1999

T Dasu and T Johnson Exploratory Data Mining and Data Cleaning John Wiley 2003 T Dasu T Johnson S Muthukrishnan V Shkapenyuk Mining Database Structure Or How to Build a Data

Quality Browser SIGMODrsquo02 H V Jagadish et al Special Issue on Data Reduction Techniques Bulletin of the Technical Committee on

Data Engineering 20(4) Dec 1997 D Pyle Data Preparation for Data Mining Morgan Kaufmann 1999 E Rahm and H H Do Data Cleaning Problems and Current Approaches IEEE Bulletin of the Technical

Committee on Data Engineering Vol23 No4 V Raman and J Hellerstein Potters Wheel An Interactive Framework for Data Cleaning and

Transformation VLDBrsquo2001 T Redman Data Quality Management and Technology Bantam Books 1992 R Wang V Storey and C Firth A framework for analysis of data quality research IEEE Trans

Knowledge and Data Engineering 7623-640 1995

References

CS 412 INTRO TO DATA MINING

Classification Basic Concepts Huan Sun CSEThe Ohio State University

09052017

28Slides adapted from UIUC CS412 Fall 2017 by Prof Jiawei Han

29

Classification Basic Concepts Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

30

Supervised vs. Unsupervised Learning

Supervised learning (classification):

Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.

New data is classified based on the training set

31

Supervised vs. Unsupervised Learning

Supervised learning (classification):

Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.

New data is classified based on the training set

Unsupervised learning (clustering)

The class labels of training data are unknown.

Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.

32

Prediction Problems: Classification vs. Numeric Prediction

Classification:

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data

Numeric Prediction:

models continuous-valued functions, i.e., predicts unknown or missing values

33

Prediction Problems: Classification vs. Numeric Prediction

Classification:

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data

Numeric Prediction:

models continuous-valued functions, i.e., predicts unknown or missing values

Typical applications:

Credit/loan approval

Medical diagnosis: if a tumor is cancerous or benign

Fraud detection: if a transaction is fraudulent

Web page categorization: which category it is

34

Classification: A Two-Step Process

(1) Model construction: describing a set of predetermined classes

Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.

The set of tuples used for model construction is the training set. The model is represented as classification rules, decision trees, or mathematical formulae.

35

Classification: A Two-Step Process

(1) Model construction: describing a set of predetermined classes

Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.

The set of tuples used for model construction is the training set. The model is represented as classification rules, decision trees, or mathematical formulae.

(2) Model usage: classifying future or unknown objects

Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model. Accuracy is the percentage of test set samples that are correctly classified by the model. The test set is independent of the training set (otherwise overfitting occurs).

If the accuracy is acceptable, use the model to classify new data.

36

Classification: A Two-Step Process

(1) Model construction: describing a set of predetermined classes

Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.

The set of tuples used for model construction is the training set. The model is represented as classification rules, decision trees, or mathematical formulae.

(2) Model usage: classifying future or unknown objects

Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model. Accuracy is the percentage of test set samples that are correctly classified by the model. The test set is independent of the training set (otherwise overfitting occurs).

If the accuracy is acceptable, use the model to classify new data.

Note: if the test set is used to select/refine models, it is called a validation (test) set or development test set.
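The two-step process maps directly onto common library workflows; a minimal sketch, assuming scikit-learn is available (the Iris data merely stands in for any labeled training set):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Keep the test set independent of the training set (otherwise overfitting)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Step (1): model construction on the training set
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step (2): model usage -- estimate accuracy on the held-out test set
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.3f}")

# If the accuracy is acceptable, classify new/unseen data
print(model.predict(X_test[:1]))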

37

Step (1) Model Construction

Training Data

NAME     RANK            YEARS   TENURED
Mike     Assistant Prof  3       no
Mary     Assistant Prof  7       yes
Bill     Professor       2       yes
Jim      Associate Prof  7       yes
Dave     Assistant Prof  6       no
Anne     Associate Prof  3       no

Classification Algorithms

Classifier (Model)

38

Step (1) Model Construction

Training Data

(same NAME / RANK / YEARS / TENURED table as above)

Classification Algorithms

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'

Classifier (Model)

39

Step (2) Using the Model in Prediction

Classifier

Testing Data

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

40

Step (2) Using the Model in Prediction

Classifier

Testing Data

(same NAME / RANK / YEARS / TENURED table as above)

New / Unseen Data

(Jeff, Professor, 4)

Tenured?
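Applying the induced rule to the unseen tuple is mechanical; a tiny sketch (the function name is illustrative, the rule is the one learned in Step (1)):

def predict_tenured(rank, years):
    # IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

print(predict_tenured("Professor", 4))  # Jeff -> 'yes'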

41

Classification: Basic Concepts

Classification: Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy: Ensemble Methods

Summary

42

Decision Tree Induction An Example

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no

Training data set: Buys_computer. The data set follows an example of Quinlan's ID3 (Playing Tennis).

43

Decision Tree Induction An Example

Resulting tree:

age?
  <=30  → student?
            no  → no
            yes → yes
  31…40 → yes
  >40   → credit_rating?
            excellent → no
            fair      → yes

(training data: Buys_computer table as above; follows Quinlan's ID3 "Playing Tennis" example)

44

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm):

Tree is constructed in a top-down recursive divide-and-conquer manner.

At start, all the training examples are at the root.

Attributes are categorical (if continuous-valued, they are discretized in advance).

Examples are partitioned recursively based on selected attributes.

Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).

45

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm):

Tree is constructed in a top-down recursive divide-and-conquer manner.

At start, all the training examples are at the root.

Attributes are categorical (if continuous-valued, they are discretized in advance).

Examples are partitioned recursively based on selected attributes.

Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).

Conditions for stopping partitioning:

All samples for a given node belong to the same class.

There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf.

There are no samples left.
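A minimal sketch of that recursive procedure, assuming categorical attributes stored as dicts and a caller-supplied best_attribute(rows, attrs) that applies a measure such as information gain (all names are illustrative):

from collections import Counter

def majority_class(rows):
    return Counter(r["class"] for r in rows).most_common(1)[0][0]

def build_tree(rows, attrs, best_attribute):
    classes = {r["class"] for r in rows}
    if len(classes) == 1:              # stop: all samples in the same class
        return classes.pop()
    if not attrs:                      # stop: no attributes left -> majority vote
        return majority_class(rows)
    a = best_attribute(rows, attrs)    # e.g., highest information gain
    node = {"attribute": a, "branches": {}}
    for value in {r[a] for r in rows}: # partition on each observed value of a
        subset = [r for r in rows if r[a] == value]
        node["branches"][value] = build_tree(
            subset, [x for x in attrs if x != a], best_attribute)
    return node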

46

Brief Review of Entropy

Entropy (Information Theory):

A measure of uncertainty associated with a random variable.

Calculation: for a discrete random variable Y taking m distinct values {y_1, y_2, …, y_m} with probabilities {p_1, …, p_m}:

H(Y) = -\sum_{i=1}^{m} p_i \log_2(p_i)

Interpretation: higher entropy → higher uncertainty; lower entropy → lower uncertainty.

Conditional entropy: H(Y|X) = \sum_{x} p(x) H(Y|X=x)

(The slide's figure plots the binary case, m = 2, where entropy is maximized at p = 0.5.)
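The calculation is a one-liner in code; a short sketch for checking intuitions (probabilities are supplied directly):

import math

def entropy(probs):
    """H(Y) = -sum_i p_i * log2(p_i); zero-probability values contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # m = 2, maximal uncertainty -> 1.0 bit
print(entropy([0.9, 0.1]))  # lower uncertainty -> ~0.469 bits
print(entropy([1.0]))       # no uncertainty -> 0.0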

47

Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain.

Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|.

Expected information (entropy) needed to classify a tuple in D:

Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

Information needed (after using A to split D into v partitions) to classify D:

Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)

Information gained by branching on attribute A:

Gain(A) = Info(D) - Info_A(D)
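These three formulas translate directly into code; a minimal sketch (rows are dicts of attribute values, labels is the parallel list of class labels; the helper names are illustrative):

import math
from collections import Counter

def info(labels):
    """Info(D): entropy of the class distribution of D."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_a(rows, labels, attr):
    """Info_A(D): weighted entropy after splitting D on attribute attr."""
    n = len(labels)
    total = 0.0
    for value in {r[attr] for r in rows}:
        part = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        total += (len(part) / n) * info(part)
    return total

def gain(rows, labels, attr):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(labels) - info_a(rows, labels, attr)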

48

Attribute Selection: Information Gain

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

(training data: Buys_computer table as above)

How to select the first attribute?

49

Attribute Selection: Information Gain

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

(training data: Buys_computer table as above)

Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940

50

Attribute Selection: Information Gain

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

(training data: Buys_computer table as above)

Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940

Look at "age":

age      p_i   n_i   I(p_i, n_i)
<=30     2     3     0.971
31…40    4     0     0
>40      3     2     0.971

51

Attribute Selection: Information Gain

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

(training data: Buys_computer table as above)

Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940

Look at "age":

age      p_i   n_i   I(p_i, n_i)
<=30     2     3     0.971
31…40    4     0     0
>40      3     2     0.971

Info_age(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

52

Attribute Selection: Information Gain

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

Look at "age":

age      p_i   n_i   I(p_i, n_i)
<=30     2     3     0.971
31…40    4     0     0
>40      3     2     0.971

Info_age(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

Here \frac{5}{14} I(2,3) means "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's.

53

Attribute Selection: Information Gain

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

(training data: Buys_computer table as above)

Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940

Info_age(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

Gain(age) = Info(D) - Info_age(D) = 0.246

54

Attribute Selection: Information Gain

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

(training data: Buys_computer table as above)

Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940

Info_age(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

Gain(age) = Info(D) - Info_age(D) = 0.246

Similarly:

Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
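All of these numbers can be reproduced mechanically; a self-contained sketch that recomputes every gain on the 14-tuple Buys_computer data:

import math
from collections import Counter

data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
attrs = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def info(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

labels = [t[-1] for t in data]
print(f"Info(D) = {info(labels):.3f}")  # 0.940

for name, i in attrs.items():
    partitions = {}
    for t in data:
        partitions.setdefault(t[i], []).append(t[-1])
    info_a = sum((len(p) / len(data)) * info(p) for p in partitions.values())
    print(f"Gain({name}) = {info(labels) - info_a:.3f}")
# Matches the slides up to rounding: age 0.246, income 0.029,
# student 0.151, credit_rating 0.048 (full precision gives 0.247 / 0.152
# for age and student; the slides round intermediate values).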

gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Page 23: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

24

Heuristic Search in Attribute Selection

There are 2^d possible attribute combinations of d attributes. Typical heuristic attribute selection methods:

Best single attribute under the attribute independence assumption: choose by significance tests

Best step-wise feature selection: the best single attribute is picked first; then the next best attribute, conditioned on the first; and so on (see the sketch after this list)

Step-wise attribute elimination: repeatedly eliminate the worst attribute

Best combined attribute selection and elimination

Optimal branch and bound: use attribute elimination and backtracking
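
A minimal sketch of the step-wise (greedy forward) strategy in Python; it is not from the slides, and score() is an assumed helper (e.g., cross-validated accuracy of a classifier trained on the candidate subset), higher being better.

    def forward_selection(all_features, score, max_features=None):
        """Greedy best step-wise feature selection (a sketch)."""
        selected = []
        remaining = list(all_features)
        best_so_far = float("-inf")
        while remaining and (max_features is None or len(selected) < max_features):
            # Score each remaining feature conditioned on those already picked.
            new_score, best_f = max((score(selected + [f]), f) for f in remaining)
            if new_score <= best_so_far:   # no improvement: stop early
                break
            best_so_far = new_score
            selected.append(best_f)
            remaining.remove(best_f)
        return selected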

25

Attribute Creation (Feature Generation)

Create new attributes (features) that can capture the important information in a data set more effectively than the original ones

Three general methodologies:

Attribute extraction: domain-specific

Mapping data to new space (see data reduction), e.g., Fourier transformation, wavelet transformation, manifold approaches (not covered)

Attribute construction: combining features (see discriminative frequent patterns in the chapter on "Advanced Classification"); data discretization

26

Summary

Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability

Data cleaning: e.g., missing/noisy values, outliers

Data integration from multiple sources:

Entity identification problem; remove redundancies; detect inconsistencies

Data reduction:

Dimensionality reduction; numerosity reduction; data compression

Data transformation and data discretization:

Normalization; concept hierarchy generation

27

References

D. P. Ballou and G. K. Tayi. Enhancing Data Quality in Data Warehouse Environments. Comm. of ACM, 42:73-78, 1999

T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003

T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining Database Structure; or, How to Build a Data Quality Browser. SIGMOD'02

H. V. Jagadish et al. Special Issue on Data Reduction Techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), Dec. 1997

D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999

E. Rahm and H. H. Do. Data Cleaning: Problems and Current Approaches. IEEE Bulletin of the Technical Committee on Data Engineering, Vol. 23, No. 4

V. Raman and J. Hellerstein. Potter's Wheel: An Interactive Framework for Data Cleaning and Transformation. VLDB'2001

T. Redman. Data Quality: Management and Technology. Bantam Books, 1992

R. Wang, V. Storey, and C. Firth. A Framework for Analysis of Data Quality Research. IEEE Trans. Knowledge and Data Engineering, 7:623-640, 1995

CS 412 INTRO. TO DATA MINING

Classification: Basic Concepts
Huan Sun, CSE@The Ohio State University

09/05/2017

28

Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han

29

Classification: Basic Concepts

Classification: Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy: Ensemble Methods

30

Supervised vs. Unsupervised Learning

Supervised learning (classification)

Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations

New data is classified based on the training set

31

Supervised vs. Unsupervised Learning

Supervised learning (classification)

Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations

New data is classified based on the training set

Unsupervised learning (clustering)

The class labels of the training data are unknown

Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

32

Prediction Problems: Classification vs. Numeric Prediction

Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data

Numeric Prediction

models continuous-valued functions, i.e., predicts unknown or missing values

33

Prediction Problems: Classification vs. Numeric Prediction

Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data

Numeric Prediction

models continuous-valued functions, i.e., predicts unknown or missing values

Typical applications

Credit/loan approval

Medical diagnosis: if a tumor is cancerous or benign

Fraud detection: if a transaction is fraudulent

Web page categorization: which category it is

34

Classification: A Two-Step Process

(1) Model construction: describing a set of predetermined classes

Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute

The set of tuples used for model construction is the training set

The model is represented as classification rules, decision trees, or mathematical formulae

35

Classification: A Two-Step Process

(1) Model construction: describing a set of predetermined classes

Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute

The set of tuples used for model construction is the training set

The model is represented as classification rules, decision trees, or mathematical formulae

(2) Model usage: for classifying future or unknown objects

Estimate accuracy of the model: the known label of each test sample is compared with the classified result from the model

Accuracy: the percentage of test set samples that are correctly classified by the model

Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable, use the model to classify new data

36

Classification: A Two-Step Process

(1) Model construction: describing a set of predetermined classes

Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute

The set of tuples used for model construction is the training set

The model is represented as classification rules, decision trees, or mathematical formulae

(2) Model usage: for classifying future or unknown objects

Estimate accuracy of the model: the known label of each test sample is compared with the classified result from the model

Accuracy: the percentage of test set samples that are correctly classified by the model

Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable, use the model to classify new data

Note: if the test set is used to select/refine models, it is called a validation (test) set or development test set

37

Step (1): Model Construction

Training Data:

NAME     RANK            YEARS  TENURED
Mike     Assistant Prof  3      no
Mary     Assistant Prof  7      yes
Bill     Professor       2      yes
Jim      Associate Prof  7      yes
Dave     Assistant Prof  6      no
Anne     Associate Prof  3      no

Classification Algorithms

Classifier (Model)

38

Step (1): Model Construction

Training Data: (NAME / RANK / YEARS / TENURED table as above)

Classification Algorithms

IF rank = 'professor'
OR years > 6
THEN tenured = 'yes'

Classifier (Model)

39

Step (2): Using the Model in Prediction

Classifier

Testing Data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

40

Step (2): Using the Model in Prediction

Classifier

Testing Data: (NAME / RANK / YEARS / TENURED table as above)

New/Unseen Data:

(Jeff, Professor, 4)

Tenured?
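
As a concrete end-to-end illustration of the two steps, here is a hedged sketch using scikit-learn's DecisionTreeClassifier on the tiny tenure tables (not the slides' own code; the one-hot encoding of RANK and the expected output are assumptions made for illustration):

    from sklearn.tree import DecisionTreeClassifier

    RANKS = ["Assistant Prof", "Associate Prof", "Professor"]

    def encode(rank, years):
        # One-hot encode the rank, then append years.
        return [int(rank == r) for r in RANKS] + [years]

    # Step (1): model construction on the training data
    X_train = [encode("Assistant Prof", 3), encode("Assistant Prof", 7),
               encode("Professor", 2),      encode("Associate Prof", 7),
               encode("Assistant Prof", 6), encode("Associate Prof", 3)]
    y_train = ["no", "yes", "yes", "yes", "no", "no"]
    model = DecisionTreeClassifier().fit(X_train, y_train)

    # Step (2): use the model to classify new, unseen data (Jeff, Professor, 4)
    print(model.predict([encode("Professor", 4)]))
    # Expected: ['yes'], consistent with the learned rule
    # IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.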

41

Classification: Basic Concepts

Classification: Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy: Ensemble Methods

Summary

42

Decision Tree Induction: An Example

age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31...40  high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31...40  low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31...40  medium  no       excellent      yes
31...40  high    yes      fair           yes
>40      medium  no       excellent      no

Training data set: Buys_computer. The data set follows an example of Quinlan's ID3 (Playing Tennis).

43

Decision Tree Induction: An Example

(buys_computer training data as above)

Training data set: Buys_computer. The data set follows an example of Quinlan's ID3 (Playing Tennis). Resulting tree:

age?
|-- <=30     -> student?
|              |-- no   -> no
|              `-- yes  -> yes
|-- 31...40  -> yes
`-- >40      -> credit rating?
               |-- excellent -> no
               `-- fair      -> yes

44

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm):

Tree is constructed in a top-down, recursive, divide-and-conquer manner

At start, all the training examples are at the root

Attributes are categorical (if continuous-valued, they are discretized in advance)

Examples are partitioned recursively based on selected attributes

Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

45

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm):

Tree is constructed in a top-down, recursive, divide-and-conquer manner

At start, all the training examples are at the root

Attributes are categorical (if continuous-valued, they are discretized in advance)

Examples are partitioned recursively based on selected attributes

Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Conditions for stopping partitioning (see the sketch after this list):

All samples for a given node belong to the same class

There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf

There are no samples left
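
A compact sketch of this greedy, top-down loop (a sketch, not the textbook's pseudocode). Here data is a list of (attribute_dict, label) pairs, and best_attribute() is an assumed helper implementing an attribute selection measure such as the information gain defined on slide 47 below.

    from collections import Counter

    def build_tree(data, attributes):
        labels = [label for _, label in data]
        if len(set(labels)) == 1:             # all samples belong to one class
            return labels[0]
        if not attributes:                    # no attributes left: majority voting
            return Counter(labels).most_common(1)[0][0]
        a = best_attribute(data, attributes)  # assumed helper, e.g., highest info gain
        tree = {a: {}}
        for v in {x[a] for x, _ in data}:     # partition on each observed value of a
            subset = [(x, y) for x, y in data if x[a] == v]
            tree[a][v] = build_tree(subset, [b for b in attributes if b != a])
        return tree
    # Branches are created only for observed values, so the "no samples left"
    # case never recurses on an empty partition in this sketch.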

46

Brief Review of Entropy

Entropy (Information Theory)

A measure of uncertainty associated with a random variable

Calculation: for a discrete random variable Y taking m distinct values {y_1, y_2, ..., y_m}, with p_i = P(Y = y_i):

H(Y) = -\sum_{i=1}^{m} p_i \log_2(p_i)

Interpretation: higher entropy → higher uncertainty; lower entropy → lower uncertainty

Conditional entropy: H(Y|X) = \sum_{x} p(x) H(Y|X = x)

(figure: entropy curve for a binary variable, m = 2, peaking at 1 when both values are equally likely)
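
For the binary case plotted on the slide (m = 2), a small sketch of the calculation; p is the probability of one of the two values:

    from math import log2

    def binary_entropy(p):
        if p in (0.0, 1.0):                  # p * log2(p) -> 0 as p -> 0
            return 0.0
        return -p * log2(p) - (1 - p) * log2(1 - p)

    print(binary_entropy(0.5))   # 1.0, maximum uncertainty
    print(binary_entropy(0.9))   # ~0.469, lower uncertainty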

47

Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain

Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|

Expected information (entropy) needed to classify a tuple in D:

Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

Information needed (after using A to split D into v partitions) to classify D:

Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)

Information gained by branching on attribute A:

Gain(A) = Info(D) - Info_A(D)
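
The three formulas translate directly into a short Python sketch (an assumed representation: each tuple is a dict of attribute -> value, with the class label under a target key):

    from collections import Counter
    from math import log2

    def info(labels):
        """Info(D): expected information (entropy) of a list of class labels."""
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def info_after_split(rows, attr, target):
        """Info_A(D): weighted entropy after splitting rows on attr."""
        n = len(rows)
        return sum(
            len(part) / n * info(part)
            for part in (
                [r[target] for r in rows if r[attr] == v]
                for v in {r[attr] for r in rows}
            )
        )

    def gain(rows, attr, target):
        """Gain(A) = Info(D) - Info_A(D)."""
        return info([r[target] for r in rows]) - info_after_split(rows, attr, target)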

48

Attribute Selection: Information Gain

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

(buys_computer training data as above)

How to select the first attribute?

49

Attribute Selection: Information Gain

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

(buys_computer training data as above)

Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940

50

Attribute Selection: Information Gain

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

(buys_computer training data as above)

Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940

Look at "age":

age      p_i  n_i  I(p_i, n_i)
<=30     2    3    0.971
31...40  4    0    0
>40      3    2    0.971

51

Attribute Selection: Information Gain

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

(buys_computer training data as above)

Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940

Look at "age":

age      p_i  n_i  I(p_i, n_i)
<=30     2    3    0.971
31...40  4    0    0
>40      3    2    0.971

Info_{age}(D) = \frac{5}{14}I(2,3) + \frac{4}{14}I(4,0) + \frac{5}{14}I(3,2) = 0.694

52

Attribute Selection: Information Gain

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

Look at "age":

age      p_i  n_i  I(p_i, n_i)
<=30     2    3    0.971
31...40  4    0    0
>40      3    2    0.971

Info_{age}(D) = \frac{5}{14}I(2,3) + \frac{4}{14}I(4,0) + \frac{5}{14}I(3,2) = 0.694

The term \frac{5}{14}I(2,3) means "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's.
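
Writing that term out explicitly (routine entropy arithmetic, not extra slide content):

I(2,3) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} \approx 0.971

Info_{age}(D) = \frac{5}{14}(0.971) + \frac{4}{14}(0) + \frac{5}{14}(0.971) \approx 0.694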

53

Attribute Selection: Information Gain

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

(buys_computer training data as above)

Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940

Info_{age}(D) = \frac{5}{14}I(2,3) + \frac{4}{14}I(4,0) + \frac{5}{14}I(3,2) = 0.694

Gain(age) = Info(D) - Info_{age}(D) = 0.246

54

Attribute Selection: Information Gain

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

(buys_computer training data as above)

Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940

Info_{age}(D) = \frac{5}{14}I(2,3) + \frac{4}{14}I(4,0) + \frac{5}{14}I(3,2) = 0.694

Gain(age) = Info(D) - Info_{age}(D) = 0.246

Similarly,

Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
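
The four gains can be reproduced with the info()/gain() sketch from slide 47; the snippet below (a hypothetical check, not slide content) hard-codes the 14-row buys_computer table:

    COLS = ["age", "income", "student", "credit_rating", "buys_computer"]
    DATA = [
        ("<=30",    "high",   "no",  "fair",      "no"),
        ("<=30",    "high",   "no",  "excellent", "no"),
        ("31...40", "high",   "no",  "fair",      "yes"),
        (">40",     "medium", "no",  "fair",      "yes"),
        (">40",     "low",    "yes", "fair",      "yes"),
        (">40",     "low",    "yes", "excellent", "no"),
        ("31...40", "low",    "yes", "excellent", "yes"),
        ("<=30",    "medium", "no",  "fair",      "no"),
        ("<=30",    "low",    "yes", "fair",      "yes"),
        (">40",     "medium", "yes", "fair",      "yes"),
        ("<=30",    "medium", "yes", "excellent", "yes"),
        ("31...40", "medium", "no",  "excellent", "yes"),
        ("31...40", "high",   "yes", "fair",      "yes"),
        (">40",     "medium", "no",  "excellent", "no"),
    ]
    rows = [dict(zip(COLS, r)) for r in DATA]
    for a in COLS[:-1]:
        print(a, round(gain(rows, a, "buys_computer"), 3))
    # age 0.246, income 0.029, student 0.151, credit_rating 0.048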


  • CSE 5243 Intro. to Data Mining
  • Chapter 3: Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization: Binning
  • Simple Discretization: Binning
  • Example: Binning Methods for Data Smoothing
  • Discretization by Classification & Correlation Analysis
  • Chapter 3: Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis: Intuition
  • Principal Component Analysis: Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro. to Data Mining
  • Classification: Basic Concepts
  • Supervised vs. Unsupervised Learning
  • Supervised vs. Unsupervised Learning
  • Prediction Problems: Classification vs. Numeric Prediction
  • Prediction Problems: Classification vs. Numeric Prediction
  • Classification: A Two-Step Process
  • Classification: A Two-Step Process
  • Classification: A Two-Step Process
  • Step (1): Model Construction
  • Step (1): Model Construction
  • Step (2): Using the Model in Prediction
  • Step (2): Using the Model in Prediction
  • Classification: Basic Concepts
  • Decision Tree Induction: An Example
  • Decision Tree Induction: An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure: Information Gain (ID3/C4.5)
  • Attribute Selection: Information Gain
  • Attribute Selection: Information Gain
  • Attribute Selection: Information Gain
  • Attribute Selection: Information Gain
  • Attribute Selection: Information Gain
  • Attribute Selection: Information Gain
  • Attribute Selection: Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Page 24: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

25

Attribute Creation (Feature Generation)

Create new attributes (features) that can capture the important information in a data set more effectively than the original ones

Three general methodologies Attribute extraction Domain-specific

Mapping data to new space (see data reduction) Eg Fourier transformation wavelet transformation manifold approaches (not covered)

Attribute construction Combining features (see discriminative frequent patterns in Chapter on ldquoAdvanced

Classificationrdquo) Data discretization

26

Summary

Data quality accuracy completeness consistency timeliness believability interpretability

Data cleaning eg missingnoisy values outliers

Data integration from multiple sources

Entity identification problem Remove redundancies Detect inconsistencies

Data reduction

Dimensionality reduction Numerosity reduction Data compression

Data transformation and data discretization

Normalization Concept hierarchy generation

27

D P Ballou and G K Tayi Enhancing data quality in data warehouse environments Comm of ACM 4273-78 1999

T Dasu and T Johnson Exploratory Data Mining and Data Cleaning John Wiley 2003 T Dasu T Johnson S Muthukrishnan V Shkapenyuk Mining Database Structure Or How to Build a Data

Quality Browser SIGMODrsquo02 H V Jagadish et al Special Issue on Data Reduction Techniques Bulletin of the Technical Committee on

Data Engineering 20(4) Dec 1997 D Pyle Data Preparation for Data Mining Morgan Kaufmann 1999 E Rahm and H H Do Data Cleaning Problems and Current Approaches IEEE Bulletin of the Technical

Committee on Data Engineering Vol23 No4 V Raman and J Hellerstein Potters Wheel An Interactive Framework for Data Cleaning and

Transformation VLDBrsquo2001 T Redman Data Quality Management and Technology Bantam Books 1992 R Wang V Storey and C Firth A framework for analysis of data quality research IEEE Trans

Knowledge and Data Engineering 7623-640 1995

References

CS 412 INTRO TO DATA MINING

Classification Basic Concepts Huan Sun CSEThe Ohio State University

09052017

28Slides adapted from UIUC CS412 Fall 2017 by Prof Jiawei Han

29

Classification Basic Concepts Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

30

Supervised vs Unsupervised Learning Supervised learning (classification)

Supervision The training data (observations measurements etc) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

31

Supervised vs Unsupervised Learning Supervised learning (classification)

Supervision The training data (observations measurements etc) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

Unsupervised learning (clustering)

The class labels of training data is unknown

Given a set of measurements observations etc with the aim of establishing the

existence of classes or clusters in the data

32

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

33

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

Typical applications

Creditloan approval

Medical diagnosis if a tumor is cancerous or benign

Fraud detection if a transaction is fraudulent

Web page categorization which category it is

34

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

35

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

36

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

Note If the test set is used to selectrefine models it is called validation (test) set or development test set

37

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

Classifier(Model)

Sheet1

38

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo

Classifier(Model)

Sheet1

39

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Sheet1

40

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

NewUnseen Data

(Jeff Professor 4)

Tenured

Sheet1

41

Classification Basic Concepts

Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

42

Decision Tree Induction An Example

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis)

Sheet1

43

Decision Tree Induction An Example

age

overcast

student credit rating

lt=30 gt40

no yes yes

yes

3140

fairexcellentyesno

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis) Resulting tree

Sheet1

44

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain)

45

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain) Conditions for stopping partitioning

All samples for a given node belong to the same class There are no remaining attributes for further partitioningmdashmajority voting is

employed for classifying the leaf There are no samples left

46

Brief Review of Entropy Entropy (Information Theory)

A measure of uncertainty associated with a random number Calculation For a discrete random variable Y taking m distinct values y1 y2 hellip ym

Interpretation Higher entropy rarr higher uncertainty Lower entropy rarr lower uncertainty

Conditional entropy

m = 2

47

Attribute Selection Measure Information Gain (ID3C45)

Select the attribute with the highest information gain Let pi be the probability that an arbitrary tuple in D belongs to class Ci

estimated by |Ci D||D| Expected information (entropy) needed to classify a tuple in D

Information needed (after using A to split D into v partitions) to classify D

Information gained by branching on attribute A

)(log)( 21

i

m

ii ppDInfo sum

=

minus=

)(||||

)(1

j

v

j

jA DInfo

DD

DInfo times=sum=

(D)InfoInfo(D)Gain(A) Aminus=

48

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

How to select the first attribute

Sheet1

49

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

Sheet1

50

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

Sheet1

51

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

Sheet1

52

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

means ldquoage lt=30rdquo has 5 out of 14 samples with 2 yesrsquoes and 3 norsquos

)32(145 I

53

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Sheet1

54

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Similarly

0480)_(1510)(0290)(

===

ratingcreditGainstudentGainincomeGain How

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Page 25: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

26

Summary

Data quality accuracy completeness consistency timeliness believability interpretability

Data cleaning eg missingnoisy values outliers

Data integration from multiple sources

Entity identification problem Remove redundancies Detect inconsistencies

Data reduction

Dimensionality reduction Numerosity reduction Data compression

Data transformation and data discretization

Normalization Concept hierarchy generation

27

D P Ballou and G K Tayi Enhancing data quality in data warehouse environments Comm of ACM 4273-78 1999

T Dasu and T Johnson Exploratory Data Mining and Data Cleaning John Wiley 2003 T Dasu T Johnson S Muthukrishnan V Shkapenyuk Mining Database Structure Or How to Build a Data

Quality Browser SIGMODrsquo02 H V Jagadish et al Special Issue on Data Reduction Techniques Bulletin of the Technical Committee on

Data Engineering 20(4) Dec 1997 D Pyle Data Preparation for Data Mining Morgan Kaufmann 1999 E Rahm and H H Do Data Cleaning Problems and Current Approaches IEEE Bulletin of the Technical

Committee on Data Engineering Vol23 No4 V Raman and J Hellerstein Potters Wheel An Interactive Framework for Data Cleaning and

Transformation VLDBrsquo2001 T Redman Data Quality Management and Technology Bantam Books 1992 R Wang V Storey and C Firth A framework for analysis of data quality research IEEE Trans

Knowledge and Data Engineering 7623-640 1995

References

CS 412 INTRO TO DATA MINING

Classification Basic Concepts Huan Sun CSEThe Ohio State University

09052017

28Slides adapted from UIUC CS412 Fall 2017 by Prof Jiawei Han

29

Classification Basic Concepts Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

30

Supervised vs Unsupervised Learning Supervised learning (classification)

Supervision The training data (observations measurements etc) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

31

Supervised vs Unsupervised Learning Supervised learning (classification)

Supervision The training data (observations measurements etc) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

Unsupervised learning (clustering)

The class labels of training data is unknown

Given a set of measurements observations etc with the aim of establishing the

existence of classes or clusters in the data

32

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

33

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

Typical applications

Creditloan approval

Medical diagnosis if a tumor is cancerous or benign

Fraud detection if a transaction is fraudulent

Web page categorization which category it is

34

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

35

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

36

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

Note If the test set is used to selectrefine models it is called validation (test) set or development test set

37

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

Classifier(Model)

Sheet1

38

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo

Classifier(Model)

Sheet1

39

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Sheet1

40

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

NewUnseen Data

(Jeff Professor 4)

Tenured

Sheet1

41

Classification Basic Concepts

Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

42

Decision Tree Induction An Example

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis)

Sheet1

43

Decision Tree Induction An Example

age

overcast

student credit rating

lt=30 gt40

no yes yes

yes

3140

fairexcellentyesno

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis) Resulting tree

Sheet1

44

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm):
The tree is constructed in a top-down, recursive, divide-and-conquer manner
At the start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized in advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Conditions for stopping partitioning (see the sketch below):
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning; in that case, majority voting is employed for classifying the leaf
There are no samples left
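A minimal sketch of this greedy, recursive procedure in plain Python (assuming a gain() helper implementing the information-gain measure introduced on the following slides; a matching sketch is given there, and all names here are illustrative):

    from collections import Counter

    def id3(examples, attributes, target):
        """Top-down, recursive, divide-and-conquer tree induction (a sketch).

        examples   -- list of dicts mapping attribute name -> categorical value
        attributes -- candidate test attributes remaining at this node
        target     -- name of the class-label attribute
        """
        labels = [ex[target] for ex in examples]
        if len(set(labels)) == 1:            # all samples in the same class
            return labels[0]
        if not attributes:                   # no attributes left: majority voting
            return Counter(labels).most_common(1)[0][0]
        # Greedy step: pick the attribute with the highest information gain
        best = max(attributes, key=lambda a: gain(examples, a, target))
        node = {best: {}}
        for value in sorted({ex[best] for ex in examples}):
            # Branching only on observed values keeps every subset non-empty,
            # which covers the "no samples left" stopping condition.
            subset = [ex for ex in examples if ex[best] == value]
            remaining = [a for a in attributes if a != best]
            node[best][value] = id3(subset, remaining, target)
        return node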

46

Brief Review of Entropy

Entropy (information theory): a measure of the uncertainty associated with a random variable.
Calculation: for a discrete random variable Y taking m distinct values {y1, y2, …, ym} with pi = P(Y = yi),

    H(Y) = -Σ_{i=1}^{m} pi log2(pi)

Interpretation: higher entropy → higher uncertainty; lower entropy → lower uncertainty.
Conditional entropy:

    H(Y | X) = Σ_x P(X = x) · H(Y | X = x)

(The slide's figure plots the binary case m = 2: entropy peaks at 1 bit when both values are equally likely.)
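A quick numeric check of the definition in plain Python:

    import math

    def entropy(probs):
        """H(Y) = -sum(p * log2(p)) over a distribution (0*log 0 taken as 0)."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))  # 1.0    -> maximal uncertainty for m = 2
    print(entropy([0.9, 0.1]))  # ~0.469 -> lower uncertainty
    print(entropy([1.0]))       # 0.0    -> no uncertainty at all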

47

Attribute Selection Measure Information Gain (ID3C45)

Select the attribute with the highest information gain.
Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|.

Expected information (entropy) needed to classify a tuple in D:

    Info(D) = -Σ_{i=1}^{m} pi log2(pi)

Information needed (after using A to split D into v partitions) to classify D:

    Info_A(D) = Σ_{j=1}^{v} (|Dj| / |D|) × Info(Dj)

Information gained by branching on attribute A:

    Gain(A) = Info(D) - Info_A(D)
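These formulas transcribe directly into code; a sketch that also supplies the gain() helper assumed in the induction sketch above (same hypothetical signatures):

    import math
    from collections import Counter

    def info(examples, target):
        """Info(D) = -sum_i p_i log2(p_i) over the class proportions in D."""
        n = len(examples)
        counts = Counter(ex[target] for ex in examples)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    def gain(examples, attribute, target):
        """Gain(A) = Info(D) - Info_A(D), with Info_A(D) the weighted
        entropy of the partitions D_j induced by the values of A."""
        n = len(examples)
        info_a = 0.0
        for value in {ex[attribute] for ex in examples}:
            dj = [ex for ex in examples if ex[attribute] == value]
            info_a += (len(dj) / n) * info(dj, target)
        return info(examples, target) - info_a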

48

Attribute Selection: Information Gain

Class P: buys_computer = "yes"; Class N: buys_computer = "no" (training data as above).

How to select the first attribute? With 9 P-tuples and 5 N-tuples among the 14:

    Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Look at "age":

    age     pi  ni  I(pi, ni)
    <=30    2   3   0.971
    31…40   4   0   0
    >40     3   2   0.971

    Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

where (5/14) I(2,3) means that "age <=30" has 5 out of 14 samples, with 2 yes'es and 3 no's. Hence

    Gain(age) = Info(D) - Info_age(D) = 0.246

Similarly,

    Gain(income) = 0.029
    Gain(student) = 0.151
    Gain(credit_rating) = 0.048

so age, having the highest information gain, is selected as the first (root) attribute.

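These figures can be reproduced with the info() and gain() sketches above (same hypothetical signatures):

    rows = [
        ("<=30",  "high",   "no",  "fair",      "no"),
        ("<=30",  "high",   "no",  "excellent", "no"),
        ("31…40", "high",   "no",  "fair",      "yes"),
        (">40",   "medium", "no",  "fair",      "yes"),
        (">40",   "low",    "yes", "fair",      "yes"),
        (">40",   "low",    "yes", "excellent", "no"),
        ("31…40", "low",    "yes", "excellent", "yes"),
        ("<=30",  "medium", "no",  "fair",      "no"),
        ("<=30",  "low",    "yes", "fair",      "yes"),
        (">40",   "medium", "yes", "fair",      "yes"),
        ("<=30",  "medium", "yes", "excellent", "yes"),
        ("31…40", "medium", "no",  "excellent", "yes"),
        ("31…40", "high",   "yes", "fair",      "yes"),
        (">40",   "medium", "no",  "excellent", "no"),
    ]
    cols = ["age", "income", "student", "credit_rating", "buys_computer"]
    examples = [dict(zip(cols, row)) for row in rows]

    print(f"Info(D) = {info(examples, 'buys_computer'):.3f}")   # 0.940
    for a in cols[:-1]:
        print(f"Gain({a}) = {gain(examples, a, 'buys_computer'):.3f}")
    # Gain(age) = 0.246, Gain(income) = 0.029,
    # Gain(student) = 0.151, Gain(credit_rating) = 0.048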

Page 26: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

27

D P Ballou and G K Tayi Enhancing data quality in data warehouse environments Comm of ACM 4273-78 1999

T Dasu and T Johnson Exploratory Data Mining and Data Cleaning John Wiley 2003 T Dasu T Johnson S Muthukrishnan V Shkapenyuk Mining Database Structure Or How to Build a Data

Quality Browser SIGMODrsquo02 H V Jagadish et al Special Issue on Data Reduction Techniques Bulletin of the Technical Committee on

Data Engineering 20(4) Dec 1997 D Pyle Data Preparation for Data Mining Morgan Kaufmann 1999 E Rahm and H H Do Data Cleaning Problems and Current Approaches IEEE Bulletin of the Technical

Committee on Data Engineering Vol23 No4 V Raman and J Hellerstein Potters Wheel An Interactive Framework for Data Cleaning and

Transformation VLDBrsquo2001 T Redman Data Quality Management and Technology Bantam Books 1992 R Wang V Storey and C Firth A framework for analysis of data quality research IEEE Trans

Knowledge and Data Engineering 7623-640 1995

References

CS 412 INTRO TO DATA MINING

Classification Basic Concepts Huan Sun CSEThe Ohio State University

09052017

28Slides adapted from UIUC CS412 Fall 2017 by Prof Jiawei Han

29

Classification Basic Concepts Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

30

Supervised vs Unsupervised Learning Supervised learning (classification)

Supervision The training data (observations measurements etc) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

31

Supervised vs Unsupervised Learning Supervised learning (classification)

Supervision The training data (observations measurements etc) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

Unsupervised learning (clustering)

The class labels of training data is unknown

Given a set of measurements observations etc with the aim of establishing the

existence of classes or clusters in the data

32

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

33

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

Typical applications

Creditloan approval

Medical diagnosis if a tumor is cancerous or benign

Fraud detection if a transaction is fraudulent

Web page categorization which category it is

34

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

35

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

36

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

Note If the test set is used to selectrefine models it is called validation (test) set or development test set

37

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

Classifier(Model)

Sheet1

38

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo

Classifier(Model)

Sheet1

39

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Sheet1

40

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

NewUnseen Data

(Jeff Professor 4)

Tenured

Sheet1

41

Classification Basic Concepts

Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

42

Decision Tree Induction An Example

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis)

Sheet1

43

Decision Tree Induction An Example

age

overcast

student credit rating

lt=30 gt40

no yes yes

yes

3140

fairexcellentyesno

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis) Resulting tree

Sheet1

44

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain)

45

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain) Conditions for stopping partitioning

All samples for a given node belong to the same class There are no remaining attributes for further partitioningmdashmajority voting is

employed for classifying the leaf There are no samples left

46

Brief Review of Entropy Entropy (Information Theory)

A measure of uncertainty associated with a random number Calculation For a discrete random variable Y taking m distinct values y1 y2 hellip ym

Interpretation Higher entropy rarr higher uncertainty Lower entropy rarr lower uncertainty

Conditional entropy

m = 2

47

Attribute Selection Measure Information Gain (ID3C45)

Select the attribute with the highest information gain Let pi be the probability that an arbitrary tuple in D belongs to class Ci

estimated by |Ci D||D| Expected information (entropy) needed to classify a tuple in D

Information needed (after using A to split D into v partitions) to classify D

Information gained by branching on attribute A

)(log)( 21

i

m

ii ppDInfo sum

=

minus=

)(||||

)(1

j

v

j

jA DInfo

DD

DInfo times=sum=

(D)InfoInfo(D)Gain(A) Aminus=

48

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

How to select the first attribute

Sheet1

49

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

Sheet1

50

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

Sheet1

51

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

Sheet1

52

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

means ldquoage lt=30rdquo has 5 out of 14 samples with 2 yesrsquoes and 3 norsquos

)32(145 I

53

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Sheet1

54

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Similarly

0480)_(1510)(0290)(

===

ratingcreditGainstudentGainincomeGain How

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Page 27: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

CS 412 INTRO TO DATA MINING

Classification Basic Concepts Huan Sun CSEThe Ohio State University

09052017

28Slides adapted from UIUC CS412 Fall 2017 by Prof Jiawei Han

29

Classification Basic Concepts Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

30

Supervised vs Unsupervised Learning Supervised learning (classification)

Supervision The training data (observations measurements etc) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

31

Supervised vs Unsupervised Learning Supervised learning (classification)

Supervision The training data (observations measurements etc) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

Unsupervised learning (clustering)

The class labels of training data is unknown

Given a set of measurements observations etc with the aim of establishing the

existence of classes or clusters in the data

32

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

33

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

Typical applications

Creditloan approval

Medical diagnosis if a tumor is cancerous or benign

Fraud detection if a transaction is fraudulent

Web page categorization which category it is

34

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

35

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

36

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

Note If the test set is used to selectrefine models it is called validation (test) set or development test set

37

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

Classifier(Model)

Sheet1

38

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo

Classifier(Model)

Sheet1

39

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Sheet1

40

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

NewUnseen Data

(Jeff Professor 4)

Tenured

Sheet1

41

Classification Basic Concepts

Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

42

Decision Tree Induction An Example

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis)

Sheet1

43

Decision Tree Induction An Example

age

overcast

student credit rating

lt=30 gt40

no yes yes

yes

3140

fairexcellentyesno

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis) Resulting tree

Sheet1

44

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain)

45

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain) Conditions for stopping partitioning

All samples for a given node belong to the same class There are no remaining attributes for further partitioningmdashmajority voting is

employed for classifying the leaf There are no samples left

46

Brief Review of Entropy Entropy (Information Theory)

A measure of uncertainty associated with a random number Calculation For a discrete random variable Y taking m distinct values y1 y2 hellip ym

Interpretation Higher entropy rarr higher uncertainty Lower entropy rarr lower uncertainty

Conditional entropy

m = 2

47

Attribute Selection Measure Information Gain (ID3C45)

Select the attribute with the highest information gain Let pi be the probability that an arbitrary tuple in D belongs to class Ci

estimated by |Ci D||D| Expected information (entropy) needed to classify a tuple in D

Information needed (after using A to split D into v partitions) to classify D

Information gained by branching on attribute A

)(log)( 21

i

m

ii ppDInfo sum

=

minus=

)(||||

)(1

j

v

j

jA DInfo

DD

DInfo times=sum=

(D)InfoInfo(D)Gain(A) Aminus=

48

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

How to select the first attribute

Sheet1

49

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

Sheet1

50

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

Sheet1

51

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

Sheet1

52

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

means ldquoage lt=30rdquo has 5 out of 14 samples with 2 yesrsquoes and 3 norsquos

)32(145 I

53

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Sheet1

54

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Similarly

0480)_(1510)(0290)(

===

ratingcreditGainstudentGainincomeGain How

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Page 28: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

29

Classification Basic Concepts Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

30

Supervised vs Unsupervised Learning Supervised learning (classification)

Supervision The training data (observations measurements etc) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

31

Supervised vs Unsupervised Learning Supervised learning (classification)

Supervision The training data (observations measurements etc) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

Unsupervised learning (clustering)

The class labels of training data is unknown

Given a set of measurements observations etc with the aim of establishing the

existence of classes or clusters in the data

32

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

33

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

Typical applications

Creditloan approval

Medical diagnosis if a tumor is cancerous or benign

Fraud detection if a transaction is fraudulent

Web page categorization which category it is

34

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

35

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

36

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

Note If the test set is used to selectrefine models it is called validation (test) set or development test set

37

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

Classifier(Model)

Sheet1

38

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo

Classifier(Model)

Sheet1

39

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Sheet1

40

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

NewUnseen Data

(Jeff Professor 4)

Tenured

Sheet1

41

Classification Basic Concepts

Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

42

Decision Tree Induction An Example

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

Training data set: Buys_computer. The data set follows an example of Quinlan's ID3 (Playing Tennis).

43

Decision Tree Induction An Example

Training data set: Buys_computer (the table on the previous slide). The data set follows an example of Quinlan's ID3 (Playing Tennis). Resulting tree:

age?
├─ <=30  → student?
│          ├─ no  → no
│          └─ yes → yes
├─ 31…40 → yes
└─ >40   → credit_rating?
           ├─ excellent → no
           └─ fair      → yes
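A compact way to see what the resulting tree does is to encode it as nested dictionaries and walk it. The encoding below is my own sketch of the tree above, not code from the slides.

# The resulting tree as nested dicts: an internal node maps an attribute
# name to its branches; a leaf is a plain class-label string (sketch).
tree = {"age": {
    "<=30": {"student": {"no": "no", "yes": "yes"}},
    "31…40": "yes",
    ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}},
}}

def classify(tree, x):
    # Walk downward until a leaf (a label string) is reached.
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[x[attribute]]
    return tree

x = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}
print(classify(tree, x))  # -> 'yes'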

44

Algorithm for Decision Tree Induction: Basic algorithm (a greedy algorithm)

The tree is constructed in a top-down, recursive, divide-and-conquer manner.

At the start, all the training examples are at the root. Attributes are categorical (if continuous-valued, they are discretized in advance). Examples are partitioned recursively based on selected attributes. Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).

45

Algorithm for Decision Tree Induction: Basic algorithm (a greedy algorithm)

The tree is constructed in a top-down, recursive, divide-and-conquer manner.

At the start, all the training examples are at the root. Attributes are categorical (if continuous-valued, they are discretized in advance). Examples are partitioned recursively based on selected attributes. Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).

Conditions for stopping partitioning (see the sketch below):
All samples for a given node belong to the same class.
There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf.
There are no samples left.
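The basic algorithm and its stopping conditions can be summarized in a short recursive sketch. This is an illustrative simplification: best_attribute is an assumed helper implementing the selection measure (e.g., information gain), and empty branches are sidestepped by iterating only over attribute values that actually occur.

from collections import Counter

def induce_tree(examples, attributes, best_attribute):
    # examples: list of (attribute-dict, class-label) pairs
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:           # stop: all samples in the same class
        return labels[0]
    if not attributes:                  # stop: no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(examples, attributes)    # greedy choice, e.g., max info gain
    node = {a: {}}
    for value in {x[a] for x, _ in examples}:   # observed values only, so no empty branch
        subset = [(x, label) for x, label in examples if x[a] == value]
        remaining = [b for b in attributes if b != a]
        node[a][value] = induce_tree(subset, remaining, best_attribute)
    return node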

46

Brief Review of Entropy
Entropy (Information Theory):

A measure of uncertainty associated with a random variable. Calculation: for a discrete random variable Y taking m distinct values {y_1, y_2, …, y_m} with p_i = P(Y = y_i):

$$H(Y) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

Interpretation: higher entropy → higher uncertainty; lower entropy → lower uncertainty.

Conditional entropy: $H(Y \mid X) = \sum_{x} P(X = x)\, H(Y \mid X = x)$

(Figure: entropy curve of a binary variable, m = 2.)
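As a quick illustration (not from the slides), a few lines of Python compute H(Y) from a probability vector; for m = 2, entropy peaks at 1 bit when both outcomes are equally likely.

from math import log2

def entropy(probs):
    # H(Y) = -sum_i p_i * log2(p_i), with the convention 0 * log2(0) = 0
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0   -> maximal uncertainty for m = 2
print(entropy([0.9, 0.1]))  # ~0.47 -> lower uncertainty
print(entropy([1.0, 0.0]))  # 0.0   -> no uncertainty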

47

Attribute Selection Measure Information Gain (ID3C45)

Select the attribute with the highest information gain. Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|.

Expected information (entropy) needed to classify a tuple in D:

$$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

Information needed (after using A to split D into v partitions) to classify D:

$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$$

Information gained by branching on attribute A:

$$Gain(A) = Info(D) - Info_A(D)$$
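The three formulas translate directly into Python. In this sketch (my own, with D assumed to be a list of (attribute-dict, class-label) pairs):

from collections import Counter
from math import log2

def info(D):
    # Info(D) = -sum_i p_i log2(p_i), with p_i estimated by |C_i,D| / |D|
    counts = Counter(label for _, label in D)
    return -sum(c / len(D) * log2(c / len(D)) for c in counts.values())

def info_a(D, A):
    # Info_A(D): entropy of each partition D_j, weighted by |D_j| / |D|
    partitions = {}
    for x, label in D:
        partitions.setdefault(x[A], []).append((x, label))
    return sum(len(Dj) / len(D) * info(Dj) for Dj in partitions.values())

def gain(D, A):
    # Gain(A) = Info(D) - Info_A(D)
    return info(D) - info_a(D, A)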

48

Attribute Selection: Information Gain
Class P: buys_computer = "yes"
Class N: buys_computer = "no"

(Buys_computer training data, as above)

How to select the first attribute?

49

Attribute Selection: Information Gain
Class P: buys_computer = "yes"
Class N: buys_computer = "no"

(Buys_computer training data, as above)

$$Info(D) = I(9,5) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940$$

50

Attribute Selection: Information Gain
Class P: buys_computer = "yes"
Class N: buys_computer = "no"

(Buys_computer training data, as above)

$$Info(D) = I(9,5) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940$$

Look at "age":

age     p_i   n_i   I(p_i, n_i)
<=30    2     3     0.971
31…40   4     0     0
>40     3     2     0.971

51

Attribute Selection: Information Gain
Class P: buys_computer = "yes"
Class N: buys_computer = "no"

(Buys_computer training data, as above)

$$Info(D) = I(9,5) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940$$

Look at "age":

age     p_i   n_i   I(p_i, n_i)
<=30    2     3     0.971
31…40   4     0     0
>40     3     2     0.971

$$Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$$

52

Attribute Selection: Information Gain
Class P: buys_computer = "yes"
Class N: buys_computer = "no"

Look at "age":

age     p_i   n_i   I(p_i, n_i)
<=30    2     3     0.971
31…40   4     0     0
>40     3     2     0.971

$$Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$$

The term $\frac{5}{14} I(2,3)$ means "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's.

53

Attribute Selection: Information Gain
Class P: buys_computer = "yes"
Class N: buys_computer = "no"

(Buys_computer training data, as above)

$$Info(D) = I(9,5) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940$$

$$Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$$

$$Gain(age) = Info(D) - Info_{age}(D) = 0.246$$

54

Attribute Selection: Information Gain
Class P: buys_computer = "yes"
Class N: buys_computer = "no"

(Buys_computer training data, as above)

$$Info(D) = I(9,5) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940$$

$$Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$$

$$Gain(age) = Info(D) - Info_{age}(D) = 0.246$$

Similarly:

$$Gain(income) = 0.029 \qquad Gain(student) = 0.151 \qquad Gain(credit\_rating) = 0.048$$

How?
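Running the same computation over the full 14-tuple Buys_computer table reproduces the gains on these slides, up to rounding (a self-contained sketch added for illustration):

from collections import Counter
from math import log2

rows = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
columns = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def info(labels):
    counts = Counter(labels)
    return -sum(c / len(labels) * log2(c / len(labels)) for c in counts.values())

def gain(attribute):
    col = columns[attribute]
    parts = {}
    for r in rows:
        parts.setdefault(r[col], []).append(r[-1])
    weighted = sum(len(p) / len(rows) * info(p) for p in parts.values())
    return info([r[-1] for r in rows]) - weighted

print(f"Info(D) = {info([r[-1] for r in rows]):.3f}")  # 0.940
for a in columns:
    print(f"Gain({a}) = {gain(a):.3f}")
# Matches the slides up to rounding: age 0.247 (slides: 0.246, from the
# rounded 0.940 - 0.694), income 0.029, student 0.152 (slides: 0.151),
# credit_rating 0.048. Age has the highest gain, so it is split first.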


Page 29: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

30

Supervised vs Unsupervised Learning Supervised learning (classification)

Supervision The training data (observations measurements etc) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

31

Supervised vs Unsupervised Learning Supervised learning (classification)

Supervision The training data (observations measurements etc) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

Unsupervised learning (clustering)

The class labels of training data is unknown

Given a set of measurements observations etc with the aim of establishing the

existence of classes or clusters in the data

32

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

33

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

Typical applications

Creditloan approval

Medical diagnosis if a tumor is cancerous or benign

Fraud detection if a transaction is fraudulent

Web page categorization which category it is

34

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

35

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

36

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

Note If the test set is used to selectrefine models it is called validation (test) set or development test set

37

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

Classifier(Model)

Sheet1

38

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo

Classifier(Model)

Sheet1

39

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Sheet1

40

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

NewUnseen Data

(Jeff Professor 4)

Tenured

Sheet1

41

Classification Basic Concepts

Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

42

Decision Tree Induction An Example

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis)

Sheet1

43

Decision Tree Induction An Example

age

overcast

student credit rating

lt=30 gt40

no yes yes

yes

3140

fairexcellentyesno

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis) Resulting tree

Sheet1

44

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain)

45

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain) Conditions for stopping partitioning

All samples for a given node belong to the same class There are no remaining attributes for further partitioningmdashmajority voting is

employed for classifying the leaf There are no samples left

46

Brief Review of Entropy Entropy (Information Theory)

A measure of uncertainty associated with a random number Calculation For a discrete random variable Y taking m distinct values y1 y2 hellip ym

Interpretation Higher entropy rarr higher uncertainty Lower entropy rarr lower uncertainty

Conditional entropy

m = 2

47

Attribute Selection Measure Information Gain (ID3C45)

Select the attribute with the highest information gain Let pi be the probability that an arbitrary tuple in D belongs to class Ci

estimated by |Ci D||D| Expected information (entropy) needed to classify a tuple in D

Information needed (after using A to split D into v partitions) to classify D

Information gained by branching on attribute A

)(log)( 21

i

m

ii ppDInfo sum

=

minus=

)(||||

)(1

j

v

j

jA DInfo

DD

DInfo times=sum=

(D)InfoInfo(D)Gain(A) Aminus=

48

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

How to select the first attribute

Sheet1

49

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

Sheet1

50

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

Sheet1

51

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

Sheet1

52

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

means ldquoage lt=30rdquo has 5 out of 14 samples with 2 yesrsquoes and 3 norsquos

)32(145 I

53

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Sheet1

54

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Similarly

0480)_(1510)(0290)(

===

ratingcreditGainstudentGainincomeGain How

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Page 30: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

31

Supervised vs Unsupervised Learning Supervised learning (classification)

Supervision The training data (observations measurements etc) are accompanied

by labels indicating the class of the observations

New data is classified based on the training set

Unsupervised learning (clustering)

The class labels of training data is unknown

Given a set of measurements observations etc with the aim of establishing the

existence of classes or clusters in the data

32

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

33

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

Typical applications

Creditloan approval

Medical diagnosis if a tumor is cancerous or benign

Fraud detection if a transaction is fraudulent

Web page categorization which category it is

34

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

35

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

36

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

Note If the test set is used to selectrefine models it is called validation (test) set or development test set

37

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

Classifier(Model)

Sheet1

38

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo

Classifier(Model)

Sheet1

39

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Sheet1

40

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

NewUnseen Data

(Jeff Professor 4)

Tenured

Sheet1

41

Classification Basic Concepts

Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

42

Decision Tree Induction An Example

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis)

Sheet1

43

Decision Tree Induction An Example

age

overcast

student credit rating

lt=30 gt40

no yes yes

yes

3140

fairexcellentyesno

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis) Resulting tree

Sheet1

44

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain)

45

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain) Conditions for stopping partitioning

All samples for a given node belong to the same class There are no remaining attributes for further partitioningmdashmajority voting is

employed for classifying the leaf There are no samples left

46

Brief Review of Entropy Entropy (Information Theory)

A measure of uncertainty associated with a random number Calculation For a discrete random variable Y taking m distinct values y1 y2 hellip ym

Interpretation Higher entropy rarr higher uncertainty Lower entropy rarr lower uncertainty

Conditional entropy

m = 2

47

Attribute Selection Measure Information Gain (ID3C45)

Select the attribute with the highest information gain Let pi be the probability that an arbitrary tuple in D belongs to class Ci

estimated by |Ci D||D| Expected information (entropy) needed to classify a tuple in D

Information needed (after using A to split D into v partitions) to classify D

Information gained by branching on attribute A

)(log)( 21

i

m

ii ppDInfo sum

=

minus=

)(||||

)(1

j

v

j

jA DInfo

DD

DInfo times=sum=

(D)InfoInfo(D)Gain(A) Aminus=

48

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

How to select the first attribute

Sheet1

49

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

Sheet1

50

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

Sheet1

51

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

Sheet1

52

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

means ldquoage lt=30rdquo has 5 out of 14 samples with 2 yesrsquoes and 3 norsquos

)32(145 I

53

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Sheet1

54

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Similarly

0480)_(1510)(0290)(

===

ratingcreditGainstudentGainincomeGain How

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Page 31: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

32

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

33

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

Typical applications

Creditloan approval

Medical diagnosis if a tumor is cancerous or benign

Fraud detection if a transaction is fraudulent

Web page categorization which category it is

34

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

35

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

36

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

Note If the test set is used to selectrefine models it is called validation (test) set or development test set

37

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

Classifier(Model)

Sheet1

38

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo

Classifier(Model)

Sheet1

39

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Sheet1

40

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

NewUnseen Data

(Jeff Professor 4)

Tenured

Sheet1

41

Classification Basic Concepts

Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

42

Decision Tree Induction An Example

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis)

Sheet1

43

Decision Tree Induction An Example

age

overcast

student credit rating

lt=30 gt40

no yes yes

yes

3140

fairexcellentyesno

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis) Resulting tree

Sheet1

44

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain)

45

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain) Conditions for stopping partitioning

All samples for a given node belong to the same class There are no remaining attributes for further partitioningmdashmajority voting is

employed for classifying the leaf There are no samples left

46

Brief Review of Entropy Entropy (Information Theory)

A measure of uncertainty associated with a random variable.
Calculation: for a discrete random variable Y taking m distinct values {y_1, y_2, …, y_m}, H(Y) = − Σ_{i=1}^{m} p_i log2(p_i), where p_i = P(Y = y_i).

Interpretation: higher entropy → higher uncertainty; lower entropy → lower uncertainty.

Conditional entropy: H(Y | X) = Σ_x P(X = x) H(Y | X = x).

[Figure: entropy of a binary variable (m = 2) as a function of p]
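As a quick numeric companion (an illustrative helper, not from the slides), entropy can be computed directly from a probability vector:

from math import log2

def entropy(probs):
    # H(Y) = -sum_i p_i log2(p_i); terms with p = 0 or p = 1 contribute 0.
    return sum(-p * log2(p) for p in probs if 0 < p < 1)

print(entropy([0.5, 0.5]))             # 1.0 -- maximum uncertainty for m = 2
print(entropy([1.0, 0.0]))             # 0   -- no uncertainty
print(f"{entropy([9/14, 5/14]):.3f}")  # 0.940 -- reused below as Info(D)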

47

Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain.
Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|.
Expected information (entropy) needed to classify a tuple in D:
Info(D) = − Σ_{i=1}^{m} p_i log2(p_i)
Information needed (after using A to split D into v partitions) to classify D:
Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)
Information gained by branching on attribute A:
Gain(A) = Info(D) − Info_A(D)
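These three formulas transcribe directly into code (an illustrative sketch; each partition D_j is represented by its vector of per-class counts):

from math import log2

def info(counts):
    # Info(D) = -sum_i p_i log2(p_i), with p_i estimated as |C_i,D| / |D|.
    total = sum(counts)
    return sum(-c / total * log2(c / total) for c in counts if c)

def info_after_split(partitions):
    # Info_A(D) = sum_j |D_j| / |D| * Info(D_j)
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(p) for p in partitions)

def gain(counts, partitions):
    # Gain(A) = Info(D) - Info_A(D)
    return info(counts) - info_after_split(partitions)

For the buys_computer data that follows, info((9, 5)) plays the role of Info(D), and each candidate attribute contributes one list of partitions.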

48

Attribute Selection: Information Gain
Class P: buys_computer = "yes"; Class N: buys_computer = "no"

age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no

How to select the first attribute?


49

Attribute Selection: Information Gain
Class P: buys_computer = "yes"; Class N: buys_computer = "no"

(training data table as above)

Info(D) = I(9, 5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940


50

Attribute Selection: Information Gain
Class P: buys_computer = "yes"; Class N: buys_computer = "no"


age pi ni I(pi, ni)
<=30 2 3 0.971
31…40 4 0 0
>40 3 2 0.971

Look at "age":


51

Attribute Selection: Information Gain
Class P: buys_computer = "yes"; Class N: buys_computer = "no"


Info_age(D) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.694


52

Attribute Selection: Information Gain
Class P: buys_computer = "yes"; Class N: buys_computer = "no"


In the expression above, the term (5/14) I(2, 3) means that "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's.

53

Attribute Selection: Information Gain
Class P: buys_computer = "yes"; Class N: buys_computer = "no"


Gain(age) = Info(D) − Info_age(D) = 0.246


54

Attribute Selection: Information Gain
Class P: buys_computer = "yes"; Class N: buys_computer = "no"


Similarly:
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
Age has the highest information gain, so it is selected as the splitting attribute. (The sketch below reproduces these numbers.)
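As a standalone numeric check of the whole walkthrough (helpers repeated from the sketch above so the block runs on its own; the (yes, no) counts per branch are read off the training table):

from math import log2

def info(counts):
    total = sum(counts)
    return sum(-c / total * log2(c / total) for c in counts if c)

def info_after_split(partitions):
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(p) for p in partitions)

D = (9, 5)  # 9 yes, 5 no in the full training set
partitions = {
    "age":           [(2, 3), (4, 0), (3, 2)],  # <=30, 31...40, >40
    "income":        [(2, 2), (4, 2), (3, 1)],  # high, medium, low
    "student":       [(6, 1), (3, 4)],          # yes, no
    "credit_rating": [(6, 2), (3, 3)],          # fair, excellent
}

print(f"Info(D)     = {info(D):.3f}")                              # 0.940
print(f"Info_age(D) = {info_after_split(partitions['age']):.3f}")  # 0.694
for name, parts in partitions.items():
    print(f"Gain({name}) = {info(D) - info_after_split(parts):.3f}")
# Gain(age) = 0.247, Gain(income) = 0.029, Gain(student) = 0.152,
# Gain(credit_rating) = 0.048 -- the slides truncate 0.2467 to 0.246
# and 0.1518 to 0.151; age still wins by a wide margin.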


age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Page 32: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

33

Prediction Problems Classification vs Numeric Prediction Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Numeric Prediction

models continuous-valued functions ie predicts unknown or missing values

Typical applications

Creditloan approval

Medical diagnosis if a tumor is cancerous or benign

Fraud detection if a transaction is fraudulent

Web page categorization which category it is

34

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

35

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

36

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

Note If the test set is used to selectrefine models it is called validation (test) set or development test set

37

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

Classifier(Model)

Sheet1

38

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo

Classifier(Model)

Sheet1

39

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Sheet1

40

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

NewUnseen Data

(Jeff Professor 4)

Tenured

Sheet1

41

Classification Basic Concepts

Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

42

Decision Tree Induction An Example

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis)

Sheet1

43

Decision Tree Induction An Example

age

overcast

student credit rating

lt=30 gt40

no yes yes

yes

3140

fairexcellentyesno

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis) Resulting tree

Sheet1

44

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain)

45

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain) Conditions for stopping partitioning

All samples for a given node belong to the same class There are no remaining attributes for further partitioningmdashmajority voting is

employed for classifying the leaf There are no samples left

46

Brief Review of Entropy Entropy (Information Theory)

A measure of uncertainty associated with a random number Calculation For a discrete random variable Y taking m distinct values y1 y2 hellip ym

Interpretation Higher entropy rarr higher uncertainty Lower entropy rarr lower uncertainty

Conditional entropy

m = 2

47

Attribute Selection Measure Information Gain (ID3C45)

Select the attribute with the highest information gain Let pi be the probability that an arbitrary tuple in D belongs to class Ci

estimated by |Ci D||D| Expected information (entropy) needed to classify a tuple in D

Information needed (after using A to split D into v partitions) to classify D

Information gained by branching on attribute A

)(log)( 21

i

m

ii ppDInfo sum

=

minus=

)(||||

)(1

j

v

j

jA DInfo

DD

DInfo times=sum=

(D)InfoInfo(D)Gain(A) Aminus=

48

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

How to select the first attribute

Sheet1

49

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

Sheet1

50

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

Sheet1

51

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

Sheet1

52

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

means ldquoage lt=30rdquo has 5 out of 14 samples with 2 yesrsquoes and 3 norsquos

)32(145 I

53

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Sheet1

54

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Similarly

0480)_(1510)(0290)(

===

ratingcreditGainstudentGainincomeGain How

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Page 33: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

34

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

35

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

36

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

Note If the test set is used to selectrefine models it is called validation (test) set or development test set

37

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

Classifier(Model)

Sheet1

38

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo

Classifier(Model)

Sheet1

39

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Sheet1

40

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

NewUnseen Data

(Jeff Professor 4)

Tenured

Sheet1

41

Classification Basic Concepts

Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

42

Decision Tree Induction An Example

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis)

Sheet1

43

Decision Tree Induction An Example

age

overcast

student credit rating

lt=30 gt40

no yes yes

yes

3140

fairexcellentyesno

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis) Resulting tree

Sheet1

44

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain)

45

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain) Conditions for stopping partitioning

All samples for a given node belong to the same class There are no remaining attributes for further partitioningmdashmajority voting is

employed for classifying the leaf There are no samples left

46

Brief Review of Entropy Entropy (Information Theory)

A measure of uncertainty associated with a random number Calculation For a discrete random variable Y taking m distinct values y1 y2 hellip ym

Interpretation Higher entropy rarr higher uncertainty Lower entropy rarr lower uncertainty

Conditional entropy

m = 2

47

Attribute Selection Measure Information Gain (ID3C45)

Select the attribute with the highest information gain Let pi be the probability that an arbitrary tuple in D belongs to class Ci

estimated by |Ci D||D| Expected information (entropy) needed to classify a tuple in D

Information needed (after using A to split D into v partitions) to classify D

Information gained by branching on attribute A

)(log)( 21

i

m

ii ppDInfo sum

=

minus=

)(||||

)(1

j

v

j

jA DInfo

DD

DInfo times=sum=

(D)InfoInfo(D)Gain(A) Aminus=

48

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

How to select the first attribute

Sheet1

49

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

Sheet1

50

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

Sheet1

51

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

Sheet1

52

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

means ldquoage lt=30rdquo has 5 out of 14 samples with 2 yesrsquoes and 3 norsquos

)32(145 I

53

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Sheet1

54

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Similarly

0480)_(1510)(0290)(

===

ratingcreditGainstudentGainincomeGain How

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Page 34: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

35

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

36

ClassificationmdashA Two-Step Process(1) Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set Model represented as classification rules decision trees or mathematical formulae

(2) Model usage for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy of test set samples that are correctly classified by the model Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable use the model to classify new data

Note If the test set is used to selectrefine models it is called validation (test) set or development test set

37

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

Classifier(Model)

Sheet1

38

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo

Classifier(Model)

Sheet1

39

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Sheet1

40

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

NewUnseen Data

(Jeff Professor 4)

Tenured

Sheet1

41

Classification Basic Concepts

Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

42

Decision Tree Induction An Example

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis)

Sheet1

43

Decision Tree Induction An Example

age

overcast

student credit rating

lt=30 gt40

no yes yes

yes

3140

fairexcellentyesno

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis) Resulting tree

Sheet1

44

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain)

45

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain) Conditions for stopping partitioning

All samples for a given node belong to the same class There are no remaining attributes for further partitioningmdashmajority voting is

employed for classifying the leaf There are no samples left

46

Brief Review of Entropy Entropy (Information Theory)

A measure of uncertainty associated with a random number Calculation For a discrete random variable Y taking m distinct values y1 y2 hellip ym

Interpretation Higher entropy rarr higher uncertainty Lower entropy rarr lower uncertainty

Conditional entropy

m = 2

47

Attribute Selection Measure Information Gain (ID3C45)

Select the attribute with the highest information gain Let pi be the probability that an arbitrary tuple in D belongs to class Ci

estimated by |Ci D||D| Expected information (entropy) needed to classify a tuple in D

Information needed (after using A to split D into v partitions) to classify D

Information gained by branching on attribute A

)(log)( 21

i

m

ii ppDInfo sum

=

minus=

)(||||

)(1

j

v

j

jA DInfo

DD

DInfo times=sum=

(D)InfoInfo(D)Gain(A) Aminus=

48

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

How to select the first attribute

Sheet1

49

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

Sheet1

50

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

Sheet1

51

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

Sheet1

52

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

means ldquoage lt=30rdquo has 5 out of 14 samples with 2 yesrsquoes and 3 norsquos

)32(145 I

53

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Sheet1

54

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Similarly

0480)_(1510)(0290)(

===

ratingcreditGainstudentGainincomeGain How

Sheet1


36

Classification—A Two-Step Process

(1) Model construction: describing a set of predetermined classes

Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute

The set of tuples used for model construction is the training set; the model is represented as classification rules, decision trees, or mathematical formulae

(2) Model usage: for classifying future or unknown objects

Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model; accuracy is the percentage of test-set samples that are correctly classified by the model. The test set is independent of the training set (otherwise overfitting occurs)

If the accuracy is acceptable, use the model to classify new data

Note: If the test set is used to select/refine models, it is called a validation (test) set or development test set
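As a concrete illustration of the two steps, here is a minimal Python sketch; it assumes scikit-learn is available and uses the iris data purely as a stand-in for a labeled tuple set:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Keep the test set independent of the training set (otherwise: overfitting).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Step (1): model construction on the training set.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step (2): model usage -- estimate accuracy on the held-out test set,
# then classify new/unseen tuples if the accuracy is acceptable.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))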

37

Step (1): Model Construction

Training Data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classification Algorithms → Classifier (Model)

38

Step (1): Model Construction

Training Data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classification Algorithms → Classifier (Model):

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'

39

Step (2): Using the Model in Prediction

Testing Data → Classifier:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

40

Step (2): Using the Model in Prediction

Testing Data → Classifier:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

New / Unseen Data: (Jeff, Professor, 4) → Tenured?
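Put together, steps (1) and (2) for this toy example amount to the following sketch (the classify helper is mine; the rule is the one induced above):

def classify(rank, years):
    # IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank == "Professor" or years > 6 else "no"

test_set = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Estimate accuracy on the independent test set.
correct = sum(classify(rank, years) == tenured
              for _, rank, years, tenured in test_set)
print(f"test accuracy: {correct}/{len(test_set)}")  # 3/4 (Merlisa is misclassified)

# If the accuracy is acceptable, classify new/unseen data:
print("Jeff:", classify("Professor", 4))  # -> yes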

41

Classification: Basic Concepts

Classification: Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy: Ensemble Methods

Summary

42

Decision Tree Induction: An Example

Training data set: Buys_computer (the 14-tuple age/income/student/credit_rating table shown above). The data set follows an example of Quinlan's ID3 (Playing Tennis).

43

Decision Tree Induction: An Example

Training data set: Buys_computer (the 14-tuple table shown above), following Quinlan's ID3 (Playing Tennis). Resulting tree:

age?
  <=30  → student?
            no  → buys_computer = no
            yes → buys_computer = yes
  31…40 → buys_computer = yes
  >40   → credit_rating?
            excellent → buys_computer = no
            fair      → buys_computer = yes

44

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm):

The tree is constructed in a top-down, recursive, divide-and-conquer manner

At the start, all the training examples are at the root

Attributes are categorical (if continuous-valued, they are discretized in advance)

Examples are partitioned recursively based on selected attributes

Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

45

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm):

The tree is constructed in a top-down, recursive, divide-and-conquer manner

At the start, all the training examples are at the root

Attributes are categorical (if continuous-valued, they are discretized in advance)

Examples are partitioned recursively based on selected attributes

Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Conditions for stopping partitioning:

All samples for a given node belong to the same class

There are no remaining attributes for further partitioning—majority voting is employed for classifying the leaf

There are no samples left
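A compact sketch of this basic algorithm, reusing the dataset D and the info()/gain() helpers from the information-gain sketch near the top of this transcript (names are mine, not the slides'):

from collections import Counter

ATTR_NAMES = ["age", "income", "student", "credit_rating"]

def build_tree(rows, attrs):
    labels = [r[-1] for r in rows]
    # Stop: all samples for this node belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no remaining attributes -- majority voting for the leaf.
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: select the test attribute with the highest information gain.
    best = max(attrs, key=lambda a: gain(rows, a))
    branches = {}
    for value in sorted({r[best] for r in rows}):
        # Branches are created only for values that occur in rows, so the
        # "no samples left" case cannot arise in this simplified sketch.
        subset = [r for r in rows if r[best] == value]
        branches[value] = build_tree(subset, [a for a in attrs if a != best])
    return (ATTR_NAMES[best], branches)

print(build_tree(D, [0, 1, 2, 3]))
# -> ('age', {'31..40': 'yes',
#             '<=30': ('student', {'no': 'no', 'yes': 'yes'}),
#             '>40': ('credit_rating', {'excellent': 'no', 'fair': 'yes'})})
# which matches the resulting tree shown above.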

46

Brief Review of Entropy

Entropy (information theory): a measure of the uncertainty associated with a random variable

Calculation: for a discrete random variable Y taking m distinct values {y1, y2, …, ym}, with pi = P(Y = yi):

H(Y) = -sum_{i=1..m} pi log2(pi)

Interpretation: higher entropy → higher uncertainty; lower entropy → lower uncertainty

Conditional entropy: H(Y|X) = sum_x P(X = x) H(Y | X = x)

(The slide's figure plots H(Y) for m = 2 as a function of p1: zero at p1 = 0 or 1, maximal at p1 = 0.5.)
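A quick numeric check of these definitions (standard formulas, not from the slides):

from math import log2

def H(ps):
    # Entropy of a discrete distribution given as a list of probabilities.
    return -sum(p * log2(p) for p in ps if p > 0)

print(H([0.5, 0.5]))  # 1.0   -- maximum uncertainty for m = 2
print(H([0.9, 0.1]))  # 0.469 -- lower uncertainty
print(H([1.0]))       # 0.0   -- no uncertainty

# Conditional entropy H(Y|X) = sum_x P(x) * H(Y|X=x), e.g. with two
# equally likely x-values whose conditional distributions differ:
print(0.5 * H([0.5, 0.5]) + 0.5 * H([0.9, 0.1]))  # 0.734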

47

Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain

Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|

Expected information (entropy) needed to classify a tuple in D:

Info(D) = -sum_{i=1..m} pi log2(pi)

Information needed (after using A to split D into v partitions) to classify D:

Info_A(D) = sum_{j=1..v} (|Dj| / |D|) × Info(Dj)

Information gained by branching on attribute A:

Gain(A) = Info(D) - Info_A(D)

48

Attribute Selection: Information Gain

Class P: buys_computer = "yes"; Class N: buys_computer = "no"
(training set: the 14-tuple buys_computer table shown above)

How to select the first attribute?

49

Attribute Selection: Information Gain

Class P: buys_computer = "yes"; Class N: buys_computer = "no"

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

The per-attribute gains are then compared as worked through at the start of this transcript; age has the highest gain (0.246) and becomes the first split.


  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Page 36: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

37

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

Classifier(Model)

Sheet1

38

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo

Classifier(Model)

Sheet1

39

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Sheet1

40

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

NewUnseen Data

(Jeff Professor 4)

Tenured

Sheet1

41

Classification Basic Concepts

Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

42

Decision Tree Induction An Example

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis)

Sheet1

43

Decision Tree Induction An Example

age

overcast

student credit rating

lt=30 gt40

no yes yes

yes

3140

fairexcellentyesno

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis) Resulting tree

Sheet1

44

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain)

45

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain) Conditions for stopping partitioning

All samples for a given node belong to the same class There are no remaining attributes for further partitioningmdashmajority voting is

employed for classifying the leaf There are no samples left

46

Brief Review of Entropy Entropy (Information Theory)

A measure of uncertainty associated with a random number Calculation For a discrete random variable Y taking m distinct values y1 y2 hellip ym

Interpretation Higher entropy rarr higher uncertainty Lower entropy rarr lower uncertainty

Conditional entropy

m = 2

47

Attribute Selection Measure Information Gain (ID3C45)

Select the attribute with the highest information gain Let pi be the probability that an arbitrary tuple in D belongs to class Ci

estimated by |Ci D||D| Expected information (entropy) needed to classify a tuple in D

Information needed (after using A to split D into v partitions) to classify D

Information gained by branching on attribute A

)(log)( 21

i

m

ii ppDInfo sum

=

minus=

)(||||

)(1

j

v

j

jA DInfo

DD

DInfo times=sum=

(D)InfoInfo(D)Gain(A) Aminus=

48

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

How to select the first attribute

Sheet1

49

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

Sheet1

50

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

Sheet1

51

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

Sheet1

52

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

means ldquoage lt=30rdquo has 5 out of 14 samples with 2 yesrsquoes and 3 norsquos

)32(145 I

53

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Sheet1

54

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Similarly

0480)_(1510)(0290)(

===

ratingcreditGainstudentGainincomeGain How

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Page 37: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

Sheet1

38

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo

Classifier(Model)

Sheet1

39

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Sheet1

40

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

NewUnseen Data

(Jeff Professor 4)

Tenured

Sheet1

41

Classification Basic Concepts

Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

42

Decision Tree Induction An Example

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis)

Sheet1

43

Decision Tree Induction An Example

age

overcast

student credit rating

lt=30 gt40

no yes yes

yes

3140

fairexcellentyesno

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis) Resulting tree

Sheet1

44

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain)

45

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain) Conditions for stopping partitioning

All samples for a given node belong to the same class There are no remaining attributes for further partitioningmdashmajority voting is

employed for classifying the leaf There are no samples left

46

Brief Review of Entropy Entropy (Information Theory)

A measure of uncertainty associated with a random number Calculation For a discrete random variable Y taking m distinct values y1 y2 hellip ym

Interpretation Higher entropy rarr higher uncertainty Lower entropy rarr lower uncertainty

Conditional entropy

m = 2

47

Attribute Selection Measure Information Gain (ID3C45)

Select the attribute with the highest information gain Let pi be the probability that an arbitrary tuple in D belongs to class Ci

estimated by |Ci D||D| Expected information (entropy) needed to classify a tuple in D

Information needed (after using A to split D into v partitions) to classify D

Information gained by branching on attribute A

)(log)( 21

i

m

ii ppDInfo sum

=

minus=

)(||||

)(1

j

v

j

jA DInfo

DD

DInfo times=sum=

(D)InfoInfo(D)Gain(A) Aminus=

48

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

How to select the first attribute

Sheet1

49

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

Sheet1

50

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

Sheet1

51

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

Sheet1

52

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

means ldquoage lt=30rdquo has 5 out of 14 samples with 2 yesrsquoes and 3 norsquos

)32(145 I

53

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Sheet1

54

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Similarly

0480)_(1510)(0290)(

===

ratingcreditGainstudentGainincomeGain How

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Page 38: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

38

Step (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo

Classifier(Model)

Sheet1

39

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Sheet1

40

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

NewUnseen Data

(Jeff Professor 4)

Tenured

Sheet1

41

Classification Basic Concepts

Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

42

Decision Tree Induction An Example

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis)

Sheet1

43

Decision Tree Induction An Example

age

overcast

student credit rating

lt=30 gt40

no yes yes

yes

3140

fairexcellentyesno

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis) Resulting tree

Sheet1

44

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain)

45

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain) Conditions for stopping partitioning

All samples for a given node belong to the same class There are no remaining attributes for further partitioningmdashmajority voting is

employed for classifying the leaf There are no samples left

46

Brief Review of Entropy Entropy (Information Theory)

A measure of uncertainty associated with a random number Calculation For a discrete random variable Y taking m distinct values y1 y2 hellip ym

Interpretation Higher entropy rarr higher uncertainty Lower entropy rarr lower uncertainty

Conditional entropy

m = 2

47

Attribute Selection Measure Information Gain (ID3C45)

Select the attribute with the highest information gain Let pi be the probability that an arbitrary tuple in D belongs to class Ci

estimated by |Ci D||D| Expected information (entropy) needed to classify a tuple in D

Information needed (after using A to split D into v partitions) to classify D

Information gained by branching on attribute A

)(log)( 21

i

m

ii ppDInfo sum

=

minus=

)(||||

)(1

j

v

j

jA DInfo

DD

DInfo times=sum=

(D)InfoInfo(D)Gain(A) Aminus=

48

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

How to select the first attribute

Sheet1

49

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

Sheet1

50

Attribute Selection: Information Gain

Look at "age":

age     p_i  n_i  I(p_i, n_i)
<=30    2    3    0.971
31…40   4    0    0
>40     3    2    0.971

51

Attribute Selection: Information Gain

Info_age(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

52

Attribute Selection: Information Gain

The term \frac{5}{14} I(2,3) means that "age <= 30" accounts for 5 out of 14 samples, with 2 yes'es and 3 no's.

53

Attribute Selection: Information Gain

Gain(age) = Info(D) - Info_age(D) = 0.940 - 0.694 = 0.246

54

Attribute Selection: Information Gain

Similarly:
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

Age has the highest information gain, so it is selected as the splitting attribute at the root.
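As a check, running the gain computation over the Buys_computer table reproduces the four numbers above. This self-contained sketch encodes each row as a tuple with the class label in the last column:

```python
import math
from collections import Counter

# (age, income, student, credit_rating, buys_computer)
rows = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31…40", "high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31…40", "low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no",  "excellent", "yes"),
    ("31…40", "high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]
attrs = ["age", "income", "student", "credit_rating"]

def info(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(col):
    by_value = {}
    for r in rows:
        by_value.setdefault(r[col], []).append(r[-1])
    info_a = sum((len(ls) / len(rows)) * info(ls) for ls in by_value.values())
    return info([r[-1] for r in rows]) - info_a

for i, name in enumerate(attrs):
    print(f"Gain({name}) = {gain(i):.3f}")
# Prints: Gain(age) = 0.246, Gain(income) = 0.029,
#         Gain(student) = 0.151, Gain(credit_rating) = 0.048
```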

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Page 39: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

Sheet1

39

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Sheet1

40

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

NewUnseen Data

(Jeff Professor 4)

Tenured

Sheet1

41

Classification Basic Concepts

Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

42

Decision Tree Induction An Example

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis)

Sheet1

43

Decision Tree Induction An Example

age

overcast

student credit rating

lt=30 gt40

no yes yes

yes

3140

fairexcellentyesno

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis) Resulting tree

Sheet1

44

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain)

45

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain) Conditions for stopping partitioning

All samples for a given node belong to the same class There are no remaining attributes for further partitioningmdashmajority voting is

employed for classifying the leaf There are no samples left

46

Brief Review of Entropy Entropy (Information Theory)

A measure of uncertainty associated with a random number Calculation For a discrete random variable Y taking m distinct values y1 y2 hellip ym

Interpretation Higher entropy rarr higher uncertainty Lower entropy rarr lower uncertainty

Conditional entropy

m = 2

47

Attribute Selection Measure Information Gain (ID3C45)

Select the attribute with the highest information gain Let pi be the probability that an arbitrary tuple in D belongs to class Ci

estimated by |Ci D||D| Expected information (entropy) needed to classify a tuple in D

Information needed (after using A to split D into v partitions) to classify D

Information gained by branching on attribute A

)(log)( 21

i

m

ii ppDInfo sum

=

minus=

)(||||

)(1

j

v

j

jA DInfo

DD

DInfo times=sum=

(D)InfoInfo(D)Gain(A) Aminus=

48

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

How to select the first attribute

Sheet1

49

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

Sheet1

50

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

Sheet1

51

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

Sheet1

52

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

means ldquoage lt=30rdquo has 5 out of 14 samples with 2 yesrsquoes and 3 norsquos

)32(145 I

53

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Sheet1

54

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Similarly

0480)_(1510)(0290)(

===

ratingcreditGainstudentGainincomeGain How

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Page 40: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

39

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Sheet1

40

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

NewUnseen Data

(Jeff Professor 4)

Tenured

Sheet1

41

Classification Basic Concepts

Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

42

Decision Tree Induction An Example

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis)

Sheet1

43

Decision Tree Induction An Example

age

overcast

student credit rating

lt=30 gt40

no yes yes

yes

3140

fairexcellentyesno

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis) Resulting tree

Sheet1

44

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain)

45

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain) Conditions for stopping partitioning

All samples for a given node belong to the same class There are no remaining attributes for further partitioningmdashmajority voting is

employed for classifying the leaf There are no samples left

46

Brief Review of Entropy Entropy (Information Theory)

A measure of uncertainty associated with a random number Calculation For a discrete random variable Y taking m distinct values y1 y2 hellip ym

Interpretation Higher entropy rarr higher uncertainty Lower entropy rarr lower uncertainty

Conditional entropy

m = 2

47

Attribute Selection Measure Information Gain (ID3C45)

Select the attribute with the highest information gain Let pi be the probability that an arbitrary tuple in D belongs to class Ci

estimated by |Ci D||D| Expected information (entropy) needed to classify a tuple in D

Information needed (after using A to split D into v partitions) to classify D

Information gained by branching on attribute A

)(log)( 21

i

m

ii ppDInfo sum

=

minus=

)(||||

)(1

j

v

j

jA DInfo

DD

DInfo times=sum=

(D)InfoInfo(D)Gain(A) Aminus=

48

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

How to select the first attribute

Sheet1

49

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

Sheet1

50

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

Sheet1

51

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

Sheet1

52

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

means ldquoage lt=30rdquo has 5 out of 14 samples with 2 yesrsquoes and 3 norsquos

)32(145 I

53

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Sheet1

54

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Similarly

0480)_(1510)(0290)(

===

ratingcreditGainstudentGainincomeGain How

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
Page 41: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

Sheet1

40

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

NewUnseen Data

(Jeff Professor 4)

Tenured

Sheet1

41

Classification Basic Concepts

Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

42

Decision Tree Induction An Example

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis)

Sheet1

43

Decision Tree Induction An Example

age

overcast

student credit rating

lt=30 gt40

no yes yes

yes

3140

fairexcellentyesno

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis) Resulting tree

Sheet1

44

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain)

45

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain) Conditions for stopping partitioning

All samples for a given node belong to the same class There are no remaining attributes for further partitioningmdashmajority voting is

employed for classifying the leaf There are no samples left

46

Brief Review of Entropy Entropy (Information Theory)

A measure of uncertainty associated with a random number Calculation For a discrete random variable Y taking m distinct values y1 y2 hellip ym

Interpretation Higher entropy rarr higher uncertainty Lower entropy rarr lower uncertainty

Conditional entropy

m = 2

47

Attribute Selection Measure Information Gain (ID3C45)

Select the attribute with the highest information gain Let pi be the probability that an arbitrary tuple in D belongs to class Ci

estimated by |Ci D||D| Expected information (entropy) needed to classify a tuple in D

Information needed (after using A to split D into v partitions) to classify D

Information gained by branching on attribute A

)(log)( 21

i

m

ii ppDInfo sum

=

minus=

)(||||

)(1

j

v

j

jA DInfo

DD

DInfo times=sum=

(D)InfoInfo(D)Gain(A) Aminus=

48

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

How to select the first attribute

Sheet1

49

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

Sheet1

50

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

Sheet1

51

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

Sheet1

52

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

means ldquoage lt=30rdquo has 5 out of 14 samples with 2 yesrsquoes and 3 norsquos

)32(145 I

53

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Sheet1

54

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Similarly

0480)_(1510)(0290)(

===

ratingcreditGainstudentGainincomeGain How

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
Page 42: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

40

Step (2) Using the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

NewUnseen Data

(Jeff Professor 4)

Tenured

Sheet1

41

Classification Basic Concepts

Classification Basic Concepts

Decision Tree Induction

Bayes Classification Methods

Model Evaluation and Selection

Techniques to Improve Classification Accuracy Ensemble Methods

Summary

42

Decision Tree Induction An Example

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis)

Sheet1

43

Decision Tree Induction An Example

age

overcast

student credit rating

lt=30 gt40

no yes yes

yes

3140

fairexcellentyesno

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

Training data set Buys_computer The data set follows an example of Quinlanrsquos

ID3 (Playing Tennis) Resulting tree

Sheet1

44

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain)

45

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain) Conditions for stopping partitioning

All samples for a given node belong to the same class There are no remaining attributes for further partitioningmdashmajority voting is

employed for classifying the leaf There are no samples left

46

Brief Review of Entropy Entropy (Information Theory)

A measure of uncertainty associated with a random number Calculation For a discrete random variable Y taking m distinct values y1 y2 hellip ym

Interpretation Higher entropy rarr higher uncertainty Lower entropy rarr lower uncertainty

Conditional entropy

m = 2

47

Attribute Selection Measure Information Gain (ID3C45)

Select the attribute with the highest information gain Let pi be the probability that an arbitrary tuple in D belongs to class Ci

estimated by |Ci D||D| Expected information (entropy) needed to classify a tuple in D

Information needed (after using A to split D into v partitions) to classify D

Information gained by branching on attribute A

)(log)( 21

i

m

ii ppDInfo sum

=

minus=

)(||||

)(1

j

v

j

jA DInfo

DD

DInfo times=sum=

(D)InfoInfo(D)Gain(A) Aminus=

48

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

How to select the first attribute

Sheet1

49

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

Sheet1

50

Attribute Selection: Information Gain

Class P: buys_computer = "yes"; Class N: buys_computer = "no" (training data as above)

Info(D) = I(9, 5) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940

Look at "age":

age    p_i  n_i  I(p_i, n_i)
<=30   2    3    0.971
31…40  4    0    0
>40    3    2    0.971

51

Attribute Selection: Information Gain

Class P: buys_computer = "yes"; Class N: buys_computer = "no" (training data as above)

Info(D) = I(9, 5) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940

Look at "age":

age    p_i  n_i  I(p_i, n_i)
<=30   2    3    0.971
31…40  4    0    0
>40    3    2    0.971

Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

52

Attribute Selection: Information Gain

Class P: buys_computer = "yes"; Class N: buys_computer = "no"

Look at "age":

age    p_i  n_i  I(p_i, n_i)
<=30   2    3    0.971
31…40  4    0    0
>40    3    2    0.971

Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

Here \frac{5}{14} I(2,3) means "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's

53

Attribute Selection: Information Gain

Class P: buys_computer = "yes"; Class N: buys_computer = "no" (training data as above)

Info(D) = I(9, 5) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940

Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

Gain(age) = Info(D) - Info_{age}(D) = 0.246

54

Attribute Selection: Information Gain

Class P: buys_computer = "yes"; Class N: buys_computer = "no" (training data as above)

Info(D) = I(9, 5) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940

Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

Gain(age) = Info(D) - Info_{age}(D) = 0.246

Similarly:
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
(How?)

Since Gain(age) is the largest of the four, age is selected as the first (root) splitting attribute.
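As a check (a sketch using the hypothetical helpers defined earlier, with the (yes, no) counts read off the training table), the same numbers can be reproduced programmatically; exact arithmetic gives 0.247 and 0.152 for age and student, while the slides' 0.246 and 0.151 come from rounding the intermediate entropies to three digits first.

    D = [9, 5]  # class counts over the whole training set: 9 yes, 5 no
    splits = {
        "age":           [[2, 3], [4, 0], [3, 2]],  # <=30, 31...40, >40
        "income":        [[2, 2], [4, 2], [3, 1]],  # high, medium, low
        "student":       [[6, 1], [3, 4]],          # yes, no
        "credit_rating": [[6, 2], [3, 3]],          # fair, excellent
    }
    for attr, parts in splits.items():
        print(attr, round(gain(D, parts), 3))
    # age 0.247, income 0.029, student 0.152, credit_rating 0.048
    # -> age has the highest information gain and becomes the root split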

lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
Page 47: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

43

Decision Tree Induction: An Example

Training data set: Buys_computer. The data set follows an example of Quinlan's ID3 (Playing Tennis).

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no

Resulting tree:

age?
|-- <=30   -> student?        no -> no;  yes -> yes
|-- 31…40  -> yes
`-- >40    -> credit_rating?  excellent -> no;  fair -> yes
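For the worked information-gain computations on the later slides, it is handy to have this table in machine-readable form. A minimal Python encoding (the names COLUMNS, ROWS, and DATA are ours, not from the slides):

    # The buys_computer training set above, one dict per tuple.
    COLUMNS = ["age", "income", "student", "credit_rating", "buys_computer"]
    ROWS = [
        ("<=30",  "high",   "no",  "fair",      "no"),
        ("<=30",  "high",   "no",  "excellent", "no"),
        ("31…40", "high",   "no",  "fair",      "yes"),
        (">40",   "medium", "no",  "fair",      "yes"),
        (">40",   "low",    "yes", "fair",      "yes"),
        (">40",   "low",    "yes", "excellent", "no"),
        ("31…40", "low",    "yes", "excellent", "yes"),
        ("<=30",  "medium", "no",  "fair",      "no"),
        ("<=30",  "low",    "yes", "fair",      "yes"),
        (">40",   "medium", "yes", "fair",      "yes"),
        ("<=30",  "medium", "yes", "excellent", "yes"),
        ("31…40", "medium", "no",  "excellent", "yes"),
        ("31…40", "high",   "yes", "fair",      "yes"),
        (">40",   "medium", "no",  "excellent", "no"),
    ]
    DATA = [dict(zip(COLUMNS, row)) for row in ROWS]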


44

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm): the tree is constructed in a top-down, recursive, divide-and-conquer manner.

At the start, all the training examples are at the root. Attributes are categorical (if continuous-valued, they are discretized in advance). Examples are partitioned recursively based on selected attributes. Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).

45

Algorithm for Decision Tree Induction (cont.)

Conditions for stopping partitioning:

All samples for a given node belong to the same class.
There are no remaining attributes for further partitioning (majority voting is then employed for classifying the leaf).
There are no samples left.

A minimal code sketch of this induction loop follows.
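The sketch below makes the recursion concrete. It is a sketch under stated assumptions, not the course's reference implementation: rows are dicts as in the DATA encoding above, and the attribute-selection heuristic is passed in as choose_attribute (information gain, defined on the next slides, is the ID3 choice):

    from collections import Counter

    def majority_class(rows, target):
        # Majority voting over the class-label column.
        return Counter(row[target] for row in rows).most_common(1)[0][0]

    def build_tree(rows, attributes, target, choose_attribute):
        labels = [row[target] for row in rows]
        if len(set(labels)) == 1:      # stop: all samples in one class
            return labels[0]
        if not attributes:             # stop: no attributes left, majority vote
            return majority_class(rows, target)
        best = choose_attribute(rows, attributes, target)  # greedy selection
        subtree = {}
        for value in sorted({row[best] for row in rows}):
            subset = [row for row in rows if row[best] == value]
            rest = [a for a in attributes if a != best]
            # The "no samples left" stop would apply to a branch with an
            # empty subset; it cannot occur here because we only branch on
            # values actually observed in rows.
            subtree[value] = build_tree(subset, rest, target, choose_attribute)
        return {best: subtree}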

46

Brief Review of Entropy

Entropy (information theory): a measure of the uncertainty associated with a random variable.

Calculation: for a discrete random variable Y taking m distinct values y1, y2, …, ym with probabilities p1, p2, …, pm:

    H(Y) = -Σ_{i=1}^{m} p_i log2(p_i)

Interpretation: higher entropy means higher uncertainty; lower entropy means lower uncertainty.

Conditional entropy: H(Y | X) = Σ_x p(x) H(Y | X = x)

[Figure: entropy of a binary variable (m = 2) as a function of p, maximized at p = 0.5.]
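A quick numeric illustration (ours, not from the slide): for m = 2, entropy peaks at 1 bit when the two outcomes are equally likely:

    import math

    def entropy(probs):
        # H(Y) = -sum_i p_i * log2(p_i); a term with p_i = 0 contributes 0.
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))    # 1.0 (maximum uncertainty for m = 2)
    print(entropy([0.9, 0.1]))    # about 0.469 (less uncertain)
    print(entropy([0.99, 0.01]))  # about 0.081 (nearly certain outcome)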

47

Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain.

Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|.

Expected information (entropy) needed to classify a tuple in D:

    Info(D) = -Σ_{i=1}^{m} p_i log2(p_i)

Information needed (after using A to split D into v partitions) to classify D:

    Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)

Information gained by branching on attribute A:

    Gain(A) = Info(D) - Info_A(D)
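These three formulas transcribe almost line for line into Python. A sketch assuming the DATA-style rows (list of dicts) introduced earlier; the helper names info, info_a, and gain are ours:

    import math
    from collections import Counter

    def info(rows, target):
        # Info(D) = -sum_i p_i log2(p_i), with p_i estimated by |C_i,D| / |D|.
        total = len(rows)
        counts = Counter(row[target] for row in rows)
        return -sum(c / total * math.log2(c / total) for c in counts.values())

    def info_a(rows, attr, target):
        # Info_A(D) = sum_j (|D_j| / |D|) * Info(D_j), one partition per value of A.
        total = len(rows)
        result = 0.0
        for value in {row[attr] for row in rows}:
            subset = [row for row in rows if row[attr] == value]
            result += len(subset) / total * info(subset, target)
        return result

    def gain(rows, attr, target):
        # Gain(A) = Info(D) - Info_A(D).
        return info(rows, target) - info_a(rows, attr, target)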

48

Attribute Selection: Information Gain

Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples), over the 14-tuple training table above.

How to select the first attribute?


49

Attribute Selection: Information Gain (cont.)

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940


50

Attribute Selection: Information Gain (cont.)

Look at "age":

age      p_i   n_i   I(p_i, n_i)
<=30     2     3     0.971
31…40    4     0     0
>40      3     2     0.971


51

Attribute Selection: Information Gain (cont.)

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694


52

Attribute Selection: Information Gain (cont.)

In Info_age(D), the term (5/14) I(2,3) means that "age <= 30" accounts for 5 of the 14 samples, with 2 yes'es and 3 no's.
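As a quick arithmetic check of that term (ours, not on the original slide): I(2,3) = -(2/5) log2(2/5) - (3/5) log2(3/5) ≈ 0.971, so (5/14) × I(2,3) ≈ 0.347.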

53

Attribute Selection: Information Gain (cont.)

Gain(age) = Info(D) - Info_age(D) = 0.940 - 0.694 = 0.246


54

Attribute Selection: Information Gain (cont.)

Similarly:

Gain(income) = 0.029,  Gain(student) = 0.151,  Gain(credit_rating) = 0.048

Age has the highest information gain of the four attributes, so it is selected as the splitting attribute at the root, which matches the resulting tree on slide 43.
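The four gains can be reproduced from the earlier sketches (assuming the DATA table plus the gain and build_tree helpers defined above; choose_by_gain is ours):

    import pprint

    for attr in COLUMNS[:-1]:
        print(attr, round(gain(DATA, attr, "buys_computer"), 3))
    # age 0.247, income 0.029, student 0.152, credit_rating 0.048
    # (the slides' 0.246 and 0.151 come from rounding/truncating the
    # intermediate values, e.g., 0.940 - 0.694 = 0.246)

    def choose_by_gain(rows, attributes, target):
        # ID3's greedy choice: the attribute with the highest information gain.
        return max(attributes, key=lambda a: gain(rows, a, target))

    pprint.pprint(build_tree(DATA, COLUMNS[:-1], "buys_computer", choose_by_gain))
    # Splits on age at the root, then student (for <=30) and
    # credit_rating (for >40), matching the tree on slide 43.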

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
Page 48: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

Sheet1

44

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain)

45

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain) Conditions for stopping partitioning

All samples for a given node belong to the same class There are no remaining attributes for further partitioningmdashmajority voting is

employed for classifying the leaf There are no samples left

46

Brief Review of Entropy Entropy (Information Theory)

A measure of uncertainty associated with a random number Calculation For a discrete random variable Y taking m distinct values y1 y2 hellip ym

Interpretation Higher entropy rarr higher uncertainty Lower entropy rarr lower uncertainty

Conditional entropy

m = 2

47

Attribute Selection Measure Information Gain (ID3C45)

Select the attribute with the highest information gain Let pi be the probability that an arbitrary tuple in D belongs to class Ci

estimated by |Ci D||D| Expected information (entropy) needed to classify a tuple in D

Information needed (after using A to split D into v partitions) to classify D

Information gained by branching on attribute A

)(log)( 21

i

m

ii ppDInfo sum

=

minus=

)(||||

)(1

j

v

j

jA DInfo

DD

DInfo times=sum=

(D)InfoInfo(D)Gain(A) Aminus=

48

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

How to select the first attribute

Sheet1

49

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

Sheet1

50

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

Sheet1

51

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

Sheet1

52

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

means ldquoage lt=30rdquo has 5 out of 14 samples with 2 yesrsquoes and 3 norsquos

)32(145 I

53

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Sheet1

54

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Similarly

0480)_(1510)(0290)(

===

ratingcreditGainstudentGainincomeGain How

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
Page 49: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

44

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain)

45

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain) Conditions for stopping partitioning

All samples for a given node belong to the same class There are no remaining attributes for further partitioningmdashmajority voting is

employed for classifying the leaf There are no samples left

46

Brief Review of Entropy Entropy (Information Theory)

A measure of uncertainty associated with a random number Calculation For a discrete random variable Y taking m distinct values y1 y2 hellip ym

Interpretation Higher entropy rarr higher uncertainty Lower entropy rarr lower uncertainty

Conditional entropy

m = 2

47

Attribute Selection Measure Information Gain (ID3C45)

Select the attribute with the highest information gain Let pi be the probability that an arbitrary tuple in D belongs to class Ci

estimated by |Ci D||D| Expected information (entropy) needed to classify a tuple in D

Information needed (after using A to split D into v partitions) to classify D

Information gained by branching on attribute A

)(log)( 21

i

m

ii ppDInfo sum

=

minus=

)(||||

)(1

j

v

j

jA DInfo

DD

DInfo times=sum=

(D)InfoInfo(D)Gain(A) Aminus=

48

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

How to select the first attribute

Sheet1

49

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

Sheet1

50

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

Sheet1

51

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

Sheet1

52

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

means ldquoage lt=30rdquo has 5 out of 14 samples with 2 yesrsquoes and 3 norsquos

)32(145 I

53

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Sheet1

54

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Similarly

0480)_(1510)(0290)(

===

ratingcreditGainstudentGainincomeGain How

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
Page 50: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

45

Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start all the training examples are at the root Attributes are categorical (if continuous-valued they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (eg

information gain) Conditions for stopping partitioning

All samples for a given node belong to the same class There are no remaining attributes for further partitioningmdashmajority voting is

employed for classifying the leaf There are no samples left

46

Brief Review of Entropy Entropy (Information Theory)

A measure of uncertainty associated with a random number Calculation For a discrete random variable Y taking m distinct values y1 y2 hellip ym

Interpretation Higher entropy rarr higher uncertainty Lower entropy rarr lower uncertainty

Conditional entropy

m = 2

47

Attribute Selection Measure Information Gain (ID3C45)

Select the attribute with the highest information gain Let pi be the probability that an arbitrary tuple in D belongs to class Ci

estimated by |Ci D||D| Expected information (entropy) needed to classify a tuple in D

Information needed (after using A to split D into v partitions) to classify D

Information gained by branching on attribute A

)(log)( 21

i

m

ii ppDInfo sum

=

minus=

)(||||

)(1

j

v

j

jA DInfo

DD

DInfo times=sum=

(D)InfoInfo(D)Gain(A) Aminus=

48

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

How to select the first attribute

Sheet1

49

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

Sheet1

50

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

Sheet1

51

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

Sheet1

52

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

means ldquoage lt=30rdquo has 5 out of 14 samples with 2 yesrsquoes and 3 norsquos

)32(145 I

53

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Sheet1

54

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Similarly

0480)_(1510)(0290)(

===

ratingcreditGainstudentGainincomeGain How

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
Page 51: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

46

Brief Review of Entropy Entropy (Information Theory)

A measure of uncertainty associated with a random number Calculation For a discrete random variable Y taking m distinct values y1 y2 hellip ym

Interpretation Higher entropy rarr higher uncertainty Lower entropy rarr lower uncertainty

Conditional entropy

m = 2

47

Attribute Selection Measure Information Gain (ID3C45)

Select the attribute with the highest information gain Let pi be the probability that an arbitrary tuple in D belongs to class Ci

estimated by |Ci D||D| Expected information (entropy) needed to classify a tuple in D

Information needed (after using A to split D into v partitions) to classify D

Information gained by branching on attribute A

)(log)( 21

i

m

ii ppDInfo sum

=

minus=

)(||||

)(1

j

v

j

jA DInfo

DD

DInfo times=sum=

(D)InfoInfo(D)Gain(A) Aminus=

48

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

How to select the first attribute

Sheet1

49

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

Sheet1

50

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

Sheet1

51

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

Sheet1

52

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

means ldquoage lt=30rdquo has 5 out of 14 samples with 2 yesrsquoes and 3 norsquos

)32(145 I

53

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Sheet1

54

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Similarly

0480)_(1510)(0290)(

===

ratingcreditGainstudentGainincomeGain How

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
47

Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain.

Let $p_i$ be the probability that an arbitrary tuple in D belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$.

Expected information (entropy) needed to classify a tuple in D:

$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$

Information needed (after using A to split D into v partitions) to classify D:

$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$

Information gained by branching on attribute A:

$Gain(A) = Info(D) - Info_A(D)$
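These three formulas translate directly into a few lines of Python (a sketch for illustration; the function names are mine, not the course's):

from math import log2

def info(counts):
    """Info(D): entropy of a node, given its per-class tuple counts, e.g. [9, 5]."""
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if c)

def info_A(partitions):
    """Info_A(D): entropy after splitting D; each partition is given as its
    per-class counts and weighted by its size, e.g. [[2, 3], [4, 0], [3, 2]]."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * info(p) for p in partitions)

def gain(counts, partitions):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(counts) - info_A(partitions)

# Splitting the buys_computer data on age:
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # 0.247 (0.246 on the slide)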

48

Attribute Selection: Information Gain

Class P: buys_computer = "yes"; Class N: buys_computer = "no" (the 14-tuple training table shown above)

How to select the first attribute?

49

Attribute Selection: Information Gain

$Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$

50

Attribute Selection: Information Gain

Look at "age":

age      pi   ni   I(pi, ni)
<=30     2    3    0.971
31…40    4    0    0
>40      3    2    0.971
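Each entry in the I(pi, ni) column is the expected-information formula applied to one branch; written out (my expansion, consistent with the slide's values):

$I(2,3) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.971$

$I(4,0) = -\frac{4}{4}\log_2\frac{4}{4} = 0$

$I(3,2) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} = 0.971$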

51

Attribute Selection: Information Gain

$Info_{age}(D) = \frac{5}{14}I(2,3) + \frac{4}{14}I(4,0) + \frac{5}{14}I(3,2) = 0.694$

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Sheet1

54

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Similarly

0480)_(1510)(0290)(

===

ratingcreditGainstudentGainincomeGain How

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
Page 60: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

Sheet1

52

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

means ldquoage lt=30rdquo has 5 out of 14 samples with 2 yesrsquoes and 3 norsquos

)32(145 I

53

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Sheet1

54

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Similarly

0480)_(1510)(0290)(

===

ratingcreditGainstudentGainincomeGain How

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
Page 61: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

52

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age pi ni I(pi ni)lt=30 2 3 097131hellip40 4 0 0gt40 3 2 0971

Look at ldquoagerdquo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

means ldquoage lt=30rdquo has 5 out of 14 samples with 2 yesrsquoes and 3 norsquos

)32(145 I

53

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Sheet1

54

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Similarly

0480)_(1510)(0290)(

===

ratingcreditGainstudentGainincomeGain How

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
Page 62: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

53

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Sheet1

54

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Similarly

0480)_(1510)(0290)(

===

ratingcreditGainstudentGainincomeGain How

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
Page 63: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

Sheet1

54

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Similarly

0480)_(1510)(0290)(

===

ratingcreditGainstudentGainincomeGain How

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
Page 64: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

54

Attribute Selection Information Gain Class P buys_computer = ldquoyesrdquo Class N buys_computer = ldquonordquo

age income student credit_rating buys_computerlt=30 high no fair nolt=30 high no excellent no31hellip40 high no fair yesgt40 medium no fair yesgt40 low yes fair yesgt40 low yes excellent no31hellip40 low yes excellent yeslt=30 medium no fair nolt=30 low yes fair yesgt40 medium yes fair yeslt=30 medium yes excellent yes31hellip40 medium no excellent yes31hellip40 high yes fair yesgt40 medium no excellent no

9400)145(log

145)

149(log

149)59()( 22 =minusminus== IDInfo

6940)23(145

)04(144)32(

145)(

=+

+=

I

IIDInfoage

2460)()()( =minus= DInfoDInfoageGain age

Similarly

0480)_(1510)(0290)(

===

ratingcreditGainstudentGainincomeGain How

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no
Page 65: CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/...Nominal—values from an unordered set, e.g., color, profession Ordinal—values from an ordered set,

Sheet1

  • CSE 5243 Intro to Data Mining
  • Chapter 3 Data Preprocessing
  • Data Transformation
  • Data Transformation
  • Normalization
  • Normalization
  • Normalization
  • Discretization
  • Data Discretization Methods
  • Simple Discretization Binning
  • Simple Discretization Binning
  • Example Binning Methods for Data Smoothing
  • Discretization by Classification amp Correlation Analysis
  • Chapter 3 Data Preprocessing
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction
  • Dimensionality Reduction Techniques
  • Principal Component Analysis (PCA)
  • Principal Components Analysis Intuition
  • Principal Component Analysis Details
  • Attribute Subset Selection
  • Heuristic Search in Attribute Selection
  • Attribute Creation (Feature Generation)
  • Summary
  • References
  • CS 412 Intro to Data Mining
  • Classification Basic Concepts
  • Supervised vs Unsupervised Learning
  • Supervised vs Unsupervised Learning
  • Prediction Problems Classification vs Numeric Prediction
  • Prediction Problems Classification vs Numeric Prediction
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • ClassificationmdashA Two-Step Process
  • Step (1) Model Construction
  • Step (1) Model Construction
  • Step (2) Using the Model in Prediction
  • Step (2) Using the Model in Prediction
  • Classification Basic Concepts
  • Decision Tree Induction An Example
  • Decision Tree Induction An Example
  • Algorithm for Decision Tree Induction
  • Algorithm for Decision Tree Induction
  • Brief Review of Entropy
  • Attribute Selection Measure Information Gain (ID3C45)
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
  • Attribute Selection Information Gain
age income student credit_rating buys_computer
lt=30 high no fair no
lt=30 high no excellent no
31hellip40 high no fair yes
gt40 medium no fair yes
gt40 low yes fair yes
gt40 low yes excellent no
31hellip40 low yes excellent yes
lt=30 medium no fair no
lt=30 low yes fair yes
gt40 medium yes fair yes
lt=30 medium yes excellent yes
31hellip40 medium no excellent yes
31hellip40 high yes fair yes
gt40 medium no excellent no