Business Intelligence Technologies – Data Mining
Lecture 4: Classification, Decision Tree
Age Income City Gender Response
30 50K New York M No
50 125K Tampa F Yes
… … … … Yes
… … … … No
28 35K Orlando M ???
… … … … ???
… … … … ???
Classification
Mapping instances onto a predefined set of classes. Examples:
Classify each customer as “Responder” versus “Non-Responder”
Classify cellular calls as “legitimate” versus “fraudulent”
Prediction
Broad definition: build a model to estimate any type of value (predictive data mining)
Narrow definition: estimate continuous values
Examples: Predict how much money a customer will spend; predict the value of a stock
Age Income City Gender Dollar Spent
30 50K New York M $150
50 125K Tampa F $400
… … … … $0
… … … … $230
28 35K Orlando M ???
… … … … ???
… … … … ???
Classification: Terminology
Inputs = Predictors = Independent Variables
Outputs = Responses = Dependent Variables
Models = Classifiers
Data points = Examples = Instances
With classification, we want to build a (classification) model to predict the outputs given the inputs.
Steps in Classification
Step 1: Building the Model: training data (input & output) is used to build the model.
Step 2: Validating the Model: test data (input & output) is used to validate the model.
Step 3: Applying/Using the Model: the validated model is applied to new data (input only) to predict the output.
Common Classification Techniques
Decision tree
Logistic regression
K-nearest neighbors
Neural network
Naïve Bayes
Decision Tree --- An Example

[Figure: a decision tree. The root node tests Balance: if Balance < 50K, the leaf is Class=NotDefault; if Balance >= 50K, an Age node is tested: if Age > 45, the leaf is Class=NotDefault; if Age <= 45, an Employed node is tested, with Employed=Yes leading to Class=NotDefault and Employed=No leading to Class=Default. The figure labels the root, internal nodes, and leaves.]

Training data:
Name  Balance  Age  Employed  Default
Mike  23,000   30   yes       no
Mary  51,100   40   yes       no
Bill  48,000   40   no        no
Jim   53,000   45   no        yes
Dave  65,000   60   no        no
Anne  30,000   35   no        no
Decision Tree Representation
A series of nested tests:
Each node represents a test on one attribute.
Nominal attributes: each branch could represent one or more values.
Numeric attributes are split into ranges, normally a binary split.
Leaves: a class assignment (e.g., Default / Not default).
[Figure: the example tree again, with root Balance (< 50K leads to Class=No; >= 50K leads to an Age test), Age (> 45 leads to Class=No; <= 45 leads to an Employed test), and Employed (Yes leads to Class=No; No leads to Class=Yes).]
The Use of a Decision Tree: Classifying New Instances
To determine the class of a new instance (e.g., Mark: age 40, unemployed, balance 88K), the instance is routed down the tree according to the values of its attributes.
At each node a test is applied to one attribute.
When a leaf is reached the instance is assigned to a class.
Mark: Yes
[Figure: Mark is routed down the tree: Balance 88K >= 50K, then Age 40 <= 45, then Employed = No, ending at the leaf Class=Yes.]
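Below is a minimal Python sketch (not from the slides) of this routing as nested tests; the thresholds follow the figure, and the function and parameter names are assumed for illustration.

```python
# A minimal sketch of the example tree as a series of nested tests:
# Balance first, then Age, then Employed.
def classify(balance, age, employed):
    if balance < 50_000:
        return "No"                          # Not default
    if age > 45:
        return "No"                          # Not default
    return "No" if employed else "Yes"       # Employed -> Not default; else Default

print(classify(balance=88_000, age=40, employed=False))  # Mark -> "Yes"
```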
Goal of Decision Tree Construction
Partition the training instances into purer sub-groups.
Pure: the instances in a sub-group mostly belong to the same class.

[Figure: the entire population is split on Balance (< 50K / >= 50K) and then on Age (< 45 / >= 45), producing sub-groups increasingly dominated by either the Default or the Not Default class.]

How to build a tree: How to split instances into purer sub-groups?
Why do we want to identify pure sub-groups?
To classify a new instance, we can determine the leaf that the instance belongs to based on its attributes.
If the leaf is very pure (e.g., all have defaulted), we can determine with greater confidence that the new instance belongs to this class (i.e., the “Default” class).
If the leaf is not very pure (e.g., a 50%/50% mixture of the two classes, Default and Not Default), our prediction for the new instance is more like a random guess.
Decision Tree Construction
A tree is constructed by recursively partitioning the examples.
With each partition the examples are split into increasingly purer sub-groups.
The key in building a tree: How to split
Building a Tree - Choosing a Split
ApplicantID  City    Children  Income  Status
1            Philly  Many      Medium  DEFAULTS
2            Philly  Many      Low     DEFAULTS
3            Philly  Few       Medium  PAYS
4            Philly  Few       High    PAYS

Try a split on the Children attribute (branches: Many / Few).
Try a split on the Income attribute (branches: Low / Medium / High).

Notice how the split on the Children attribute gives purer partitions. It is therefore chosen as the first split (and in this case the only split, because the two sub-groups are 100% pure).
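A small sketch (not from the slides) that recomputes the purity of the two candidate splits on this four-applicant table, using entropy as the impurity measure introduced later in the lecture:

```python
# Compare the two candidate splits by the weighted entropy of the
# resulting sub-groups (lower = purer).
from math import log2

def entropy(labels):
    n = len(labels)
    return sum((labels.count(c) / n) * log2(n / labels.count(c)) for c in set(labels))

applicants = [  # (Children, Income, Status)
    ("Many", "Medium", "DEFAULTS"), ("Many", "Low", "DEFAULTS"),
    ("Few", "Medium", "PAYS"), ("Few", "High", "PAYS"),
]

for attr, idx in [("Children", 0), ("Income", 1)]:
    weighted = 0.0
    for value in {row[idx] for row in applicants}:
        group = [row[2] for row in applicants if row[idx] == value]
        weighted += (len(group) / len(applicants)) * entropy(group)
    print(attr, round(weighted, 3))
# Children -> 0.0 (both sub-groups are pure); Income -> 0.5 (the Medium branch is mixed)
```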
Recursive Steps in Building a Tree: Example
STEP 1: Split Option A. Not good, as the sub-nodes are still very heterogeneous!
STEP 1: Split Option B. Better, as the purity of the sub-nodes is improving.
STEP 2: Choose Split Option B as it is the better split.
STEP 3: Try out splits on each of the sub-nodes of Split Option B. Eventually, we arrive at a full tree.
Notice how examples in a parent node are split between sub-nodes, i.e., how the training examples are partitioned into smaller and smaller subsets. Also, notice that sub-nodes are purer than parent nodes.
Recursive Steps in Building a Tree
STEP 1: Try using different attributes to split the training examples into different subsets.
STEP 2: Rank the splits. Choose the best split.
STEP 3: For each node obtained by splitting, repeat from STEP 1, until no more good splits are possible.
Note: Usually it is not possible to create leaves that are completely pure (i.e., containing one class only), as that would result in a very bushy tree which is not sufficiently general. However, it is possible to create leaves that are purer, i.e., containing predominantly one class, and we can settle for that.
Purity Measures
Many purity measures are available:
Gini (population diversity)
Entropy (information gain)
Information Gain Ratio
Chi-square Test
The most common one (from information theory) is Information Gain.
Informally: How informative is the attribute in distinguishing among instances (e.g., credit applicants) from different classes (Yes/No default)?
Information Gain
Consider the two following splits. Which one is more informative?
[Figure: one split over whether Balance exceeds 50K (branches: over 50K / less than or equal to 50K), and one split over whether the applicant is employed (branches: Employed / Unemployed).]
Information Gain
Impurity/Entropy: measures the level of impurity/chaos in a group of examples.
Information gain is defined as the decrease in impurity when a split generates purer sub-groups.
Impurity
[Figure: three groups of examples, ranging from a very impure group, to a less impure group, to a group with minimum impurity.]
When examples can belong to one of two classes, what is the worst case of impurity?
Calculating Impurity

Impurity (entropy) = - Σ_i p_i log2(p_i)

Example: a group of 30 instances, with 16 in one class and 14 in the other:
Impurity = -(16/30) log2(16/30) - (14/30) log2(14/30) = 0.997
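A minimal Python sketch of this formula (assuming the class proportions are supplied directly):

```python
# Entropy-based impurity: -sum_i p_i * log2(p_i), with 0*log2(0) treated as 0.
import math

def impurity(proportions):
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(round(impurity([16/30, 14/30]), 3))  # 0.997, matching the example above
```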
2-class Cases:
What is the impurity of a group in which all examples belong to the same class?
Impurity = -1 log2(1) - 0 log2(0) = 0 (lowest possible value: minimum impurity)
What is the impurity of a group with 50% in either class?
Impurity = -0.5 log2(0.5) - 0.5 log2(0.5) = 1 (highest possible value: maximum impurity)
Note: 0 log2(0) is defined as 0.
Calculating Information Gain
Impurity of the entire population (30 instances, with a 14 / 16 class split):
Impurity = -(14/30) log2(14/30) - (16/30) log2(16/30) = 0.997

Split on Balance (< 50K vs. >= 50K):
One sub-group has 17 instances, with a 13 / 4 class split:
Impurity = -(13/17) log2(13/17) - (4/17) log2(4/17) = 0.787
The other sub-group has 13 instances, with a 1 / 12 class split:
Impurity = -(1/13) log2(1/13) - (12/13) log2(12/13) = 0.391

(Weighted) Average Impurity of Children = (17/30)(0.787) + (13/30)(0.391) = 0.615

Information Gain = 0.997 - 0.615 = 0.382
Information Gain = Impurity (parent) – Impurity (children)
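A quick check of the numbers in the worked example above, using Python's math.log2 (the small difference from 0.382 is rounding):

```python
from math import log2

parent   = -(14/30)*log2(14/30) - (16/30)*log2(16/30)   # about 0.997
child_a  = -(13/17)*log2(13/17) - (4/17)*log2(4/17)     # about 0.787
child_b  = -(1/13)*log2(1/13)   - (12/13)*log2(12/13)   # about 0.391
weighted = (17/30)*child_a + (13/30)*child_b            # about 0.615
print(round(parent - weighted, 3))                      # information gain, about 0.381
```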
Information Gain of a Split
The weighted average of the impurity of the children nodes after a split is compared with the impurity of the parent node to see how much information is gained through the split.
A child node with fewer examples in it has a lower weight
Information Gain = Impurity (parent) – Impurity (children)
Impurity(A) =0.997Impurity(B,C) = 0.615
Gain=0.382
Gain=0.381
Age<45
Age>=45
Balance<50K
Balance>=50KEntire population
Impurity(D,E) =0.406
Information Gain
Impurity(B) = 0.787
Impurity (C)= 0.391
Impurity(D)=0 Log20 +1 log21=0
Impurity(E) =-3/7 Log23/7 -
4/7Log24/7=0.985
A
B
C
D
E
24
Which attribute to split over?
At each node, examine splits over each of the attributes.
Select the attribute for which the maximum information gain is obtained.
For a continuous attribute, also need to consider different ways of splitting (>50 or <=50; >60 or <=60).
For a categorical attribute with lots of possible values, sometimes also need to consider how to group these values (e.g., branch 1 corresponds to {A,B,E} and branch 2 corresponds to {C,D,F,G}).
Calculating the Information Gain of a Split
1. For each sub-group produced by the split, calculate the impurity/entropy of that subset.
2. Calculate the weighted impurity of the split by weighting each sub-group’s impurity by the proportion of training examples (out of the training examples in the parent node) that are in that subset.
3. Calculate the impurity of the parent node, and subtract the weighted impurity of the child nodes to obtain the information gain for the split.
Note: If impurity increases with the split then the tree is getting worse, so we wouldn't want to make the split! So, the information gain of the split needs to be positive in order to choose to split.
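A short sketch implementing these three steps as a reusable function (the helper names are assumed, not from the slides):

```python
# Information gain of a split: parent impurity minus the size-weighted
# impurity of the child nodes.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def information_gain(parent_labels, child_label_groups):
    n = len(parent_labels)
    weighted_child_impurity = sum(
        (len(child) / n) * entropy(child) for child in child_label_groups
    )
    return entropy(parent_labels) - weighted_child_impurity

# Example: the 30-instance population from the earlier slide, split on Balance.
parent = ["yes"] * 14 + ["no"] * 16
children = [["yes"] * 13 + ["no"] * 4, ["yes"] * 1 + ["no"] * 12]
print(round(information_gain(parent, children), 3))  # 0.381 (the slide rounds 0.997 - 0.615 to 0.382)
```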
Person  Hair Length  Weight  Age  Class
Homer   0”           250     36   M
Marge   10”          150     34   F
Bart    2”           90      10   M
Lisa    6”           78      8    F
Maggie  4”           20      1    F
Abe     1”           170     70   M
Selma   8”           160     41   F
Otto    10”          180     38   M
Krusty  6”           200     45   M
Comic   8”           290     38   ?
Let us try splitting on Hair Length <= 5 (yes / no):
Entropy(parent: 4F, 5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
Entropy(yes branch: 1F, 3M) = -(1/4)log2(1/4) - (3/4)log2(3/4) = 0.8113
Entropy(no branch: 3F, 2M) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.9710
Gain(Hair Length <= 5) = 0.9911 – (4/9 * 0.8113 + 5/9 * 0.9710) = 0.0911
Gain = Entropy of parent – Weighted average of entropies of the children
Let us try splitting on Weight <= 160 (yes / no):
Entropy(parent: 4F, 5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
Entropy(yes branch: 4F, 1M) = -(4/5)log2(4/5) - (1/5)log2(1/5) = 0.7219
Entropy(no branch: 0F, 4M) = -(0/4)log2(0/4) - (4/4)log2(4/4) = 0
Gain(Weight <= 160) = 0.9911 – (5/9 * 0.7219 + 4/9 * 0) = 0.5900
Let us try splitting on Age <= 40 (yes / no):
Entropy(parent: 4F, 5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
Entropy(yes branch: 3F, 3M) = -(3/6)log2(3/6) - (3/6)log2(3/6) = 1
Entropy(no branch: 1F, 2M) = -(1/3)log2(1/3) - (2/3)log2(2/3) = 0.9183
Gain(Age <= 40) = 0.9911 – (6/9 * 1 + 3/9 * 0.9183) = 0.0183
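A sketch (with the table above encoded by hand) that recomputes all three candidate gains, confirming that Weight <= 160 is the best first split:

```python
from math import log2

def entropy(labels):
    n = len(labels)
    return sum((labels.count(c) / n) * log2(n / labels.count(c)) for c in set(labels))

def gain(rows, test):
    total = [cls for _, cls in rows]
    yes = [cls for features, cls in rows if test(features)]
    no = [cls for features, cls in rows if not test(features)]
    return entropy(total) - (len(yes) / len(rows)) * entropy(yes) - (len(no) / len(rows)) * entropy(no)

# Each row: ((hair length, weight, age), class) -- values copied from the table.
data = [((0, 250, 36), "M"), ((10, 150, 34), "F"), ((2, 90, 10), "M"),
        ((6, 78, 8), "F"), ((4, 20, 1), "F"), ((1, 170, 70), "M"),
        ((8, 160, 41), "F"), ((10, 180, 38), "M"), ((6, 200, 45), "M")]

print(f"Hair Length <= 5: {gain(data, lambda r: r[0] <= 5):.4f}")    # 0.0911
print(f"Weight <= 160:    {gain(data, lambda r: r[1] <= 160):.4f}")  # 0.5900
print(f"Age <= 40:        {gain(data, lambda r: r[2] <= 40):.4f}")   # 0.0183
```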
Of the 3 features we had, Weight was the best. But while people who weigh over 160 are perfectly classified (as males), the under-160 people are not perfectly classified, so we simply continue splitting!
This time we find that we can split the under-160 group on Hair Length <= 2, and then we are done!
Aha, we got a tree: the root tests Weight <= 160, and its "yes" branch then tests Hair Length <= 2.
Note: the splitting decision is not only to choose an attribute, but also to choose the value to split on for a continuous attribute (e.g., Hair <= 5 or Hair <= 2).
Building a Tree - Stopping Criteria
You can stop building the tree when:
The impurity of all nodes is zero: the problem is that this tends to lead to bushy, highly-branching trees, often with one example at each node.
No split achieves a significant gain in purity (information gain not high enough).
Node size is too small: that is, there are fewer than a certain number of examples, or proportion of the training set, at each node.
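As a sketch of how such stopping criteria appear in practice, scikit-learn's DecisionTreeClassifier exposes them as hyperparameters (the numeric values below are arbitrary examples, not recommendations from the slides):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    criterion="entropy",          # information-gain-style impurity
    min_impurity_decrease=0.01,   # stop: a split must achieve enough gain
    min_samples_leaf=20,          # stop: node size too small
    max_depth=6,                  # extra cap on tree depth
)
# tree.fit(X_train, y_train)      # X_train, y_train: your training data
```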
Reading Rules off the Decision Tree
For each leaf in the tree, read the rule from the root to that leaf. You will arrive at a set of rules.

[Figure: a tree whose root tests Income (Low / High); the Low branch tests Debts (Low / High), the High branch tests Gender (Male / Female), and the Male branch further tests Children (Many / Few). Each leaf is labeled Responder or Non-Responder.]

IF Income=Low AND Debts=Low THEN Non-Responder
IF Income=Low AND Debts=High THEN Responder
IF Income=High AND Gender=Male AND Children=Many THEN Responder
IF Income=High AND Gender=Male AND Children=Few THEN Non-Responder
IF Income=High AND Gender=Female THEN Non-Responder
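A sketch (with assumed toy data, not the tree above) of reading the learned rules off a fitted tree programmatically, using scikit-learn's export_text; each printed root-to-leaf path corresponds to one IF ... THEN ... rule:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: columns are Income (0=Low, 1=High) and Debts (0=Low, 1=High).
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = ["Non-Responder", "Responder", "Non-Responder", "Non-Responder"]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=["Income", "Debts"]))
# Each root-to-leaf path in the printout is one rule.
```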
Estimate accuracy of the model
The known label of a test sample is compared with the classified result from the model.
Accuracy rate is the percentage of test set samples that are correctly classified by the model.
If the accuracy on testing data is acceptable, we can use the model to classify instances whose class labels are not known.
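A sketch of this accuracy estimate using scikit-learn and one of its bundled datasets (chosen here purely for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X_train, y_train)
print("Accuracy on test data:", accuracy_score(y_test, model.predict(X_test)))
```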
There are many possible splitting rules that perfectly classify the training data but will not generalize to future datasets.

Training data:
Name  Hair   Glasses  Class
Mary  Long   No       Female
Mike  Short  No       Male
Bill  Short  No       Male
Jane  Long   No       Female
Ann   Short  Yes      Female

[Figure: Tree 1 splits first on Glasses (Yes -> Female; No -> split on Hair: Short -> Male, Long -> Female). Tree 2 splits only on Hair (Short -> Male, Long -> Female).]

Testing data and predictions:
Hair   Glasses  Tree 1  Tree 2  TRUE
Short  Yes      Female  Male    Male
Short  No       Male    Male    Female
Long   No       Female  Female  Female
Short  Yes      Female  Male    Male

Error rate: Tree 1: 75%; Tree 2: 25%
Overfitting & Underfitting
Notice how the error rate on the testing data increases for overly large trees.
Overfitting: the model performs poorly on new examples (e.g., testing examples) as it is too highly tuned to the specific training examples (it picks up noise along with the patterns).
Underfitting: the model performs poorly on new examples as it is too simplistic to distinguish between them (i.e., it has not picked up the important patterns from the training examples).
[Figure: error rate versus tree size, with an underfitting region for very small trees and an overfitting region for overly large trees.]
Pruning
A decision tree is typically more accurate on its training data than on its test data. Removing branches from a tree can often improve its accuracy on a test set, so-called "reduced error pruning". The intention of pruning is to cut off branches from the tree when this improves performance on test data; this reduces overfitting and makes the tree more general.
Small is beautiful.
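The slides describe reduced error pruning; as a related sketch, scikit-learn offers cost-complexity pruning through the ccp_alpha parameter (a different pruning technique, shown here with toy data), where larger values cut away more branches:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_tr, y_tr)

print("full tree  :", full.get_n_leaves(), "leaves, test accuracy", full.score(X_te, y_te))
print("pruned tree:", pruned.get_n_leaves(), "leaves, test accuracy", pruned.score(X_te, y_te))
```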
Decision Tree Classification in a Nutshell
Decision tree: a tree structure. An internal node denotes a test on an attribute, a branch represents an outcome of the test, and leaf nodes represent class labels or class distributions.
Decision tree generation consists of two phases:
Tree construction: at the start, all the training examples are at the root; partition the examples recursively based on selected attributes.
Tree pruning: identify and remove branches that reflect noise or outliers, to avoid overfitting.
Use of a decision tree: classifying an unknown sample by testing the attribute values of the sample against the decision tree.
Strengths & Weaknesses
In practice, one of the most popular methods. Why?
Very comprehensible: the tree structure specifies the entire decision structure.
Easy for decision makers to understand the model's rationale; maps nicely to a set of business rules.
Relatively easy to implement.
Very fast to run (to classify examples) with large data sets.
Good at handling missing values: just treat "missing" as a value; it can become a good predictor.
Weaknesses:
Bad at handling continuous data, good at categorical input and output.
Continuous output: high error rate.
Continuous input: ranges may introduce bias.
Decision Tree Variations: Regression Trees
The leaves of a regression tree predict the average value for examples that reach that node.

[Figure: a regression tree predicting income. The root tests Age (<18 / 18-65 / >65); the <18 branch tests TrustFund (Yes / No), the 18-65 branch tests Professional (Yes / No), and the >65 branch tests RetirementFund (Yes / No). Each leaf gives an average income and a standard deviation, e.g., "Average Income = $100,000 p.a., Standard Deviation = $5,000" at one leaf and "Average Income = $100 p.a., Standard Deviation = $10" at another; the other leaves show $30,000 (SD $2,000), $20,000 (SD $800), $10,000 (SD $1,000), and $1,000 (SD $150).]
Decision Tree Variations: Model Trees
The leaves of a model tree specify a function that can be used to predict the value of examples that reach that node.

[Figure: the same tree structure (root Age with <18 / 18-65 / >65 branches, then TrustFund, Professional, and RetirementFund tests), but each leaf holds a formula rather than an average, e.g., "Income Per Annum = $80,000 + (Age * $2,000)", "Income Per Annum = $20,000 + (Age * $1,000)", "Income Per Annum = Age * $50", "Income Per Annum = Number Of Trust Funds * $5,000", "Income Per Annum = FundValue * 10%", and "Income Per Annum = $100 * Number of Children".]
Classification Tree vs. Regression Tree
Classification Tree:
Categorical output.
The leaves of the tree assign a class label or a probability of being in a class.
Aim: achieve nodes such that one class is predominant at each node.
Regression Tree:
Numeric output.
The leaves of the tree assign an average value (regression trees) or specify a function that can be used to compute a value (model trees) for examples that reach that node.
Aim: achieve nodes such that means between nodes vary as much as possible and the standard deviation or variance within each node is as low as possible.
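A sketch of a regression tree in scikit-learn (toy dataset assumed); each leaf predicts the mean target value of the training examples that reach it:

```python
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
print(reg.predict(X[:3]))  # each prediction is the mean training value at a leaf
```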
Different Decision Tree Algorithms
ID3, ID4, ID5, C4.0, C4.5, C5.0, ACLS, and ASSISTANT: use information gain as the splitting criterion.
CART (Classification And Regression Trees): uses the Gini diversity index as the measure of impurity when deciding on a split.
CHAID: a statistical approach that uses the Chi-squared test when deciding on the best split.
Hunt's Concept Learning System (CLS), and MINIMAX: minimizes the cost of classifying examples correctly or incorrectly.
10-fold Cross Validation
Break the data into 10 sets of size n/10. Train on 9 of the sets and test on the remaining 1. Repeat 10 times and take the mean accuracy.
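A sketch of 10-fold cross validation with scikit-learn (toy dataset assumed):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("mean accuracy over 10 folds:", scores.mean())
```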
SAS: % Captured Response
SAS: Lift Value
In SAS, %Response = percentage of responses in the top n% of ranked individuals. It should be relatively high in the top deciles, and a decreasing plotted curve indicates a good model. The lift chart captures the same information on a different scale.
Case Discussion
Fleet
1. How many input variables are used to build the tree? How many show up in the tree built? Why?
2. How can the tree built be used for segmentation?
3. How can the new campaign results help enhance the tree?
HIV
1. How does the pruning process work?
2. How does the decision tree pick up the interactions among variables?
Exercise – Decision Tree

Customer ID  Student  Credit Rating  Class: Buy PDA
1            No       Fair           No
2            No       Excellent      No
3            No       Fair           Yes
4            No       Fair           Yes
5            Yes      Fair           Yes
6            Yes      Excellent      No
7            Yes      Excellent      Yes
8            No       Excellent      No

Which attribute to split on first?

Hint: log2(2/3) = -0.585, log2(1/3) = -1.585, log2(1/2) = -1, log2(3/5) = -0.737, log2(2/5) = -1.322, log2(1/4) = -2, log2(3/4) = -0.415