Business Intelligence Technologies – Data Mining
Lecture 4: Classification, Decision Tree
Age Income City Gender Response
30 50K New York M No
50 125K Tampa F Yes
… … … … Yes
… … … … No
28 35K Orlando M ???
… … … … ???
… … … … ???
Classification
Mapping instances onto a predefined set of classes. Examples:
Classify each customer as “Responder” versus “Non-Responder”
Classify cellular calls as “legitimate” versus “fraudulent”
Prediction
Broad definition: build a model to estimate any type of value (predictive data mining)
Narrow definition: estimate continuous values
Examples: Predict how much money a customer will spend; predict the value of a stock
Age Income City Gender Dollar Spent
30 50K New York M $150
50 125K Tampa F $400
… … … … $0
… … … … $230
28 35K Orlando M ???
… … … … ???
… … … … ???
Classification: Terminology
Inputs = Predictors = Independent Variables
Outputs = Responses = Dependent Variables
Models = Classifiers
Data points = Examples = Instances
With classification, we want to build a (classification) model to predict the outputs given the inputs.
Steps in Classification
Step 1: Building the Model: training data (input & output) is used to build the model.
Step 2: Validating the Model: test data (input & output) is used to validate the model.
Step 3: Applying/Using the Model: the validated model is applied to new data (input only) to predict the output.
Common Classification Techniques
Decision tree
Logistic regression
K-nearest neighbors
Neural network
Naïve Bayes
Decision Tree --- An Example

[Figure: a decision tree. The root node tests Balance: if Balance < 50K, the leaf is Class=NotDefault; if Balance >= 50K, an Age node is tested: if Age > 45, the leaf is Class=NotDefault; if Age <= 45, an Employed node is tested, with Employed=Yes leading to Class=NotDefault and Employed=No leading to Class=Default. The figure labels the root, internal nodes, and leaves.]

Training data:
Name  Balance  Age  Employed  Default
Mike  23,000   30   yes       no
Mary  51,100   40   yes       no
Bill  48,000   40   no        no
Jim   53,000   45   no        yes
Dave  65,000   60   no        no
Anne  30,000   35   no        no
Decision Tree Representation
A series of nested tests:
Each node represents a test on one attribute.
Nominal attributes: each branch could represent one or more values.
Numeric attributes are split into ranges, normally a binary split.
Leaves: a class assignment (e.g., Default / Not default).
[Figure: the example tree again, with root Balance (< 50K leads to Class=No; >= 50K leads to an Age test), Age (> 45 leads to Class=No; <= 45 leads to an Employed test), and Employed (Yes leads to Class=No; No leads to Class=Yes).]
The Use of a Decision Tree: Classifying New Instances
To determine the class of a new instance (e.g., Mark: age 40, unemployed, balance 88K), the instance is routed down the tree according to the values of its attributes.
At each node a test is applied to one attribute.
When a leaf is reached the instance is assigned to a class.
Mark: Yes
[Figure: Mark is routed down the tree: Balance 88K >= 50K, then Age 40 <= 45, then Employed = No, ending at the leaf Class=Yes.]
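Below is a minimal Python sketch (not from the slides) of this routing as nested tests; the thresholds follow the figure, and the function and parameter names are assumed for illustration.

```python
# A minimal sketch of the example tree as a series of nested tests:
# Balance first, then Age, then Employed.
def classify(balance, age, employed):
    if balance < 50_000:
        return "No"                          # Not default
    if age > 45:
        return "No"                          # Not default
    return "No" if employed else "Yes"       # Employed -> Not default; else Default

print(classify(balance=88_000, age=40, employed=False))  # Mark -> "Yes"
```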
Goal of Decision Tree Construction
Partition the training instances into purer sub-groups.
Pure: the instances in a sub-group mostly belong to the same class.

[Figure: the entire population is split on Balance (< 50K / >= 50K) and then on Age (< 45 / >= 45), producing sub-groups increasingly dominated by either the Default or the Not Default class.]

How to build a tree: How to split instances into purer sub-groups?
Why do we want to identify pure sub-groups?
To classify a new instance, we can determine the leaf that the instance belongs to based on its attributes.
If the leaf is very pure (e.g., all have defaulted), we can determine with greater confidence that the new instance belongs to this class (i.e., the “Default” class).
If the leaf is not very pure (e.g., a 50%/50% mixture of the two classes, Default and Not Default), our prediction for the new instance is more like a random guess.
Decision Tree Construction
A tree is constructed by recursively partitioning the examples.
With each partition the examples are split into increasingly purer sub-groups.
The key in building a tree: How to split
Building a Tree - Choosing a Split
ApplicantID  City    Children  Income  Status
1            Philly  Many      Medium  DEFAULTS
2            Philly  Many      Low     DEFAULTS
3            Philly  Few       Medium  PAYS
4            Philly  Few       High    PAYS

Try a split on the Children attribute (branches: Many / Few).
Try a split on the Income attribute (branches: Low / Medium / High).

Notice how the split on the Children attribute gives purer partitions. It is therefore chosen as the first split (and in this case the only split, because the two sub-groups are 100% pure).
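A small sketch (not from the slides) that recomputes the purity of the two candidate splits on this four-applicant table, using entropy as the impurity measure introduced later in the lecture:

```python
# Compare the two candidate splits by the weighted entropy of the
# resulting sub-groups (lower = purer).
from math import log2

def entropy(labels):
    n = len(labels)
    return sum((labels.count(c) / n) * log2(n / labels.count(c)) for c in set(labels))

applicants = [  # (Children, Income, Status)
    ("Many", "Medium", "DEFAULTS"), ("Many", "Low", "DEFAULTS"),
    ("Few", "Medium", "PAYS"), ("Few", "High", "PAYS"),
]

for attr, idx in [("Children", 0), ("Income", 1)]:
    weighted = 0.0
    for value in {row[idx] for row in applicants}:
        group = [row[2] for row in applicants if row[idx] == value]
        weighted += (len(group) / len(applicants)) * entropy(group)
    print(attr, round(weighted, 3))
# Children -> 0.0 (both sub-groups are pure); Income -> 0.5 (the Medium branch is mixed)
```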
Recursive Steps in Building a Tree: Example
STEP 1: Split Option A. Not good, as the sub-nodes are still very heterogeneous!
STEP 1: Split Option B. Better, as the purity of the sub-nodes is improving.
STEP 2: Choose Split Option B as it is the better split.
STEP 3: Try out splits on each of the sub-nodes of Split Option B. Eventually, we arrive at a full tree.
Notice how examples in a parent node are split between sub-nodes, i.e., how the training examples are partitioned into smaller and smaller subsets. Also, notice that sub-nodes are purer than parent nodes.
Recursive Steps in Building a Tree
STEP 1: Try using different attributes to split the training examples into different subsets.
STEP 2: Rank the splits. Choose the best split.
STEP 3: For each node obtained by splitting, repeat from STEP 1, until no more good splits are possible.
Note: Usually it is not possible to create leaves that are completely pure (i.e., containing one class only), as that would result in a very bushy tree which is not sufficiently general. However, it is possible to create leaves that are purer, i.e., containing predominantly one class, and we can settle for that.
Purity Measures
Many purity measures are available:
Gini (population diversity)
Entropy (information gain)
Information Gain Ratio
Chi-square Test
The most common one (from information theory) is Information Gain.
Informally: How informative is the attribute in distinguishing among instances (e.g., credit applicants) from different classes (Yes/No default)?
Information Gain
Consider the two following splits. Which one is more informative?
[Figure: one split over whether Balance exceeds 50K (branches: over 50K / less than or equal to 50K), and one split over whether the applicant is employed (branches: Employed / Unemployed).]
Information Gain
Impurity/Entropy: measures the level of impurity/chaos in a group of examples.
Information gain is defined as the decrease in impurity when a split generates purer sub-groups.
Impurity
[Figure: three groups of examples, ranging from a very impure group, to a less impure group, to a group with minimum impurity.]
When examples can belong to one of two classes, what is the worst case of impurity?
Calculating Impurity

Impurity (entropy) = - Σ_i p_i log2(p_i)

Example: a group of 30 instances, with 16 in one class and 14 in the other:
Impurity = -(16/30) log2(16/30) - (14/30) log2(14/30) = 0.997
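A minimal Python sketch of this formula (assuming the class proportions are supplied directly):

```python
# Entropy-based impurity: -sum_i p_i * log2(p_i), with 0*log2(0) treated as 0.
import math

def impurity(proportions):
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(round(impurity([16/30, 14/30]), 3))  # 0.997, matching the example above
```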
2-class Cases:
What is the impurity of a group in which all examples belong to the same class?
Impurity = -1 log2(1) - 0 log2(0) = 0 (lowest possible value: minimum impurity)
What is the impurity of a group with 50% in either class?
Impurity = -0.5 log2(0.5) - 0.5 log2(0.5) = 1 (highest possible value: maximum impurity)
Note: 0 log2(0) is defined as 0.
Calculating Information Gain
Impurity of the entire population (30 instances, with a 14 / 16 class split):
Impurity = -(14/30) log2(14/30) - (16/30) log2(16/30) = 0.997

Split on Balance (< 50K vs. >= 50K):
One sub-group has 17 instances, with a 13 / 4 class split:
Impurity = -(13/17) log2(13/17) - (4/17) log2(4/17) = 0.787
The other sub-group has 13 instances, with a 1 / 12 class split:
Impurity = -(1/13) log2(1/13) - (12/13) log2(12/13) = 0.391

(Weighted) Average Impurity of Children = (17/30)(0.787) + (13/30)(0.391) = 0.615

Information Gain = 0.997 - 0.615 = 0.382
Information Gain = Impurity (parent) – Impurity (children)
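A quick check of the numbers in the worked example above, using Python's math.log2 (the small difference from 0.382 is rounding):

```python
from math import log2

parent   = -(14/30)*log2(14/30) - (16/30)*log2(16/30)   # about 0.997
child_a  = -(13/17)*log2(13/17) - (4/17)*log2(4/17)     # about 0.787
child_b  = -(1/13)*log2(1/13)   - (12/13)*log2(12/13)   # about 0.391
weighted = (17/30)*child_a + (13/30)*child_b            # about 0.615
print(round(parent - weighted, 3))                      # information gain, about 0.381
```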
Information Gain of a Split
The weighted average of the impurity of the children nodes after a split is compared with the impurity of the parent node to see how much information is gained through the split.
A child node with fewer examples in it has a lower weight
Information Gain = Impurity (parent) – Impurity (children)
Impurity(A) =0.997Impurity(B,C) = 0.615
Gain=0.382
Gain=0.381
Age<45
Age>=45
Balance<50K
Balance>=50KEntire population
Impurity(D,E) =0.406
Information Gain
Impurity(B) = 0.787
Impurity (C)= 0.391
Impurity(D)=0 Log20 +1 log21=0
Impurity(E) =-3/7 Log23/7 -
4/7Log24/7=0.985
A
B
C
D
E
24
Which attribute to split over?
At each node, examine splits over each of the attributes.
Select the attribute for which the maximum information gain is obtained.
For a continuous attribute, also need to consider different ways of splitting (>50 or <=50; >60 or <=60).
For a categorical attribute with lots of possible values, sometimes also need to consider how to group these values (e.g., branch 1 corresponds to {A,B,E} and branch 2 corresponds to {C,D,F,G}).
Calculating the Information Gain of a Split
1. For each sub-group produced by the split, calculate the impurity/entropy of that subset.
2. Calculate the weighted impurity of the split by weighting each sub-group’s impurity by the proportion of training examples (out of the training examples in the parent node) that are in that subset.
3. Calculate the impurity of the parent node, and subtract the weighted impurity of the child nodes to obtain the information gain for the split.
Note: If impurity increases with the split then the tree is getting worse, so we wouldn't want to make the split! So, the information gain of the split needs to be positive in order to choose to split.
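A short sketch implementing these three steps as a reusable function (the helper names are assumed, not from the slides):

```python
# Information gain of a split: parent impurity minus the size-weighted
# impurity of the child nodes.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def information_gain(parent_labels, child_label_groups):
    n = len(parent_labels)
    weighted_child_impurity = sum(
        (len(child) / n) * entropy(child) for child in child_label_groups
    )
    return entropy(parent_labels) - weighted_child_impurity

# Example: the 30-instance population from the earlier slide, split on Balance.
parent = ["yes"] * 14 + ["no"] * 16
children = [["yes"] * 13 + ["no"] * 4, ["yes"] * 1 + ["no"] * 12]
print(round(information_gain(parent, children), 3))  # 0.381 (the slide rounds 0.997 - 0.615 to 0.382)
```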
Person  Hair Length  Weight  Age  Class
Homer   0”           250     36   M
Marge   10”          150     34   F
Bart    2”           90      10   M
Lisa    6”           78      8    F
Maggie  4”           20      1    F
Abe     1”           170     70   M
Selma   8”           160     41   F
Otto    10”          180     38   M
Krusty  6”           200     45   M
Comic   8”           290     38   ?
Let us try splitting on Hair Length <= 5 (yes / no):
Entropy(parent: 4F, 5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
Entropy(yes branch: 1F, 3M) = -(1/4)log2(1/4) - (3/4)log2(3/4) = 0.8113
Entropy(no branch: 3F, 2M) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.9710
Gain(Hair Length <= 5) = 0.9911 – (4/9 * 0.8113 + 5/9 * 0.9710) = 0.0911
Gain = Entropy of parent – Weighted average of entropies of the children
Let us try splitting on Weight <= 160 (yes / no):
Entropy(parent: 4F, 5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
Entropy(yes branch: 4F, 1M) = -(4/5)log2(4/5) - (1/5)log2(1/5) = 0.7219
Entropy(no branch: 0F, 4M) = -(0/4)log2(0/4) - (4/4)log2(4/4) = 0
Gain(Weight <= 160) = 0.9911 – (5/9 * 0.7219 + 4/9 * 0) = 0.5900
Let us try splitting on Age <= 40 (yes / no):
Entropy(parent: 4F, 5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
Entropy(yes branch: 3F, 3M) = -(3/6)log2(3/6) - (3/6)log2(3/6) = 1
Entropy(no branch: 1F, 2M) = -(1/3)log2(1/3) - (2/3)log2(2/3) = 0.9183
Gain(Age <= 40) = 0.9911 – (6/9 * 1 + 3/9 * 0.9183) = 0.0183
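A sketch (with the table above encoded by hand) that recomputes all three candidate gains, confirming that Weight <= 160 is the best first split:

```python
from math import log2

def entropy(labels):
    n = len(labels)
    return sum((labels.count(c) / n) * log2(n / labels.count(c)) for c in set(labels))

def gain(rows, test):
    total = [cls for _, cls in rows]
    yes = [cls for features, cls in rows if test(features)]
    no = [cls for features, cls in rows if not test(features)]
    return entropy(total) - (len(yes) / len(rows)) * entropy(yes) - (len(no) / len(rows)) * entropy(no)

# Each row: ((hair length, weight, age), class) -- values copied from the table.
data = [((0, 250, 36), "M"), ((10, 150, 34), "F"), ((2, 90, 10), "M"),
        ((6, 78, 8), "F"), ((4, 20, 1), "F"), ((1, 170, 70), "M"),
        ((8, 160, 41), "F"), ((10, 180, 38), "M"), ((6, 200, 45), "M")]

print(f"Hair Length <= 5: {gain(data, lambda r: r[0] <= 5):.4f}")    # 0.0911
print(f"Weight <= 160:    {gain(data, lambda r: r[1] <= 160):.4f}")  # 0.5900
print(f"Age <= 40:        {gain(data, lambda r: r[2] <= 40):.4f}")   # 0.0183
```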
Of the 3 features we had, Weight was the best. But while people who weigh over 160 are perfectly classified (as males), the under-160 people are not perfectly classified, so we simply continue splitting!
This time we find that we can split the under-160 group on Hair Length <= 2, and then we are done!
Aha, we got a tree: the root tests Weight <= 160, and its "yes" branch then tests Hair Length <= 2.
Note: the splitting decision is not only to choose an attribute, but also to choose the value to split on for a continuous attribute (e.g., Hair <= 5 or Hair <= 2).
Building a Tree - Stopping Criteria
You can stop building the tree when:
The impurity of all nodes is zero: the problem is that this tends to lead to bushy, highly-branching trees, often with one example at each node.
No split achieves a significant gain in purity (information gain not high enough).
Node size is too small: that is, there are fewer than a certain number of examples, or proportion of the training set, at each node.
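As a sketch of how such stopping criteria appear in practice, scikit-learn's DecisionTreeClassifier exposes them as hyperparameters (the numeric values below are arbitrary examples, not recommendations from the slides):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    criterion="entropy",          # information-gain-style impurity
    min_impurity_decrease=0.01,   # stop: a split must achieve enough gain
    min_samples_leaf=20,          # stop: node size too small
    max_depth=6,                  # extra cap on tree depth
)
# tree.fit(X_train, y_train)      # X_train, y_train: your training data
```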
Reading Rules off the Decision Tree
For each leaf in the tree, read the rule from the root to that leaf. You will arrive at a set of rules.

[Figure: a tree whose root tests Income (Low / High); the Low branch tests Debts (Low / High), the High branch tests Gender (Male / Female), and the Male branch further tests Children (Many / Few). Each leaf is labeled Responder or Non-Responder.]

IF Income=Low AND Debts=Low THEN Non-Responder
IF Income=Low AND Debts=High THEN Responder
IF Income=High AND Gender=Male AND Children=Many THEN Responder
IF Income=High AND Gender=Male AND Children=Few THEN Non-Responder
IF Income=High AND Gender=Female THEN Non-Responder
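A sketch (with assumed toy data, not the tree above) of reading the learned rules off a fitted tree programmatically, using scikit-learn's export_text; each printed root-to-leaf path corresponds to one IF ... THEN ... rule:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: columns are Income (0=Low, 1=High) and Debts (0=Low, 1=High).
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = ["Non-Responder", "Responder", "Non-Responder", "Non-Responder"]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=["Income", "Debts"]))
# Each root-to-leaf path in the printout is one rule.
```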
Estimate accuracy of the model
The known label of a test sample is compared with the classified result from the model.
Accuracy rate is the percentage of test set samples that are correctly classified by the model.
If the accuracy on testing data is acceptable, we can use the model to classify instances whose class labels are not known.
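A sketch of this accuracy estimate using scikit-learn and one of its bundled datasets (chosen here purely for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X_train, y_train)
print("Accuracy on test data:", accuracy_score(y_test, model.predict(X_test)))
```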
There are many possible splitting rules that perfectly classify the training data but will not generalize to future datasets.

Training data:
Name  Hair   Glasses  Class
Mary  Long   No       Female
Mike  Short  No       Male
Bill  Short  No       Male
Jane  Long   No       Female
Ann   Short  Yes      Female

[Figure: Tree 1 splits first on Glasses (Yes -> Female; No -> split on Hair: Short -> Male, Long -> Female). Tree 2 splits only on Hair (Short -> Male, Long -> Female).]

Testing data and predictions:
Hair   Glasses  Tree 1  Tree 2  TRUE
Short  Yes      Female  Male    Male
Short  No       Male    Male    Female
Long   No       Female  Female  Female
Short  Yes      Female  Male    Male

Error rate: Tree 1: 75%; Tree 2: 25%
Overfitting & Underfitting
Notice how the error rate on the testing data increases for overly large trees.
Overfitting: the model performs poorly on new examples (e.g., testing examples) as it is too highly tuned to the specific training examples (it picks up noise along with the patterns).
Underfitting: the model performs poorly on new examples as it is too simplistic to distinguish between them (i.e., it has not picked up the important patterns from the training examples).
[Figure: error rate versus tree size, with an underfitting region for very small trees and an overfitting region for overly large trees.]
Pruning
A decision tree is typically more accurate on its training data than on its test data. Removing branches from a tree can often improve its accuracy on a test set, so-called "reduced error pruning". The intention of pruning is to cut off branches from the tree when this improves performance on test data; this reduces overfitting and makes the tree more general.
Small is beautiful.
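The slides describe reduced error pruning; as a related sketch, scikit-learn offers cost-complexity pruning through the ccp_alpha parameter (a different pruning technique, shown here with toy data), where larger values cut away more branches:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_tr, y_tr)

print("full tree  :", full.get_n_leaves(), "leaves, test accuracy", full.score(X_te, y_te))
print("pruned tree:", pruned.get_n_leaves(), "leaves, test accuracy", pruned.score(X_te, y_te))
```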
Decision Tree Classification in a Nutshell
Decision tree: a tree structure. An internal node denotes a test on an attribute, a branch represents an outcome of the test, and leaf nodes represent class labels or class distributions.
Decision tree generation consists of two phases:
Tree construction: at the start, all the training examples are at the root; partition the examples recursively based on selected attributes.
Tree pruning: identify and remove branches that reflect noise or outliers, to avoid overfitting.
Use of a decision tree: classifying an unknown sample by testing the attribute values of the sample against the decision tree.
Strengths & Weaknesses
In practice, one of the most popular methods. Why?
Very comprehensible: the tree structure specifies the entire decision structure.
Easy for decision makers to understand the model's rationale; maps nicely to a set of business rules.
Relatively easy to implement.
Very fast to run (to classify examples) with large data sets.
Good at handling missing values: just treat "missing" as a value; it can become a good predictor.
Weaknesses:
Bad at handling continuous data, good at categorical input and output.
Continuous output: high error rate.
Continuous input: ranges may introduce bias.
Decision Tree Variations: Regression Trees
The leaves of a regression tree predict the average value for examples that reach that node.

[Figure: a regression tree predicting income. The root tests Age (<18 / 18-65 / >65); the <18 branch tests TrustFund (Yes / No), the 18-65 branch tests Professional (Yes / No), and the >65 branch tests RetirementFund (Yes / No). Each leaf gives an average income and a standard deviation, e.g., "Average Income = $100,000 p.a., Standard Deviation = $5,000" at one leaf and "Average Income = $100 p.a., Standard Deviation = $10" at another; the other leaves show $30,000 (SD $2,000), $20,000 (SD $800), $10,000 (SD $1,000), and $1,000 (SD $150).]
Decision Tree Variations: Model Trees
The leaves of a model tree specify a function that can be used to predict the value of examples that reach that node.

[Figure: the same tree structure (root Age with <18 / 18-65 / >65 branches, then TrustFund, Professional, and RetirementFund tests), but each leaf holds a formula rather than an average, e.g., "Income Per Annum = $80,000 + (Age * $2,000)", "Income Per Annum = $20,000 + (Age * $1,000)", "Income Per Annum = Age * $50", "Income Per Annum = Number Of Trust Funds * $5,000", "Income Per Annum = FundValue * 10%", and "Income Per Annum = $100 * Number of Children".]
Classification Tree vs. Regression Tree
Classification Tree:
Categorical output.
The leaves of the tree assign a class label or a probability of being in a class.
Aim: achieve nodes such that one class is predominant at each node.
Regression Tree:
Numeric output.
The leaves of the tree assign an average value (regression trees) or specify a function that can be used to compute a value (model trees) for examples that reach that node.
Aim: achieve nodes such that means between nodes vary as much as possible and the standard deviation or variance within each node is as low as possible.
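A sketch of a regression tree in scikit-learn (toy dataset assumed); each leaf predicts the mean target value of the training examples that reach it:

```python
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
print(reg.predict(X[:3]))  # each prediction is the mean training value at a leaf
```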
Different Decision Tree Algorithms
ID3, ID4, ID5, C4.0, C4.5, C5.0, ACLS, and ASSISTANT: use information gain as the splitting criterion.
CART (Classification And Regression Trees): uses the Gini diversity index as the measure of impurity when deciding on a split.
CHAID: a statistical approach that uses the Chi-squared test when deciding on the best split.
Hunt's Concept Learning System (CLS), and MINIMAX: minimizes the cost of classifying examples correctly or incorrectly.
10-fold Cross Validation
Break the data into 10 sets of size n/10. Train on 9 of the sets and test on the remaining 1. Repeat 10 times and take the mean accuracy.
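A sketch of 10-fold cross validation with scikit-learn (toy dataset assumed):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("mean accuracy over 10 folds:", scores.mean())
```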
SAS: % Captured Response
SAS: Lift Value
In SAS, %Response = percentage of responses in the top n% of ranked individuals. It should be relatively high in the top deciles, and a decreasing plotted curve indicates a good model. The lift chart captures the same information on a different scale.
Case Discussion
Fleet
1. How many input variables are used to build the tree? How many show up in the tree built? Why?
2. How can the tree built be used for segmentation?
3. How can the new campaign results help enhance the tree?
HIV
1. How does the pruning process work?
2. How does the decision tree pick up the interactions among variables?
Exercise – Decision Tree

Customer ID  Student  Credit Rating  Class: Buy PDA
1            No       Fair           No
2            No       Excellent      No
3            No       Fair           Yes
4            No       Fair           Yes
5            Yes      Fair           Yes
6            Yes      Excellent      No
7            Yes      Excellent      Yes
8            No       Excellent      No

Which attribute to split on first?

Hint: log2(2/3) = -0.585, log2(1/3) = -1.585, log2(1/2) = -1, log2(3/5) = -0.737, log2(2/5) = -1.322, log2(1/4) = -2, log2(3/4) = -0.415