CHAPTER 9:
Decision Trees
Slides corrected and new slides added by Ch. Eick
Lecture Notes for E. Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Tree Uses Nodes and Leaves
Divide and Conquer
Internal decision nodes
  Univariate: uses a single attribute, x_i
    Numeric x_i: binary split: x_i > w_m
    Discrete x_i: n-way split for the n possible values
  Multivariate: uses all attributes, x
Leaves
  Classification: class labels, or proportions
  Regression: numeric; the average of r, or a local fit
Learning is greedy; find the best split recursively (Breiman et al., 1984; Quinlan, 1986, 1993) without backtracking (decisions taken are final).
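As a concrete picture of the structures just described, here is a minimal Python sketch of a univariate tree; the Node class and all names are illustrative, not from the slides:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    attribute: Optional[int] = None    # index of the attribute x_i tested at this node
    threshold: Optional[float] = None  # numeric split: take the "yes" branch if x_i > threshold
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    label: object = None               # set only at leaves (class label, or a number for regression)

def predict(node: Node, x) -> object:
    """Route an instance down the tree: each internal node applies its
    univariate test; the leaf it lands in supplies the prediction."""
    while node.label is None:
        node = node.yes if x[node.attribute] > node.threshold else node.no
    return node.label

# Example: the one-split tree (x_0 > 5 ? "pos" : "neg")
root = Node(attribute=0, threshold=5.0, yes=Node(label="pos"), no=Node(label="neg"))
print(predict(root, [7.2]))  # -> "pos"
```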
Side Discussion: “Greedy Algorithms”
Fast, and therefore attractive for solving NP-hard and other problems of high complexity. Later decisions are made in the context of decisions selected earlier, dramatically reducing the size of the search space.
They do not backtrack: if they make a bad decision (based on local criteria), they never revise it.
They are not guaranteed to find the optimal solution, and can sometimes be deceived into finding really bad solutions.
In spite of the above, many successful and popular algorithms in Computer Science are greedy algorithms.
Greedy algorithms are particularly popular in AI and Operations Research.
Popular greedy algorithms: decision tree induction, …
Classification Trees (ID3, CART, C4.5)
For node m, N_m instances reach m, and N_m^i of them belong to class C_i:
$$\hat{P}(C_i \mid x, m) \equiv p_m^i = \frac{N_m^i}{N_m}$$
Node m is pure if p_m^i is 0 or 1.
Measure of impurity is entropy:
$$\mathcal{I}_m = -\sum_{i=1}^{K} p_m^i \log_2 p_m^i$$
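The impurity measure above can be computed directly from the class counts; a minimal sketch (function name ours):

```python
import math

def entropy(counts):
    """I_m = -sum_i p_m^i log2 p_m^i, with p_m^i = N_m^i / N_m."""
    n_m = sum(counts)
    return sum(c / n_m * math.log2(n_m / c) for c in counts if c > 0)

print(entropy([10, 0]))  # 0.0: a pure node (p = 0 or 1) becomes a leaf
print(entropy([5, 5]))   # 1.0: maximally impure two-class node
```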
Best Split
If node m is pure, generate a leaf and stop; otherwise split and continue recursively.
Impurity after split: N_mj of the N_m instances take branch j, and N_mj^i of them belong to C_i:
$$\hat{P}(C_i \mid x, m, j) \equiv p_{mj}^i = \frac{N_{mj}^i}{N_{mj}}$$
$$\mathcal{I}'_m = -\sum_{j=1}^{n} \frac{N_{mj}}{N_m} \sum_{i=1}^{K} p_{mj}^i \log_2 p_{mj}^i$$
Find the variable and split that minimize impurity (among all variables, and among all split positions for numeric variables).
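A sketch of the post-split impurity I'_m, given the per-branch class counts N_mj^i (names are illustrative):

```python
import math

def entropy(counts):
    n = sum(counts)
    return sum(c / n * math.log2(n / c) for c in counts if c > 0)

def split_impurity(branch_counts):
    """I'_m: entropy after the split, weighting each branch j by N_mj / N_m.
    branch_counts[j][i] is N_mj^i, the count of class i in branch j."""
    n_m = sum(sum(b) for b in branch_counts)
    return sum(sum(b) / n_m * entropy(b) for b in branch_counts)

# A node holding 6+6 instances: a perfect split vs. a useless one
print(split_impurity([[6, 0], [0, 6]]))  # 0.0
print(split_impurity([[3, 3], [3, 3]]))  # 1.0
```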
Tree Induction
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Issues
  Determine how to split the records
    How to specify the attribute test condition?
    How to determine the best split?
  Determine when to stop splitting
Information Gain vs. Gain Ratio

Relation sale:
custId  car     age  city  newCar
c1      taurus  27   sf    yes
c2      van     35   la    yes
c3      van     40   sf    yes
c4      taurus  22   sf    yes
c5      merc    50   la    no
c6      taurus  25   la    no

Candidate splits and the (yes, no) class distributions they induce:
city=sf | city=la:
  D = (2/3, 1/3) -> D1 = (1, 0), D2 = (1/3, 2/3)
car=merc | car=taurus | car=van:
  D = (2/3, 1/3) -> D1 = (0, 1), D2 = (2/3, 1/3), D3 = (1, 0)
age=22 | 25 | 27 | 35 | 40 | 50:
  D = (2/3, 1/3) -> D1 = (1, 0), D2 = (0, 1), D3 = (1, 0), D4 = (1, 0), D5 = (1, 0), D6 = (0, 1)
Gain(D, city=) = H(2/3, 1/3) − ½ H(1, 0) − ½ H(1/3, 2/3) = 0.46
Gain(D, car=) = H(2/3, 1/3) − 1/6 H(0, 1) − ½ H(2/3, 1/3) − 1/3 H(1, 0) = 0.46
Gain(D, age=) = H(2/3, 1/3) − 6·(1/6) H(0, 1) = 0.92

G_Ratio_pen(city=) = H(1/2, 1/2) = 1
G_Ratio_pen(age=) = log2(6) = 2.58
G_Ratio_pen(car=) = H(1/2, 1/3, 1/6) = 1.45

Result (I_Gain_Ratio): city > age > car
Result (I_Gain): age > car = city
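The slide's numbers can be reproduced in a few lines of Python (H is the entropy function; small rounding differences aside):

```python
import math

def H(*p):  # entropy of a distribution, in bits
    return sum(x * math.log2(1 / x) for x in p if x > 0)

gain_city = H(2/3, 1/3) - 1/2 * H(1, 0) - 1/2 * H(1/3, 2/3)                  # ~0.46
gain_car  = H(2/3, 1/3) - 1/6 * H(0, 1) - 1/2 * H(2/3, 1/3) - 1/3 * H(1, 0)  # ~0.46
gain_age  = H(2/3, 1/3) - 6 * (1/6) * H(0, 1)                                # ~0.92

pen_city, pen_age, pen_car = H(1/2, 1/2), math.log2(6), H(1/2, 1/3, 1/6)
print(gain_city / pen_city, gain_age / pen_age, gain_car / pen_car)
# ~0.46 > ~0.36 > ~0.31, i.e. city > age > car under the gain ratio,
# while the plain information gain ranks age > car = city
```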
How to determine the Best Split?
(The original figure compared three candidate splits of the 20 training records:)
OwnCar? (Yes | No): Yes -> C0: 6, C1: 4; No -> C0: 4, C1: 6
CarType? (Family | Sports | Luxury): Family -> C0: 1, C1: 3; Sports -> C0: 8, C1: 0; Luxury -> C0: 1, C1: 7
StudentID? (c1 | … | c20): each branch holds a single record, either C0: 1, C1: 0 or C0: 0, C1: 1
Before splitting: 10 records of class 0, 10 records of class 1.
Before: E(1/2, 1/2)
After (CarType?): 4/20 · E(1/4, 3/4) + 8/20 · E(1, 0) + 8/20 · E(1/8, 7/8)
Gain: Before − After. Pick the test that has the highest gain!
Remark: E stands for an impurity measure: Gini, entropy (H), or misclassification impurity (1 − max_c P(c)); a gain ratio can be used as well.
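A short sketch evaluating the CarType? split under three common choices of E; the function names are ours, and entropy is just one option:

```python
import math

def entropy(p):
    return sum(x * math.log2(1 / x) for x in p if x > 0)

def gini(p):
    return 1 - sum(x * x for x in p)

def misclassification(p):  # 1 - max_c P(c)
    return 1 - max(p)

# CarType? from the figure: Family (1, 3), Sports (8, 0), Luxury (1, 7)
for E in (entropy, gini, misclassification):
    before = E((1/2, 1/2))
    after = 4/20 * E((1/4, 3/4)) + 8/20 * E((1, 0)) + 8/20 * E((1/8, 7/8))
    print(E.__name__, before - after)  # the gain Before - After is positive under all three
```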
Entropy and Gain Computations
Assume we have m classes in our classification problem. A test S subdivides the examples D = (p1, …, pm) into n subsets D1 = (p11, …, p1m), …, Dn = (pn1, …, pnm). The quality of S is evaluated using Gain(D,S) (ID3) or Gain_Ratio(D,S) (C5.0):

Let $H(D = (p_1, \ldots, p_m)) = \sum_{i=1}^{m} p_i \log_2(1/p_i)$ (the entropy function)
$\mathrm{Gain}(D, S) = H(D) - \sum_{i=1}^{n} \frac{|D_i|}{|D|}\, H(D_i)$
$\mathrm{Gain\_Ratio}(D, S) = \mathrm{Gain}(D, S) \,/\, H\!\left(\frac{|D_1|}{|D|}, \ldots, \frac{|D_n|}{|D|}\right)$

Remarks: |D| denotes the number of elements in set D. D = (p1, …, pm) implies that p1 + … + pm = 1, and indicates that of the |D| examples, p1·|D| belong to the first class, p2·|D| belong to the second class, …, and pm·|D| belong to the m-th (last) class.
H(0,1) = H(1,0) = 0; H(1/2,1/2) = 1; H(1/4,1/4,1/4,1/4) = 2; H(1/p, …, 1/p) = log2(p).
C5.0 selects the test S with the highest value of Gain_Ratio(D,S), whereas ID3 picks the test S with the highest value of Gain(D,S) for the examples in set D.
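A direct rendering of these definitions as Python functions; the names and the convention of passing class-count lists are our choices:

```python
import math

def H(dist):
    """Entropy H(D) = sum_i p_i log2(1/p_i) of a distribution (p_1, ..., p_m)."""
    return sum(p * math.log2(1 / p) for p in dist if p > 0)

def gain(D, subsets):
    """Gain(D,S) = H(D) - sum_i |D_i|/|D| * H(D_i); arguments are class-count lists."""
    dist = lambda counts: [c / sum(counts) for c in counts]
    n = sum(D)
    return H(dist(D)) - sum(sum(Di) / n * H(dist(Di)) for Di in subsets)

def gain_ratio(D, subsets):
    """Gain_Ratio(D,S) = Gain(D,S) / H(|D_1|/|D|, ..., |D_n|/|D|)."""
    n = sum(D)
    return gain(D, subsets) / H([sum(Di) / n for Di in subsets])

# city split from the earlier example: D = [4 yes, 2 no]; sf -> [3, 0], la -> [1, 2]
print(gain([4, 2], [[3, 0], [1, 2]]))        # ~0.46
print(gain_ratio([4, 2], [[3, 0], [1, 2]]))  # ~0.46 (penalty H(1/2, 1/2) = 1)
```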
Splitting Continuous Attributes
Different ways of handling
  Discretization to form an ordinal categorical attribute
    Static: discretize once at the beginning
    Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), clustering, or supervised clustering
  Binary decision: (A < v) or (A ≥ v)
    Consider all possible splits and find the best cut v
    Can be more compute-intensive
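A sketch of the binary-decision search for a numeric attribute A: candidate cuts v are placed between consecutive sorted values and scored by weighted impurity (Gini here, purely for illustration; all names are ours):

```python
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_numeric_split(values, labels):
    """Try every cut between consecutive distinct sorted values of A."""
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut between equal attribute values
        v = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint cut point
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        best = min(best, (score, v))
    return best  # (weighted impurity, cut point v)

# ages and newCar labels from the earlier example table
print(best_numeric_split([22, 25, 27, 35, 40, 50], ["y", "n", "y", "y", "y", "n"]))
```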
Stopping Criteria for Tree Induction
1. Grow the entire tree
   Stop expanding a node when all the records belong to the same class
   Stop expanding a node when all the records have the same attribute values
2. Pre-pruning (do not grow the complete tree)
   Stop when only x examples are left
   … other pre-pruning strategies
How to Address Overfitting in Decision Trees
The most popular approach: post-pruning
  Grow the decision tree to its entirety
  Trim the nodes of the decision tree in a bottom-up fashion
  If the generalization error improves after trimming, replace the sub-tree by a leaf node; the class label of the leaf node is determined from the majority class of instances in the sub-tree
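A hedged sketch of this bottom-up scheme in the style of reduced-error pruning, with a held-out set standing in for the generalization-error estimate; the Node class and helpers are illustrative, and for brevity the leaf label is taken from the validation labels rather than from the training instances in the sub-tree as the slide specifies:

```python
from dataclasses import dataclass
from typing import Optional
from collections import Counter

@dataclass
class Node:
    attr: Optional[int] = None
    thr: Optional[float] = None
    yes: Optional["Node"] = None   # branch taken when x[attr] > thr
    no: Optional["Node"] = None
    label: object = None           # set only at leaves

def predict(node, x):
    while node.label is None:
        node = node.yes if x[node.attr] > node.thr else node.no
    return node.label

def subtree_errors(node, X, y):
    return sum(predict(node, xi) != yi for xi, yi in zip(X, y))

def prune(node, X, y):
    """Bottom-up: prune the children first, then try collapsing this node
    into a leaf; keep the leaf if it does not increase error on (X, y)."""
    if node.label is not None:
        return node
    taken = [xi[node.attr] > node.thr for xi in X]
    Xy, yy = [xi for xi, t in zip(X, taken) if t], [yi for yi, t in zip(y, taken) if t]
    Xn, yn = [xi for xi, t in zip(X, taken) if not t], [yi for yi, t in zip(y, taken) if not t]
    node.yes, node.no = prune(node.yes, Xy, yy), prune(node.no, Xn, yn)
    if y:
        majority = Counter(y).most_common(1)[0][0]
        if sum(yi != majority for yi in y) <= subtree_errors(node, X, y):
            return Node(label=majority)  # the split does not help: collapse it
    return node

# Usage: a split that does not help on the validation set gets collapsed
root = Node(attr=0, thr=5.0, yes=Node(label="a"), no=Node(label="a"))
print(prune(root, [[3.0], [7.0]], ["a", "a"]).label)  # -> "a"
```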
Advantages of Decision Tree Based Classification
  Inexpensive to construct
  Extremely fast at classifying unknown records
  Easy to interpret for small-sized trees
  Okay for noisy data
  Can handle both continuous and symbolic attributes
  Accuracy is comparable to other classification techniques for many simple data sets
  Decent average performance over many datasets
  Kind of a standard: if you want to show that your "new" classification technique really "improves the world", compare its performance against decision trees (e.g. C5.0) using 10-fold cross-validation
  Does not need distance functions; only the order of attribute values is important for classification: 0.1, 0.2, 0.3 and 0.331, 0.332, 0.333 are the same for a decision tree learner
Disadvantages of Decision Tree Based Classification
  Relies on rectangular approximations that might not be good for some datasets
  Selecting good learning-algorithm parameters (e.g. the degree of pruning) is non-trivial
  Ensemble techniques, support vector machines, and kNN might obtain higher accuracies for a specific dataset
More recently, forests (ensembles of decision trees) have gained some popularity.
Error in Regression Trees
Examples: (1, 1.37), (2, -1.45), (2.1, -1.35), (2.2, -1.25), (7, 2.06), (8, 1.66)
Error before split: 1/6 · (2·0² + 2·0.01 + 2·0.04)
In general: regression trees minimize the squared error!
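The arithmetic can be checked in a few lines; the grouping {1}, {2, 2.1, 2.2}, {7, 8} is our assumption about the leaf regions the slide has in mind:

```python
data = [(1, 1.37), (2, -1.45), (2.1, -1.35), (2.2, -1.25), (7, 2.06), (8, 1.66)]

def sse(group):
    """Squared error of a leaf that predicts the mean response of its examples."""
    mean = sum(r for _, r in group) / len(group)
    return sum((r - mean) ** 2 for _, r in group)

groups = [data[:1], data[1:4], data[4:]]        # {1}, {2, 2.1, 2.2}, {7, 8}
print(sum(sse(g) for g in groups) / len(data))  # (2*0 + 2*0.01 + 2*0.04)/6 ~ 0.0167
```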
Regression Trees
Error at node m:
$$b_m(x) = \begin{cases} 1 & \text{if } x \in \mathcal{X}_m \text{: } x \text{ reaches node } m \\ 0 & \text{otherwise} \end{cases}$$
$$E_m = \frac{1}{N_m} \sum_t \left(r^t - g_m\right)^2 b_m(x^t), \qquad g_m = \frac{\sum_t b_m(x^t)\, r^t}{\sum_t b_m(x^t)}$$
After splitting:
$$b_{mj}(x) = \begin{cases} 1 & \text{if } x \in \mathcal{X}_{mj} \text{: } x \text{ reaches node } m \text{ and branch } j \\ 0 & \text{otherwise} \end{cases}$$
$$E'_m = \frac{1}{N_m} \sum_j \sum_t \left(r^t - g_{mj}\right)^2 b_{mj}(x^t), \qquad g_{mj} = \frac{\sum_t b_{mj}(x^t)\, r^t}{\sum_t b_{mj}(x^t)}$$
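A minimal rendering of E_m and E'_m; function names are ours, and the responses r^t reaching a node are passed directly, so the indicator b_m stays implicit:

```python
def node_error(r):
    """E_m = (1/N_m) sum_t (r^t - g_m)^2 over the responses reaching node m."""
    g_m = sum(r) / len(r)
    return sum((rt - g_m) ** 2 for rt in r) / len(r)

def split_error(branches):
    """E'_m: pooled squared error after the split; each branch j predicts
    its own mean g_mj, and branches are weighted by their sizes."""
    n_m = sum(len(b) for b in branches)
    return sum(sum((rt - sum(b) / len(b)) ** 2 for rt in b) for b in branches) / n_m

r = [1.37, -1.45, -1.35, -1.25, 2.06, 1.66]
print(node_error(r))                        # high: one mean must fit all six responses
print(split_error([r[:1], r[1:4], r[4:]]))  # ~0.0167: the split reduces the error
```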
Model Selection in Trees (figure slide)
Summary: Regression Trees
Regression trees work like decision trees, except:
  Leaves carry numbers that predict the output variable, instead of class labels
  Instead of entropy/purity, the squared prediction error serves as the performance measure
  Tests that reduce the squared prediction error the most are selected as node tests; test conditions have the form x > threshold, where x is an independent (input) variable of the prediction problem
  Instead of using the majority class, leaf labels are generated by averaging the values of the dependent variable over the examples associated with the particular leaf node
  Like decision trees, regression trees employ rectangular tessellations, with a fixed number associated with each rectangle; the number is the output for the inputs that lie inside the rectangle
In contrast to ordinary regression, which performs curve fitting, regression trees use averaging with respect to the output variable when making predictions.
Multivariate Trees