Page 1

Decision tree

LING 572

Fei Xia

1/16/06

Page 2

Outline

• Basic concepts

• Issues

In this lecture, “attribute” and “feature” are interchangeable.

Page 3

Basic concepts

Page 4

Main idea

• Build a decision tree
  – Each node represents a test
  – Training instances are split at each node

• Greedy algorithm

Page 5

A classification problem

District    House type      Income   Previous customer   Outcome (target)
Suburban    Detached        High     No                  Nothing
Suburban    Semi-detached   High     Yes                 Respond
Rural       Semi-detached   Low      No                  Respond
Urban       Detached        Low      Yes                 Nothing

Page 6

Decision tree (for the example data):

District?
  – Suburban (3/5) → House type?
      – Detached (2/2) → Nothing
      – Semi-detached (3/3) → Respond
  – Rural (4/4) → Respond
  – Urban (3/5) → Previous customer?
      – Yes (3/3) → Nothing
      – No (2/2) → Respond

Page 7

Decision tree representation

• Each internal node is a test:
  – Theoretically, a node can test multiple attributes
  – In most systems, a node tests exactly one attribute

• Each branch corresponds to a test result:
  – A branch corresponds to an attribute value or a range of attribute values

• Each leaf node assigns:
  – a class: decision tree
  – a real value: regression tree

Page 8

What’s the (a?) best decision tree?

• “Best”: You need a bias (e.g., prefer the “smallest” tree): least depth? Fewest nodes? Which trees are the best predictors of unseen data?

• Occam's Razor: we prefer the simplest hypothesis that fits the data.

Find a decision tree that is as small as possible and fits the data

Page 9

Finding a smallest decision tree

• A decision tree can represent any discrete function of the inputs: y = f(x1, x2, …, xn)
  – How many functions are there, assuming all the attributes are binary? (With n binary attributes and a binary goal, there are 2^(2^n) distinct functions.)

• The space of decision trees is too big for a systematic search for a smallest decision tree.

• Solution: greedy algorithm

Page 10

Basic algorithm: top-down induction

1. Find the “best” decision attribute, A, and assign A as decision attribute for node

2. For each value (?) of A, create a new branch, and divide up training examples

3. Repeat steps 1–2 until the gain is small enough
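
A minimal sketch of this greedy loop in Python, assuming examples are dicts of discrete attribute values and an information-gain scorer; the helper names (entropy, info_gain, grow_tree) and the min_gain threshold are illustrative, not from the slides:

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum_i p(c_i) log2 p(c_i), computed over the labels at this node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, labels, attr):
    """Expected reduction in entropy from splitting on attr."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(ex[attr] for ex in examples):
        sub = [lab for ex, lab in zip(examples, labels) if ex[attr] == v]
        gain -= (len(sub) / n) * entropy(sub)
    return gain

def grow_tree(examples, labels, attrs, min_gain=1e-3):
    """Top-down induction: pick the best attribute, split, and recurse."""
    majority = Counter(labels).most_common(1)[0][0]
    if not attrs or len(set(labels)) == 1:
        return majority                                   # leaf node
    # Step 1: find the "best" decision attribute
    best = max(attrs, key=lambda a: info_gain(examples, labels, a))
    if info_gain(examples, labels, best) < min_gain:
        return majority                                   # Step 3: gain too small, stop
    # Step 2: one branch per value of the best attribute
    tree = {"attr": best, "majority": majority, "branches": {}}
    rest = [a for a in attrs if a != best]
    for v in set(ex[best] for ex in examples):
        idx = [i for i, ex in enumerate(examples) if ex[best] == v]
        tree["branches"][v] = grow_tree([examples[i] for i in idx],
                                        [labels[i] for i in idx], rest, min_gain)
    return tree
```

Called as grow_tree(examples, labels, ["District", "House type", "Income", "Previous customer"]) on data like the table on page 5, this returns a nested dict such as {"attr": "District", "branches": {...}}.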

Page 11

Major issues

Page 12

Major issues

Q1: Choosing best attribute: what quality measure to use?

Q2: Determining when to stop splitting: avoid overfitting

Q3: Handling continuous attributes

Page 13

Other issues

Q4: Handling training data with missing attribute values

Q5: Handling attributes with different costs

Q6: Dealing with continuous goal attribute

Page 14

Q1: What quality measure

• Information gain

• Gain Ratio

• χ² (chi-square)

• Mutual information

• ….

Page 15

Entropy of a training set

• S is a sample of training examples

• Entropy is one way of measuring the impurity of S

• P(ci) is the proportion of examples in S whose category is ci.

H(S) = -\sum_i p(c_i) \log_2 p(c_i)
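
As a quick check of this formula, a small Python helper (base-2 logs, so entropy is in bits); the value for a [9+, 5-] sample is used in a later example:

```python
import math

def H(counts):
    """Entropy of a class distribution given as raw counts, in bits."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

print(round(H([9, 5]), 3))   # 0.94 for a [9+, 5-] sample
```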

Page 16

Information gain

• InfoGain(Y | X): I must transmit Y. How many bits on average would it save me if both ends of the line knew X?

• Definition:

InfoGain (Y | X) = H(Y) – H(Y|X)

• Also written as InfoGain (Y, X)

Page 17

Information Gain

• InfoGain(S, A): expected reduction in entropy due to knowing A.

• Choose the A with the max information gain (i.e., the A with the min average entropy).

InfoGain(S, A) = H(S) - H(S \mid A)

H(S \mid A) = \sum_{a \in Values(A)} p(A = a) \, H(S \mid A = a)
            = \sum_{a \in Values(A)} \frac{|S_a|}{|S|} \, H(S_a)

Page 18

An example

Income split (High / Low):
  S = [9+, 5-], E = 0.940
  High: [3+, 4-], E = 0.985     Low: [6+, 1-], E = 0.592
  InfoGain(S, Income) = 0.940 - (7/14)*0.985 - (7/14)*0.592 = 0.151

PrevCust split (Yes / No):
  S = [9+, 5-], E = 0.940
  Yes: [6+, 2-], E = 0.811      No: [3+, 3-], E = 1.00
  InfoGain(S, PrevCust) = 0.940 - (8/14)*0.811 - (6/14)*1.0 = 0.048
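
These numbers can be reproduced with a short helper implementing the formula from the previous slide, reusing H from the entropy sketch above (the function name info_gain is just for illustration):

```python
def info_gain(parent_counts, child_counts):
    """InfoGain = H(S) - sum_a (|S_a|/|S|) H(S_a), with distributions given as counts."""
    n = sum(parent_counts)
    return H(parent_counts) - sum(sum(c) / n * H(c) for c in child_counts)

print(info_gain([9, 5], [[3, 4], [6, 1]]))   # ~0.152 (the slide's 0.151 rounds the entropies first)
print(info_gain([9, 5], [[6, 2], [3, 3]]))   # ~0.048
```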

Page 19

Other quality measures

• Problem with information gain:
  – Information gain prefers attributes with many values.

• An alternative: Gain Ratio

where S_a is the subset of S for which A has value a:

GainRatio(S, A) = \frac{Gain(S, A)}{SplitInfo(S, A)}

SplitInfo(S, A) = -\sum_{a \in Values(A)} \frac{|S_a|}{|S|} \log_2 \frac{|S_a|}{|S|}
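
A small sketch of these two quantities, building on the H and info_gain helpers above (again, the names are illustrative):

```python
import math

def split_info(child_counts):
    """SplitInfo(S, A) = -sum_a (|S_a|/|S|) log2(|S_a|/|S|): the entropy of the split itself."""
    n = sum(sum(c) for c in child_counts)
    return -sum(sum(c) / n * math.log2(sum(c) / n) for c in child_counts if sum(c) > 0)

def gain_ratio(parent_counts, child_counts):
    return info_gain(parent_counts, child_counts) / split_info(child_counts)

# The Income split from the earlier example: two branches of size 7, so SplitInfo = 1 bit
print(round(gain_ratio([9, 5], [[3, 4], [6, 1]]), 3))   # ~0.152
```

An attribute with many values splits S into many small subsets, which makes SplitInfo large and drives the ratio down; this is how the bias noted above is penalized.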

Page 20

Q2: Avoiding overfitting

• Overfitting occurs when our decision tree characterizes too much detail or noise in our training data.

• Consider the error of hypothesis h over
  – Training data: ErrorTrain(h)
  – Entire distribution D of data: ErrorD(h)

• A hypothesis h overfits the training data if there is an alternative hypothesis h’ such that
  – ErrorTrain(h) < ErrorTrain(h’), and
  – ErrorD(h) > ErrorD(h’)

Page 21

How to avoid overfitting

• Stop growing the tree earlier, e.g., stop when
  – InfoGain < threshold
  – Number of examples in a node < threshold
  – Depth of the tree > threshold
  – …

• Grow full tree, then post-prune

In practice, both are used. Some people claim that the latter works better than the former.

Page 22

Post-pruning

• Split the data into a training set and a validation set

• Do until further pruning is harmful:

  – Evaluate the impact on the validation set of pruning each possible node (plus those below it)

  – Greedily remove the node whose removal most improves (or at least does not hurt) performance on the validation set

This produces a smaller tree with the best performance on the validation set.
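
A rough sketch of this validation-set ("reduced-error") pruning loop, assuming the nested-dict trees from the induction sketch earlier (each internal node stores the majority class of its training examples in a "majority" field); classify, accuracy, and reduced_error_prune are illustrative names, not the slides':

```python
import copy

def classify(tree, example):
    """Follow the nested-dict tree down to a leaf label."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(example[tree["attr"]], tree["majority"])
    return tree

def accuracy(tree, data):
    return sum(classify(tree, ex) == lab for ex, lab in data) / len(data)

def internal_paths(tree, path=()):
    """Yield the path of branch values leading to every internal node."""
    if isinstance(tree, dict):
        yield path
        for v, sub in tree["branches"].items():
            yield from internal_paths(sub, path + (v,))

def reduced_error_prune(tree, val_data):
    """Greedily collapse subtrees to leaves as long as validation accuracy does not drop."""
    while True:
        base, best = accuracy(tree, val_data), None
        for path in internal_paths(tree):
            cand = copy.deepcopy(tree)
            node, parent, last = cand, None, None
            for v in path:                              # walk down to the node to prune
                parent, node, last = node, node["branches"][v], v
            leaf = node["majority"]                     # replace the subtree with a leaf
            if parent is None:
                cand = leaf
            else:
                parent["branches"][last] = leaf
            if accuracy(cand, val_data) >= base:        # keep the best non-harmful pruning
                base, best = accuracy(cand, val_data), cand
        if best is None:
            return tree
        tree = best
```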

Page 23

Performance measure

• Accuracy:
  – on validation data
  – K-fold cross-validation

• Misclassification cost: sometimes higher accuracy is desired for some classes than for others.

• MDL: size(tree) + errors(tree)

Page 24

Rule post-pruning

• Convert tree to equivalent set of rules

• Prune each rule independently of others

• Sort final rules into desired sequence for use

• Perhaps most frequently used method (e.g., C4.5)

Page 25

Q3: handling numeric attributes

• Continuous attribute → discrete attribute

• Example
  – Original attribute: Temperature = 82.5
  – New attribute: (Temperature > 72.3), with values {t, f}

Question: how to choose the split points?

Page 26

Choosing split points for a continuous attribute

• Sort the examples according to the values of the continuous attribute.

• Identify adjacent examples that differ in their target labels and attribute values → a set of candidate split points

• Calculate the gain for each split point and choose the one with the highest gain.
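
A sketch of these three steps, reusing H from the entropy sketch above; the temperature values and labels below are made up purely to exercise the code:

```python
def candidate_splits(values, labels):
    """Midpoints between adjacent (sorted) examples whose values and labels both differ."""
    pairs = sorted(zip(values, labels))
    return [(v1 + v2) / 2
            for (v1, y1), (v2, y2) in zip(pairs, pairs[1:])
            if v1 != v2 and y1 != y2]

def threshold_gain(values, labels, t):
    """Information gain of the binary test (value <= t)."""
    classes = set(labels)
    left  = [y for v, y in zip(values, labels) if v <= t]
    right = [y for v, y in zip(values, labels) if v > t]
    counts = lambda ys: [ys.count(c) for c in classes]
    n = len(labels)
    return (H(counts(labels))
            - len(left) / n * H(counts(left))
            - len(right) / n * H(counts(right)))

temps  = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]    # made-up data
labels = ['+', '-', '+', '+', '+', '-', '-', '+', '-', '+', '+', '-']
best = max(candidate_splits(temps, labels), key=lambda t: threshold_gain(temps, labels, t))
print(best)   # the highest-gain candidate threshold
```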

Page 27

Q4: Unknown attribute values

Possible solutions:

• Assume an attribute can take the value “blank”.

• Assign the most common value of A among the training examples at node n.

• Assign the most common value of A among the training examples at node n that have the same target class.

• Assign a probability p_i to each possible value v_i of A
  – Assign a fraction (p_i) of the example to each descendant in the tree
  – This method is used in C4.5.
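
A small sketch of the fractional-count idea (the last option above), using a standalone representation in which each training example is a (dict, weight) pair; this only illustrates the bookkeeping, not C4.5's actual implementation:

```python
from collections import Counter

def fractional_split(weighted_examples, attr):
    """Split (example, weight) pairs on attr; an example with a missing value for attr
    goes down every branch with a weight proportional to how common that value is here."""
    known = [ex[attr] for ex, _ in weighted_examples if ex.get(attr) is not None]
    freq = Counter(known)
    total = sum(freq.values())
    branches = {v: [] for v in freq}
    for ex, w in weighted_examples:
        if ex.get(attr) is not None:
            branches[ex[attr]].append((ex, w))
        else:
            for v, c in freq.items():                    # fractional weight: p_i = c / total
                branches[v].append((ex, w * c / total))
    return branches

data = [({"Income": "High"}, 1.0), ({"Income": "High"}, 1.0),
        ({"Income": "Low"},  1.0), ({"Income": None},   1.0)]   # last example is missing Income
print({v: [w for _, w in exs] for v, exs in fractional_split(data, "Income").items()})
# {'High': [1.0, 1.0, 0.666...], 'Low': [1.0, 0.333...]}
```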

Page 28

Q5: Attributes with cost

• Ex: Medical diagnosis (e.g., blood test) has a cost

• Question: how to learn a consistent tree with low expected cost?

• One approach: replace the gain with the following (Tan and Schlimmer, 1990):

\frac{Gain^2(S, A)}{Cost(A)}

Page 29

Q6: Dealing with a continuous target attribute → regression tree

• A variant of decision trees

• Estimation problem: approximate real-valued functions: e.g., the crime rate

• A leaf node is marked with a real value or a linear function: e.g., the mean of the target values of the examples at the node.

• Measure of impurity: e.g., variance, standard deviation, …
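
For example, with variance as the impurity measure, a split is scored by how much it reduces the (weighted) variance of the target values; a minimal sketch with made-up crime-rate numbers:

```python
def variance(ys):
    """Impurity of a regression-tree node: variance of the target values."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def variance_reduction(parent, children):
    """Analogue of information gain: parent variance minus weighted child variance."""
    n = len(parent)
    return variance(parent) - sum(len(c) / n * variance(c) for c in children)

rates = [2.1, 2.3, 8.7, 9.0]                                           # made-up crime rates
print(round(variance_reduction(rates, [[2.1, 2.3], [8.7, 9.0]]), 2))   # ~11.06
# A leaf would then predict the mean of its examples (2.2 and 8.85 here).
```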

Page 30

Summary of Major issues

Q1: Choosing best attribute: different quality measures.

Q2: Determining when to stop splitting: stop earlier or post-pruning

Q3: Handling continuous attributes: find the breakpoints

Page 31

Summary of other issues

Q4: Handling training data with missing attribute values: blank value, most common value, or fractional count

Q5: Handling attributes with different costs: use a quality measure that includes the cost factors.

Q6: Dealing with continuous goal attribute: various ways of building regression trees.

Page 32

Common algorithms

• ID3

• C4.5

• CART

Page 33

ID3

• Proposed by Quinlan (as is C4.5)

• Can handle basic cases: discrete attributes, no missing information, etc.

• Information gain as quality measure

Page 34

C4.5

• An extension of ID3:
  – Several quality measures
  – Incomplete information (missing attribute values)
  – Numerical (continuous) attributes
  – Pruning of decision trees
  – Rule derivation
  – Random mode and batch mode

Page 35

CART

• CART (classification and regression tree)

• Proposed by Breiman et al. (1984)

• Constant numerical values in leaves

• Variance as measure of impurity

Page 36

Summary

• Basic case:
  – Discrete input attributes
  – Discrete target attribute
  – No missing attribute values
  – Same cost for all tests and all kinds of misclassification

• Extended cases:
  – Continuous attributes
  – Real-valued target attribute
  – Some examples are missing some attribute values
  – Some tests are more expensive than others

Page 37

Summary (cont)

• Basic algorithm:
  – greedy algorithm
  – top-down induction
  – bias for small trees

• Major issues: Q1-Q6

Page 38

Strengths of decision tree

• Simplicity (conceptual)

• Efficiency at testing time

• Interpretability: Ability to generate understandable rules

• Ability to handle both continuous and discrete attributes.

Page 39

Weaknesses of decision tree

• Efficiency at training: sorting, calculating gain, etc.

• Theoretical validity: greedy algorithm, no global optimization

• Prediction accuracy: trouble with non-rectangular regions

• Stability and robustness

• Sparse data problem: the data are split at each node

Page 40

Addressing the weaknesses

• Used in classifier ensemble algorithms:
  – Bagging
  – Boosting

• Decision stump: a one-level decision tree

Page 41

Coming up

• Thursday: Decision list

• Next week: Feature selection and bagging

Page 42

Additional slides

Page 43

Classification and estimation problems

• Given
  – a finite set of (input) attributes: features
    • Ex: District, House type, Income, Previous customer
  – a target attribute: the goal
    • Ex: Outcome: {Nothing, Respond}
  – training data: a set of classified examples in attribute-value representation

• Predict the value of the goal given the values of the input attributes
  – The goal is a discrete variable → classification problem
  – The goal is a continuous variable → estimation problem

Page 44

Bagging

• Introduced by Breiman

• It first creates multiple decision trees, each trained on a different (resampled) training set.

• The trees’ predictions are then combined (e.g., by majority vote).

• This addresses some of the problems (e.g., instability) inherent in regular ID3.

Page 45

Boosting

• Introduced by Freund and Schapire

• It examines the instances that are incorrectly classified and assigns them higher weights.

• These weights refocus later hypotheses on the hard examples; when the hypotheses are combined, each is weighted by how well it performs.