Page 1: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Decision Trees

Reading: Textbook, “Learning From Examples”, Section 3

Page 2: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Day Outlook Temp Humidity Wind PlayTennis

D1   Sunny     Hot   High    Weak    No
D2   Sunny     Hot   High    Strong  No
D3   Overcast  Hot   High    Weak    Yes
D4   Rain      Mild  High    Weak    Yes
D5   Rain      Cool  Normal  Weak    Yes
D6   Rain      Cool  Normal  Strong  No
D7   Overcast  Cool  Normal  Strong  Yes
D8   Sunny     Mild  High    Weak    No
D9   Sunny     Cool  Normal  Weak    Yes
D10  Rain      Mild  Normal  Weak    Yes
D11  Sunny     Mild  Normal  Strong  Yes
D12  Overcast  Mild  High    Strong  Yes
D13  Overcast  Hot   Normal  Weak    Yes
D14  Rain      Mild  High    Strong  No

Training data:

Page 3: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Decision Trees

• Target concept: “Good days to play tennis”

• Example: <Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong>

Classification?

Page 4: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

• How can good decision trees be automatically constructed?

• Would it be possible to use a “generate-and-test” strategy to find a correct decision tree?

– I.e., systematically generate all possible decision trees, in order of size, until a correct one is generated.

Page 5: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

• Why should we care about finding the simplest (i.e., smallest) correct decision tree?

Page 6: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Decision Tree Induction

• Goal: given a set of training examples, construct a decision tree that classifies those training examples correctly (and, hopefully, generalizes).

• The original idea of decision trees was developed in the 1960s by psychologists Hunt, Marin, and Stone as a model of human concept learning (CLS = “Concept Learning System”).

• In the 1970s, AI researcher Ross Quinlan used this idea for AI concept learning:
  – ID3 (“Iterative Dichotomiser 3”), 1979

Page 7: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

The Basic Decision Tree Learning Algorithm (ID3)

1. Determine which attribute is, by itself, the most useful one for distinguishing the two classes over all the training data. Put it at the root of the tree.

Page 8: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Outlook

Page 9: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

The Basic Decision Tree Learning Algorithm (ID3)

1. Determine which attribute is, by itself, the most useful one for distinguishing the two classes over all the training data. Put it at the root of the tree.

2. Create branches from the root node for each possible value of this attribute. Sort the training examples down the branch corresponding to their value of this attribute.

Page 10: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Outlook

  Sunny:    D1, D2, D8, D9, D11
  Overcast: D3, D7, D12, D13
  Rain:     D4, D5, D6, D10, D14

Page 11: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

The Basic Decision Tree Learning Algorithm (ID3)

1. Determine which attribute is, by itself, the most useful one for distinguishing the two classes over all the training data. Put it at the root of the tree.

2. Create branches from the root node for each possible value of this attribute. Sort the training examples down the branch corresponding to their value of this attribute.

3. At each descendant node, determine which attribute is, by itself, the most useful one for distinguishing the two classes for the corresponding training data. Put that attribute at that node.

Page 12: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Outlook

  Sunny:    Humidity
  Overcast: Yes
  Rain:     Wind

Page 13: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

The Basic Decision Tree Learning Algorithm (ID3)

1. Determine which attribute is, by itself, the most useful one for distinguishing the two classes over all the training data. Put it at the root of the tree.

2. Create branches from the root node for each possible value of this attribute. Sort the training examples down the branch corresponding to their value of this attribute.

3. At each descendant node, determine which attribute is, by itself, the most useful one for distinguishing the two classes for the corresponding training data. Put that attribute at that node.

4. Go to step 2, this time treating the current descendant node as the root of its subtree.

Note: This is greedy search with no backtracking

Page 14: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

The Basic Decision Tree Learning Algorithm (ID3)

1. Determine which attribute is, by itself, the most useful one for distinguishing the two classes over all the training data. Put it at the root of the tree.

2. Create branches from the root node for each possible value of this attribute. Sort the training examples down the branch corresponding to their value of this attribute.

3. At each descendant node, determine which attribute is, by itself, the most useful one for distinguishing the two classes for the corresponding training data. Put that attribute at that node.

4. Go to step 2, this time treating the current descendant node as the root of its subtree.

Note: This is greedy search with no backtracking
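Taken together, the four steps are a simple recursion. A minimal sketch in Python (not from the slides; it assumes examples are dicts with a “label” key and leaves “most useful attribute” to a choose_attribute function, which later slides define via information gain):

```python
from collections import Counter

def id3(examples, attributes, choose_attribute):
    """Sketch of steps 1-4.  Leaves are class labels; internal nodes are
    dicts of the form {'attr': name, 'branches': {value: subtree}}."""
    labels = [ex['label'] for ex in examples]
    if len(set(labels)) == 1 or not attributes:
        # Pure node, or no attributes left: answer with the majority class.
        return Counter(labels).most_common(1)[0][0]
    best = choose_attribute(examples, attributes)                  # steps 1 and 3
    tree = {'attr': best, 'branches': {}}
    for value in set(ex[best] for ex in examples):                 # step 2
        subset = [ex for ex in examples if ex[best] == value]
        remaining = [a for a in attributes if a != best]
        tree['branches'][value] = id3(subset, remaining, choose_attribute)  # step 4
    return tree
```

(For simplicity the sketch only creates branches for attribute values that actually occur in the training examples at the current node.)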

Page 15: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

How to determine which attribute is the best classifier for a set of training examples?

E.g., why was Outlook chosen to be the root of the tree?

Page 16: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

“Impurity” of a split

• Task: classify as Female or Male

• Instances: Jane, Mary, Alice, Bob, Allen, Doug

• Each instance has two binary attributes: “wears lipstick” and “has long hair”

Page 17: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

“Impurity” of a split

Wears lipstick:   T → Jane, Mary, Alice      F → Bob, Allen, Doug       (pure split)

Has long hair:    T → Jane, Mary, Bob        F → Alice, Allen, Doug     (impure split)

For each node of the tree we want to choose the attribute that gives the purest split.

But how do we measure the degree of impurity of a split?

Page 18: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Entropy

• Let S be a set of training examples.

  p+ = proportion of positive examples
  p− = proportion of negative examples

• Entropy measures the degree of uniformity or non-uniformity in a collection.

• Roughly, it measures how predictable the collection is, based only on the distribution of + and − examples.
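In symbols, for a two-class collection this is the standard definition:

Entropy(S) = −p+ log2(p+) − p− log2(p−)     (taking 0 · log2 0 = 0)

A tiny Python sketch (names chosen here for illustration):

```python
import math

def entropy(pos, neg):
    """Two-class entropy: -p+*log2(p+) - p-*log2(p-), with 0*log2(0) = 0."""
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / (pos + neg)
            result -= p * math.log2(p)
    return result

print(entropy(4, 0))   # 0.0 -> a pure collection is perfectly predictable
print(entropy(3, 3))   # 1.0 -> a 50/50 collection is maximally unpredictable
```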

Page 19: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Entropy

• When is entropy zero?

• When is entropy maximum, and what is its value?

Page 20: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.
Page 21: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

• Entropy gives minimum number of bits of information needed to encode the classification of an arbitrary member of S.

– If p+ = 1, don’t need any bits (entropy 0)

– If p+ = .5, need one bit (+ or -)

– If p+ = .8, can encode collection of {+,-} values using on average less than 1 bit per value

• Can you explain how we might do this?
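(For reference: with p+ = .8, Entropy = −0.8 log2(0.8) − 0.2 log2(0.2) ≈ 0.72 bits. One standard way to get below 1 bit per value is to encode blocks of values together rather than one at a time, e.g., Huffman or arithmetic coding over pairs of symbols; the average code length per value then approaches the entropy.)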

Page 22: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Entropy of each branch?

Wears lipstick:   T → Jane, Mary, Alice      F → Bob, Allen, Doug       (pure split)

Has long hair:    T → Jane, Mary, Bob        F → Alice, Allen, Doug     (impure split)

Page 23: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Day Outlook Temp Humidity Wind PlayTennis

D1   Sunny     Hot   High    Weak    No
D2   Sunny     Hot   High    Strong  No
D3   Overcast  Hot   High    Weak    Yes
D4   Rain      Mild  High    Weak    Yes
D5   Rain      Cool  Normal  Weak    Yes
D6   Rain      Cool  Normal  Strong  No
D7   Overcast  Cool  Normal  Strong  Yes
D8   Sunny     Mild  High    Weak    No
D9   Sunny     Cool  Normal  Weak    Yes
D10  Rain      Mild  Normal  Weak    Yes
D11  Sunny     Mild  Normal  Strong  Yes
D12  Overcast  Mild  High    Strong  Yes
D13  Overcast  Hot   Normal  Weak    Yes
D14  Rain      Mild  High    Strong  No

What is the entropy of the “Play Tennis” training set?
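(Worked out: the table has 9 Yes and 5 No examples, so Entropy(S) = −(9/14) log2(9/14) − (5/14) log2(5/14) ≈ 0.94.)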

Page 24: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

• Suppose you’re now given a new example. In absence of any additional information, what classification should you guess?

Page 25: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

What is the average entropy of the “Humidity” attribute?
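(Worked out: Humidity = High covers 7 examples (3 Yes, 4 No), entropy ≈ 0.985; Humidity = Normal covers 7 examples (6 Yes, 1 No), entropy ≈ 0.592. The average entropy, weighted by subset size, is (7/14)(0.985) + (7/14)(0.592) ≈ 0.79, so splitting on Humidity reduces the entropy from about 0.94 to about 0.79.)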

Page 26: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.
Page 27: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

In-class exercise:

• Calculate information gain of the “Outlook” attribute.

Page 28: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Formal definition of Information Gain
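The standard definition: if splitting S on attribute A produces subsets S_v, one for each value v of A, then

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

i.e., the original entropy minus the size-weighted average entropy of the subsets.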

Page 29: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Day Outlook Temp Humidity Wind PlayTennis

D1   Sunny     Hot   High    Weak    No
D2   Sunny     Hot   High    Strong  No
D3   Overcast  Hot   High    Weak    Yes
D4   Rain      Mild  High    Weak    Yes
D5   Rain      Cool  Normal  Weak    Yes
D6   Rain      Cool  Normal  Strong  No
D7   Overcast  Cool  Normal  Strong  Yes
D8   Sunny     Mild  High    Weak    No
D9   Sunny     Cool  Normal  Weak    Yes
D10  Rain      Mild  Normal  Weak    Yes
D11  Sunny     Mild  Normal  Strong  Yes
D12  Overcast  Mild  High    Strong  Yes
D13  Overcast  Hot   Normal  Weak    Yes
D14  Rain      Mild  High    Strong  No

Page 30: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Operation of ID3

1. Compute information gain for each attribute.

Outlook

Temperature

Humidity

Wind
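A sketch of this step in Python (not from the slides; the training table is typed in by hand and the helper names are illustrative). On this data the gains come out to roughly 0.25 for Outlook, 0.15 for Humidity, 0.05 for Wind, and 0.03 for Temperature, which is why Outlook ends up at the root:

```python
import math
from collections import Counter

rows = [  # (Outlook, Temperature, Humidity, Wind, PlayTennis), rows D1-D14
    ('Sunny', 'Hot', 'High', 'Weak', 'No'),       ('Sunny', 'Hot', 'High', 'Strong', 'No'),
    ('Overcast', 'Hot', 'High', 'Weak', 'Yes'),   ('Rain', 'Mild', 'High', 'Weak', 'Yes'),
    ('Rain', 'Cool', 'Normal', 'Weak', 'Yes'),    ('Rain', 'Cool', 'Normal', 'Strong', 'No'),
    ('Overcast', 'Cool', 'Normal', 'Strong', 'Yes'), ('Sunny', 'Mild', 'High', 'Weak', 'No'),
    ('Sunny', 'Cool', 'Normal', 'Weak', 'Yes'),   ('Rain', 'Mild', 'Normal', 'Weak', 'Yes'),
    ('Sunny', 'Mild', 'Normal', 'Strong', 'Yes'), ('Overcast', 'Mild', 'High', 'Strong', 'Yes'),
    ('Overcast', 'Hot', 'Normal', 'Weak', 'Yes'), ('Rain', 'Mild', 'High', 'Strong', 'No'),
]
attrs = ('Outlook', 'Temperature', 'Humidity', 'Wind')
examples = [dict(zip(attrs + ('label',), row)) for row in rows]

def entropy(examples):
    counts = Counter(ex['label'] for ex in examples)
    n = len(examples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(examples, attr):
    n = len(examples)
    remainder = 0.0
    for value in set(ex[attr] for ex in examples):
        subset = [ex for ex in examples if ex[attr] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(examples) - remainder

for a in attrs:
    print(a, round(info_gain(examples, a), 3))
# Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048
```

(The slides later quote Gain(S, Outlook) = .94 − .694 = .246; the small difference is just rounding of the intermediate values.)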

Page 31: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.
Page 32: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

ID3’s Inductive Bias

• Given a set of training examples, there are typically many decision trees consistent with that set.

– E.g., what would be another decision tree consistent with the example training data?

• Of all these, which one does ID3 construct?

– First acceptable tree found in greedy search

Page 33: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

ID3’s Inductive Bias, continued

• Algorithm does two things:

– Favors shorter trees over longer ones

– Places attributes with highest information gain closest to root.

• What would be an algorithm that explicitly constructs the shortest possible tree consistent with the training data?

Page 34: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

ID3’s Inductive Bias, continued

• ID3: Efficient approximation to “find shortest tree” method

• Why is this a good thing to do?

Page 35: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Overfitting

• ID3 grows each branch of the tree just deeply enough to perfectly classify the training examples.

• What if number of training examples is small?

• What if there is noise in the data?

• Both can lead to overfitting:
  – The first case can produce an incomplete tree.
  – The second case can produce a too-complicated tree.

But...what is bad about over-complex trees?

Page 36: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Overfitting, continued

• Formal definition of overfitting:

– Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative h′ ∈ H, such that

TrainingError(h) < TrainingError(h’),

but

TestError(h’) < TestError(h).

Page 37: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Overfitting, continued

[Figure: accuracy vs. size of tree (number of nodes), with separate curves for training data and test data, on a medical data set.]

Page 38: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Overfitting, continued

• How to avoid overfitting:

– Stop growing the tree early, before it reaches point of perfect classification of training data.

– Allow tree to overfit the data, but then prune the tree.

Page 39: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Pruning a Decision Tree

• Pruning:
  – Remove the subtree below a decision node.
  – Create a leaf node there, and assign the most common classification of the training examples affiliated with that node.
  – Helps reduce overfitting.

Page 40: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Day Outlook Temp Humidity Wind PlayTennis

D1   Sunny     Hot   High    Weak    No
D2   Sunny     Hot   High    Strong  No
D3   Overcast  Hot   High    Weak    Yes
D4   Rain      Mild  High    Weak    Yes
D5   Rain      Cool  Normal  Weak    Yes
D6   Rain      Cool  Normal  Strong  No
D7   Overcast  Cool  Normal  Strong  Yes
D8   Sunny     Mild  High    Weak    No
D9   Sunny     Cool  Normal  Weak    Yes
D10  Rain      Mild  Normal  Weak    Yes
D11  Sunny     Mild  Normal  Strong  Yes
D12  Overcast  Mild  High    Strong  Yes
D13  Overcast  Hot   Normal  Weak    Yes
D14  Rain      Mild  High    Strong  No
D15  Sunny     Hot   Normal  Strong  No

Training data:

Page 41: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Example

Outlook
  Sunny:    Humidity
              High:   No
              Normal: Temperature
                        Hot:  No
                        Mild: Yes
                        Cool: Yes
  Overcast: Yes
  Rain:     Wind
              Strong: No
              Weak:   Yes

Page 42: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Example

Outlook
  Sunny:    Humidity
              High:   No
              Normal: Temperature
                        Hot:  No
                        Mild: Yes
                        Cool: Yes
  Overcast: Yes
  Rain:     Wind
              Strong: No
              Weak:   Yes

Page 43: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Example

Outlook
  Sunny:    Humidity
              High:   No
              Normal: Temperature
                        Hot:  No
                        Mild: Yes
                        Cool: Yes
  Overcast: Yes
  Rain:     Wind
              Strong: No
              Weak:   Yes

Training examples reaching the Temperature node:
  D9   Sunny  Cool  Normal  Weak    Yes
  D11  Sunny  Mild  Normal  Strong  Yes
  D15  Sunny  Hot   Normal  Strong  No

Page 44: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Example

Outlook
  Sunny:    Humidity
              High:   No
              Normal: Temperature
                        Hot:  No
                        Mild: Yes
                        Cool: Yes
  Overcast: Yes
  Rain:     Wind
              Strong: No
              Weak:   Yes

Training examples reaching the Temperature node:
  D9   Sunny  Cool  Normal  Weak    Yes
  D11  Sunny  Mild  Normal  Strong  Yes
  D15  Sunny  Hot   Normal  Strong  No

Majority: Yes

Page 45: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Example

Outlook
  Sunny:    Humidity
              High:   No
              Normal: Yes
  Overcast: Yes
  Rain:     Wind
              Strong: No
              Weak:   Yes

Page 46: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

How to decide which subtrees to prune?

Page 47: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

How to decide which subtrees to prune?

Need to divide the data into:
  – Training set
  – Pruning (validation) set
  – Test set

Page 48: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

• Reduced Error Pruning:

– Consider each decision node as candidate for pruning.

– For each node, try pruning node. Measure accuracy of pruned tree over pruning set.

– Select single-node pruning that yields best increase in accuracy over pruning set.

– If no increase, select one of the single-node prunings that does not decrease accuracy.

– If all prunings decrease accuracy, then don’t prune. Otherwise, continue this process until further pruning is harmful.
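A sketch of this loop in Python (not the slides' code; it assumes trees represented as nested dicts with class-label strings at the leaves, examples as dicts with a “label” key, and that every decision node is reached by at least one training example):

```python
import copy
from collections import Counter

def classify(tree, ex):
    """Leaves are class-label strings; internal nodes are
    {'attr': name, 'branches': {value: subtree}}."""
    while isinstance(tree, dict):
        tree = tree['branches'][ex[tree['attr']]]
    return tree

def accuracy(tree, data):
    return sum(classify(tree, ex) == ex['label'] for ex in data) / len(data)

def internal_paths(tree, path=()):
    """Yield the branch-value path leading to every decision node."""
    if isinstance(tree, dict):
        yield path
        for value, sub in tree['branches'].items():
            yield from internal_paths(sub, path + (value,))

def prune_at(tree, path, train_data):
    """Return a copy of `tree` with the node at `path` replaced by a leaf
    labelled with the majority class of the training examples reaching it."""
    pruned = copy.deepcopy(tree)
    node, reaching, parent = pruned, train_data, None
    for value in path:
        reaching = [ex for ex in reaching if ex[node['attr']] == value]
        parent, node = node, node['branches'][value]
    majority = Counter(ex['label'] for ex in reaching).most_common(1)[0][0]
    if parent is None:            # pruning the root collapses the tree to one leaf
        return majority
    parent['branches'][path[-1]] = majority
    return pruned

def reduced_error_prune(tree, train_data, prune_data):
    """Greedily apply the single-node pruning that most helps (or at least
    does not hurt) accuracy on the pruning set; stop when every pruning hurts."""
    while isinstance(tree, dict):
        base = accuracy(tree, prune_data)
        candidates = [prune_at(tree, p, train_data) for p in internal_paths(tree)]
        best = max(candidates, key=lambda t: accuracy(t, prune_data))
        if accuracy(best, prune_data) < base:
            break                 # all candidate prunings decrease accuracy
        tree = best
    return tree
```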

Page 49: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Simple validation

• Split training data into training set and validation set.

• Use training set to train model with a given set of parameters (e.g., # training epochs). Then use validation set to predict generalization accuracy.

• Finally, use separate test set to test final classifier.

[Figure: error rate vs. training time (or nodes pruned, or ...), with curves for training and validation error; an arrow marks the point at which to stop training/pruning.]

Page 50: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Miscellaneous

• If you weren’t here last time, see me during the break

• Graduate students (545): sign up for paper presentations
  – This is optional for undergrads (445)
  – Two volunteers for Wednesday, April 17

• Coursepack on reserve in library

• Course mailing list: [email protected]

Page 51: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Today

• Decision trees

• ID3 algorithm for constructing decision trees

• Calculating information gain

• Overfitting

• Reduced error pruning


• Continuous attribute values

• Gain ratio

• UCI ML Repository

• Optdigits data set

• C4.5

• Evaluating classifiers

• Homework 1

Recap from last time

Page 52: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Day Outlook Temp Humidity Wind PlayTennis

D1   Sunny     Hot   High    Weak    No
D2   Sunny     Hot   High    Strong  No
D3   Overcast  Hot   High    Weak    Yes
D4   Rain      Mild  High    Weak    Yes
D5   Rain      Cool  Normal  Weak    Yes
D6   Rain      Cool  Normal  Strong  No
D7   Overcast  Cool  Normal  Strong  Yes
D8   Sunny     Mild  High    Weak    No
D9   Sunny     Cool  Normal  Weak    Yes
D10  Rain      Mild  Normal  Weak    Yes
D11  Sunny     Mild  Normal  Strong  Yes
D12  Overcast  Mild  High    Strong  Yes
D13  Overcast  Hot   Normal  Weak    Yes
D14  Rain      Mild  High    Strong  No

Exercise: What is information gain of Wind?

E(S) = .94
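(Worked out: Wind = Weak covers 8 examples (6 Yes, 2 No), entropy ≈ 0.811; Wind = Strong covers 6 examples (3 Yes, 3 No), entropy = 1.0. So Gain(S, Wind) = .94 − (8/14)(0.811) − (6/14)(1.0) ≈ 0.048.)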

Page 53: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Continuous valued attributes

• Original decision trees: Two discrete aspects:

– Target class (e.g., “PlayTennis”) has discrete values

– Attributes (e.g., “Temperature”) have discrete values

• How to incorporate continuous-valued decision attributes?
  – E.g., Temperature ∈ [0, 100]

Page 54: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Continuous valued attributes, continued

• Create new attributes, e.g., Temperaturec: true if Temperature >= c, false otherwise.

• How to choose c?
  – Find the c that maximizes information gain.

Page 55: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Day Outlook Temp Humidity Wind PlayTennis

D1   Sunny     85  High    Weak    No
D2   Sunny     72  High    Strong  No
D3   Overcast  62  High    Weak    Yes
D4   Rain      60  High    Weak    Yes
D5   Rain      20  Normal  Strong  No
D6   Rain      10  Normal  Weak    Yes

Training data:

Page 56: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

• Sort examples according to the values of Temperature found in the training set:

    Temperature: 10   20   60   62   72   85
    PlayTennis:  Yes  No   Yes  Yes  No   No

• Find adjacent examples that differ in target classification.

• Choose candidate c as the midpoint of the corresponding interval.
  – Can show that the optimal c must always lie at such a boundary.

• Then calculate information gain for each candidate c.

• Choose best one.

• Put new attribute Temperaturec in pool of attributes.
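A sketch of this boundary search in Python (not from the slides; names are illustrative), run on the six-example training set above:

```python
import math

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def candidate_thresholds(values, labels):
    """Sort by value, find adjacent examples whose classes differ, and return
    (midpoint, information gain) for each such boundary."""
    pairs = sorted(zip(values, labels))
    base = entropy([lab for _, lab in pairs])
    out = []
    for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
        if l1 != l2:
            c = (v1 + v2) / 2
            below = [lab for v, lab in pairs if v < c]
            above = [lab for v, lab in pairs if v >= c]
            gain = (base
                    - len(below) / len(pairs) * entropy(below)
                    - len(above) / len(pairs) * entropy(above))
            out.append((c, gain))
    return out

temps  = [10, 20, 60, 62, 72, 85]
labels = ['Yes', 'No', 'Yes', 'Yes', 'No', 'No']
for c, gain in candidate_thresholds(temps, labels):
    print(c, round(gain, 3))   # c = 15.0, 40.0, 67.0 with gains 0.191, 0.0, 0.459
```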

Page 57: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Example

Temperature: 10   20   60   62   72   85
PlayTennis:  Yes  No   Yes  Yes  No   No

Page 58: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Example

Temperature: 10   20   60   62   72   85
PlayTennis:  Yes  No   Yes  Yes  No   No

Page 59: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Example

Temperature: 10   20   60   62   72   85
PlayTennis:  Yes  No   Yes  Yes  No   No

Candidate thresholds: c = 15, c = 40, c = 67

Page 60: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Example

Temperature: 10   20   60   62   72   85
PlayTennis:  Yes  No   Yes  Yes  No   No

Candidate thresholds: c = 15, c = 40, c = 67

Define the new attribute Temperature15, with
Values(Temperature15) = { <15, >=15 }

Page 61: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Day Outlook Temp Humidity Wind PlayTennis

D1   Sunny     >=15  High    Weak    No
D2   Sunny     >=15  High    Strong  No
D3   Overcast  >=15  High    Weak    Yes
D4   Rain      >=15  High    Weak    Yes
D5   Rain      >=15  Normal  Strong  No
D6   Rain      <15   Normal  Weak    Yes

Training data:

What is Gain(S, Temperature15)?

Page 62: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

• All nodes in the decision tree are then of the form: a test on some attribute Ai, with one branch for “>= Threshold” and one for “< Threshold”.

Page 63: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Alternative measures for selecting attributes

• Recall the intuition behind the information gain measure:
  – We want to choose the attribute that, by itself, does the most work in classifying the training examples.
  – So measure how much information is gained (or how much entropy decreases) if that attribute is known.

Page 64: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

• However, information gain measure favors attributes with many values.

• Extreme example: Suppose that we add attribute “Date” to each training example. Each training example has a different date.

Page 65: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Day Date Outlook Temp Humidity Wind PlayTennis

D1   3/1   Sunny     Hot   High    Weak    No
D2   3/2   Sunny     Hot   High    Strong  No
D3   3/3   Overcast  Hot   High    Weak    Yes
D4   3/4   Rain      Mild  High    Weak    Yes
D5   3/5   Rain      Cool  Normal  Weak    Yes
D6   3/6   Rain      Cool  Normal  Strong  No
D7   3/7   Overcast  Cool  Normal  Strong  Yes
D8   3/8   Sunny     Mild  High    Weak    No
D9   3/9   Sunny     Cool  Normal  Weak    Yes
D10  3/10  Rain      Mild  Normal  Weak    Yes
D11  3/11  Sunny     Mild  Normal  Strong  Yes
D12  3/12  Overcast  Mild  High    Strong  Yes
D13  3/13  Overcast  Hot   Normal  Weak    Yes
D14  3/14  Rain      Mild  High    Strong  No

Gain (S, Outlook) = .94 - .694 = .246

What is Gain (S, Date)?
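(Since every training example has a different date, each value of Date picks out a single example. Every subset then has entropy 0, so Gain(S, Date) = .94 − 0 = .94, the largest gain any attribute can achieve on this set.)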

Page 66: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

• Date will be chosen as root of the tree.

• But of course the resulting tree will not generalize

Page 67: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Gain Ratio

• Quinlan proposed another method of selecting attributes, called “gain ratio”:

Suppose attribute A splits the training data S into m subsets. Call the subsets S1, S2, ..., Sm.

We can define a set:

The Penalty Term is the entropy of this set.

For example: What is the Penalty Term for the “Date” attribute? How about for “Outlook”?
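The standard definitions (following Quinlan): the penalty term, often called SplitInformation, is the entropy of the subset proportions,

SplitInformation(S, A) = −Σ_{i=1..m} (|S_i| / |S|) log2(|S_i| / |S|)

and GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A). For Date, the 14 subsets each contain a single example, so the penalty is log2(14) ≈ 3.81 bits; for Outlook, the subsets have sizes 5, 4, and 5, giving a penalty of about 1.58 bits.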

Page 68: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Day Date Outlook Temp Humidity Wind PlayTennis

D1   3/1   Sunny     Hot   High    Weak    No
D2   3/2   Sunny     Hot   High    Strong  No
D3   3/3   Overcast  Hot   High    Weak    Yes
D4   3/4   Rain      Mild  High    Weak    Yes
D5   3/5   Rain      Cool  Normal  Weak    Yes
D6   3/6   Rain      Cool  Normal  Strong  No
D7   3/7   Overcast  Cool  Normal  Strong  Yes
D8   3/8   Sunny     Mild  High    Weak    No
D9   3/9   Sunny     Cool  Normal  Weak    Yes
D10  3/10  Rain      Mild  Normal  Weak    Yes
D11  3/11  Sunny     Mild  Normal  Strong  Yes
D12  3/12  Overcast  Mild  High    Strong  Yes
D13  3/13  Overcast  Hot   Normal  Weak    Yes
D14  3/14  Rain      Mild  High    Strong  No

Page 70: Decision Trees Reading: Textbook, “Learning From Examples”, Section 3.

Homework 1

• How to download homework and data

• Demo of C4.5

• Accounts on Linuxlab?

• How to get to Linux Lab

• Need help on Linux?

• Newer version C5.0: http://www.rulequest.com/see5-info.html