1
Ch 3. Decision Tree Learning
• Decision trees
• Basic learning algorithm (ID3)
• Entropy, information gain
• Hypothesis space
• Inductive bias
• Occam's razor in general
• Overfitting problem & extensions: post-pruning; real values, missing values, attribute costs, …
2
Decision Tree for PlayTennis
[Tree diagram]
Outlook?
  Sunny    → Humidity?  (High → No, Normal → Yes)
  Overcast → Yes
  Rain     → Wind?  (Strong → No, Weak → Yes)
3
Training Examples
Day  Outlook   Temp.  Humidity  Wind    PlayTennis
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Weak    Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Strong  Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No
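For the code sketches that appear later in these notes, it is convenient to have the table above in machine-readable form. The following Python encoding is not part of the original slides; the names ATTRIBUTES and EXAMPLES are just illustrative choices.

```python
# Hypothetical encoding of the PlayTennis training set (names are illustrative).
ATTRIBUTES = ["Outlook", "Temp", "Humidity", "Wind"]

# Each example: (attribute-value dict, class label)
EXAMPLES = [
    ({"Outlook": "Sunny",    "Temp": "Hot",  "Humidity": "High",   "Wind": "Weak"},   "No"),
    ({"Outlook": "Sunny",    "Temp": "Hot",  "Humidity": "High",   "Wind": "Strong"}, "No"),
    ({"Outlook": "Overcast", "Temp": "Hot",  "Humidity": "High",   "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rain",     "Temp": "Mild", "Humidity": "High",   "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rain",     "Temp": "Cool", "Humidity": "Normal", "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rain",     "Temp": "Cool", "Humidity": "Normal", "Wind": "Strong"}, "No"),
    ({"Outlook": "Overcast", "Temp": "Cool", "Humidity": "Normal", "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Sunny",    "Temp": "Mild", "Humidity": "High",   "Wind": "Weak"},   "No"),
    ({"Outlook": "Sunny",    "Temp": "Cool", "Humidity": "Normal", "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rain",     "Temp": "Mild", "Humidity": "Normal", "Wind": "Strong"}, "Yes"),
    ({"Outlook": "Sunny",    "Temp": "Mild", "Humidity": "Normal", "Wind": "Strong"}, "Yes"),
    ({"Outlook": "Overcast", "Temp": "Mild", "Humidity": "High",   "Wind": "Strong"}, "Yes"),
    ({"Outlook": "Overcast", "Temp": "Hot",  "Humidity": "Normal", "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rain",     "Temp": "Mild", "Humidity": "High",   "Wind": "Strong"}, "No"),
]
```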
4
Decision Tree Learning
• Classifies instances by sorting them down the tree to a leaf node containing the class (value); each node tests an attribute of the instance, with one branch for each attribute value
• Widely used, practical method for approximating discrete-valued functions
• Robust to noisy data
• Capable of learning disjunctive expressions
• Typical bias: prefer smaller trees
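A minimal sketch of this "sort the instance down the tree" idea, assuming a nested-dict encoding of the PlayTennis tree; the encoding and function names are illustrative choices, not part of the original slides.

```python
# A minimal sketch: the PlayTennis tree as nested dicts (illustrative encoding).
# Inner nodes: {"attribute": ..., "branches": {value: subtree}}; leaves are class labels.
PLAYTENNIS_TREE = {
    "attribute": "Outlook",
    "branches": {
        "Sunny":    {"attribute": "Humidity",
                     "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"attribute": "Wind",
                     "branches": {"Strong": "No", "Weak": "Yes"}},
    },
}

def classify(tree, instance):
    """Sort an instance down the tree until a leaf (class label) is reached."""
    while isinstance(tree, dict):              # still at an inner node
        value = instance[tree["attribute"]]    # test the node's attribute
        tree = tree["branches"][value]         # follow the matching branch
    return tree                                # leaf = classification

# The query instance from slide 6 is classified as "No":
print(classify(PLAYTENNIS_TREE, {"Outlook": "Sunny", "Temp": "Hot",
                                 "Humidity": "High", "Wind": "Weak"}))
```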
5
Decision Tree for PlayTennis
[Tree diagram: Outlook? with branches Sunny, Overcast, Rain; the Sunny branch tests Humidity? (High → No, Normal → Yes)]
• Each internal node tests an attribute
• Each branch corresponds to an attribute value
• Each leaf node assigns a classification
6
Decision Tree for PlayTennis
[Tree diagram as on slide 2]
Query instance:  Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Weak, PlayTennis=?
Sorting it down the tree (Outlook=Sunny, then Humidity=High) classifies it as No.
7
Decision Tree for Conjunction
Target concept: Outlook=Sunny ∧ Wind=Weak
[Tree diagram]
Outlook?
  Sunny    → Wind?  (Strong → No, Weak → Yes)
  Overcast → No
  Rain     → No
8
Decision Tree for Disjunction
Target concept: Outlook=Sunny ∨ Wind=Weak
[Tree diagram]
Outlook?
  Sunny    → Yes
  Overcast → Wind?  (Strong → No, Weak → Yes)
  Rain     → Wind?  (Strong → No, Weak → Yes)
9
Decision Tree for XOR
Target concept: Outlook=Sunny XOR Wind=Weak
[Tree diagram]
Outlook?
  Sunny    → Wind?  (Strong → Yes, Weak → No)
  Overcast → Wind?  (Strong → No, Weak → Yes)
  Rain     → Wind?  (Strong → No, Weak → Yes)
10
Decision Tree
[Tree diagram as on slide 2; training examples as on slide 3]
• Decision trees represent disjunctions of conjunctions of constraints on attribute values. The tree above corresponds to:
  (Outlook=Sunny ∧ Humidity=Normal) ∨ (Outlook=Overcast) ∨ (Outlook=Rain ∧ Wind=Weak)
Hypothesis space search by ID3
• The hypothesis space is complete: the target function is surely in there
• Outputs a single hypothesis
• No backtracking on selected attributes (greedy search): can get stuck at locally optimal, suboptimal splits
• Statistically based search choices: robust to noisy data
• Inductive bias (search bias): prefer shorter trees over longer ones; place high-information-gain attributes close to the root
25
Inductive Bias in ID3
• H can represent every possible dichotomy of the instance space X (the power set of X). Unbiased, then?
• There is a preference for short trees, and for trees with high-information-gain attributes near the root
• The bias is a preference for some hypotheses, rather than a restriction of the hypothesis space H
• Compare this bias to Candidate-Elimination:
  ID3: complete space, incomplete search → bias comes from the search strategy
  Candidate-Elimination: incomplete space, complete search → bias comes from the expressive power of H
26
Occam’s Razor
Why prefer short hypotheses?
Argument in favor:
• There are fewer short hypotheses than long hypotheses
• A short hypothesis that fits the data is unlikely to be a coincidence
• A long hypothesis that fits the data might be a coincidence
Argument against:
• There are many ways to define small sets of hypotheses, e.g. all trees with a prime number of nodes that use attributes beginning with "Z"
• What is so special about small sets based on the size of the hypothesis?
27
Overfitting - definition
Consider the error of hypothesis h over
• the training data: error_train(h)
• the entire distribution D of data: error_D(h)
Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h′ ∈ H such that
  error_train(h) < error_train(h′)  and  error_D(h) > error_D(h′)
Possible reasons: noise and small samples
• coincidences are possible with small samples
• noisy data makes the learner build a large tree h; a smaller tree h′ that does not fit the noise is likely to work better
28
Overfitting in Decision Trees
[Plot: error vs. number of nodes. Training-set error keeps decreasing as the tree grows, while validation-set error reaches its minimum at the optimal number of nodes and then rises again.]
29
Avoid Overfitting
How can we avoid overfitting?
• Stop growing when a data split is not statistically significant
• Grow the full tree, then post-prune (this has been found more successful in practice)
Minimum description length (MDL):
• Minimize: size(tree) + size(misclassifications(tree))
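As a rough illustration of the MDL criterion above, the score to minimize can be approximated by counting tree nodes and misclassified training examples. This is only a sketch under the nested-dict tree encoding used earlier; a real MDL approach would measure actual coding lengths in bits.

```python
# Illustrative MDL-style score (not from the slides): tree cost approximated by
# node count, exception cost by the number of misclassified training examples.
def count_nodes(tree):
    """Count nodes of a nested-dict tree (leaves are plain class labels)."""
    if not isinstance(tree, dict):
        return 1
    return 1 + sum(count_nodes(sub) for sub in tree["branches"].values())

def classify(tree, instance):
    while isinstance(tree, dict):
        tree = tree["branches"][instance[tree["attribute"]]]
    return tree

def mdl_score(tree, examples):
    """Score to minimize: size(tree) + size(misclassifications(tree))."""
    errors = sum(1 for x, label in examples if classify(tree, x) != label)
    return count_nodes(tree) + errors
```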
30
How to decide tree size?
What criterion to use?
• a separate test set to evaluate the usefulness of pruning
• use all data and apply a statistical test to estimate whether expanding/pruning is likely to produce an improvement
• use an explicit complexity measure (coding length of data and tree), and stop growth when it is minimized
31
Training/validation sets
Split the available data into
• a training set: apply learning to this
• a validation set: evaluate the result (accuracy, impact of pruning); a 'safety check' against overfitting
Common strategy: use 2/3 of the data for training
32
Reduced error pruning
Pruning a node: make an inner node a leaf node and assign it the most common class
Procedure:
• Split data into training and validation sets
• Do until further pruning is harmful:
  1. Evaluate the impact on the validation set of pruning each possible node (plus the subtree below it)
  2. Greedily remove the one that most improves validation-set accuracy
Produces the smallest version of the most accurate subtree
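A compact sketch of this procedure, again assuming the nested-dict tree encoding and (instance, label) example pairs from the earlier sketches; all helper names are illustrative.

```python
from collections import Counter
import copy

def classify(tree, x):
    while isinstance(tree, dict):
        tree = tree["branches"][x[tree["attribute"]]]
    return tree

def accuracy(tree, examples):
    return sum(classify(tree, x) == y for x, y in examples) / len(examples)

def inner_paths(tree, path=()):
    """Yield the branch-value path to every inner (non-leaf) node."""
    if isinstance(tree, dict):
        yield path
        for value, sub in tree["branches"].items():
            yield from inner_paths(sub, path + (value,))

def majority_class_at(tree, path, examples):
    """Most common class among training examples sorted to the node at `path`."""
    labels = []
    for x, y in examples:
        node, reaches = tree, True
        for value in path:
            if x[node["attribute"]] != value:
                reaches = False
                break
            node = node["branches"][value]
        if reaches:
            labels.append(y)
    pool = labels or [y for _, y in examples]   # fall back if nothing reaches the node
    return Counter(pool).most_common(1)[0][0]

def prune_at(tree, path, leaf):
    """Return a copy of the tree with the subtree at `path` replaced by `leaf`."""
    if not path:
        return leaf
    tree = copy.deepcopy(tree)
    node = tree
    for value in path[:-1]:
        node = node["branches"][value]
    node["branches"][path[-1]] = leaf
    return tree

def reduced_error_prune(tree, train, validation):
    """Greedily prune nodes as long as validation accuracy does not drop."""
    while True:
        base, best = accuracy(tree, validation), None
        for path in inner_paths(tree):
            candidate = prune_at(tree, path, majority_class_at(tree, path, train))
            acc = accuracy(candidate, validation)
            if acc >= base and (best is None or acc > best[0]):
                best = (acc, candidate)
        if best is None:
            return tree
        tree = best[1]
```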
33
Rule Post-Pruning
Procedure (C4.5 uses a variant):
• Infer the decision tree as usual (allowing it to overfit)
• Convert the tree to rules (one rule per path from root to leaf)
• Prune each rule independently: remove preconditions if the result is more accurate
• Sort the rules by estimated accuracy into the desired sequence of use
• Apply the rules in this order when classifying
34
Rule post-pruning…
Estimating the accuracy of a rule:
• a separate validation set, or
• the training data with pessimistic estimates (the training data is too favorable to the rules): compute the accuracy and its standard deviation, and take the lower bound of a given confidence interval (e.g. 95%) as the measure
• for large sets this is very close to the observed accuracy; not statistically valid, but it works in practice
35
Why convert to rules?
• Distinguishes the different contexts in which a node is used: a separate pruning decision for each path
• No distinction between the root and inner nodes: no bookkeeping needed on how to reorganize the tree if the root node is pruned
• Improves readability
36
Converting a Tree to Rules
[Tree diagram as on slide 2]
R1: If (Outlook=Sunny) ∧ (Humidity=High) Then PlayTennis=No
R2: If (Outlook=Sunny) ∧ (Humidity=Normal) Then PlayTennis=Yes
R3: If (Outlook=Overcast) Then PlayTennis=Yes
R4: If (Outlook=Rain) ∧ (Wind=Strong) Then PlayTennis=No
R5: If (Outlook=Rain) ∧ (Wind=Weak) Then PlayTennis=Yes
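Extracting one rule per root-to-leaf path can be done mechanically. A small sketch, assuming the nested-dict tree encoding used earlier; the default class in apply_rules is an illustrative assumption, not something the slides specify.

```python
def tree_to_rules(tree, preconditions=()):
    """Return one (preconditions, class) rule per root-to-leaf path."""
    if not isinstance(tree, dict):                 # leaf: emit the accumulated rule
        return [(list(preconditions), tree)]
    rules = []
    for value, subtree in tree["branches"].items():
        rules += tree_to_rules(subtree, preconditions + ((tree["attribute"], value),))
    return rules

def apply_rules(rules, instance, default="Yes"):
    """Use the first rule whose preconditions all hold (rules assumed pre-sorted)."""
    for conditions, label in rules:
        if all(instance.get(attr) == value for attr, value in conditions):
            return label
    return default

# Example with the PlayTennis tree from the earlier sketch:
PLAYTENNIS_TREE = {
    "attribute": "Outlook",
    "branches": {
        "Sunny":    {"attribute": "Humidity",
                     "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"attribute": "Wind",
                     "branches": {"Strong": "No", "Weak": "Yes"}},
    },
}
for conditions, label in tree_to_rules(PLAYTENNIS_TREE):
    print(conditions, "->", label)     # yields the five rules R1..R5
```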
37
Continuous values
Define a new discrete-valued attribute:
• partition the continuous value range into a discrete set of intervals
• A_c = true iff A < c
• How to select the best threshold c? (by information gain)
Example: sort the examples by the continuous value and identify boundaries where the class changes
Temperature  15°C  18°C  19°C  22°C  24°C  27°C
PlayTennis   No    No    Yes   Yes   Yes   No
38
Continuous values…
Fact: the threshold value that maximizes information gain always lies on such a class boundary
Evaluation: compute the gain for each candidate boundary and pick the best
Extensions:
• splits into multiple intervals (multiple thresholds)
• linear threshold units (LTUs) based on several attributes
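A sketch of the boundary-based threshold search described above, using the temperature example from the previous slide; the helper names are illustrative choices.

```python
# Illustrative threshold selection for a continuous attribute: sort by value,
# take candidate thresholds at class boundaries, pick the one with highest gain.
from math import log2
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))
    base = entropy([y for _, y in pairs])
    best = None
    for i in range(len(pairs) - 1):
        if pairs[i][1] != pairs[i + 1][1]:             # class boundary
            c = (pairs[i][0] + pairs[i + 1][0]) / 2    # midpoint threshold
            below = [y for v, y in pairs if v < c]
            above = [y for v, y in pairs if v >= c]
            gain = base - (len(below) / len(pairs)) * entropy(below) \
                        - (len(above) / len(pairs)) * entropy(above)
            if best is None or gain > best[0]:
                best = (gain, c)
    return best                                        # (gain, threshold)

temps  = [15, 18, 19, 22, 24, 27]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temps, labels))   # -> (≈0.459, 18.5): best split between 18°C and 19°C
```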
39
Alternative selection measures
The information gain measure favors attributes with many values:
• they separate the data into many small subsets
• high gain, but poor prediction
• e.g. imagine using Date=xx.yy.zz as an attribute: it perfectly splits the data into subsets of size 1
Use the gain ratio instead of information gain:
• penalize the gain with the split information, which is sensitive to how broadly and uniformly the attribute splits the data:
  SplitInformation(S, A) = -Σ_i (|S_i| / |S|) log2(|S_i| / |S|)
  GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
  where S_i is the subset of S for which attribute A has value v_i
• this discourages selecting attributes with many uniformly distributed values (for n such values SplitInformation = log2 n; for a Boolean attribute it is 1)
• in practice it is better to apply heuristics: compute Gain first, and compute GainRatio only when the gain is large enough (e.g. above average)
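The gain and gain-ratio computations can be written directly from the formulas above. A sketch assuming the (instance, label) pair encoding from the earlier examples; note that SplitInformation(S, A) is simply the entropy of the attribute-value distribution.

```python
from math import log2
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def partition(examples, attribute):
    """Split (instance, label) pairs by the value of `attribute`."""
    subsets = {}
    for x, y in examples:
        subsets.setdefault(x[attribute], []).append((x, y))
    return subsets

def gain(examples, attribute):
    rest = sum(len(sub) / len(examples) * entropy([y for _, y in sub])
               for sub in partition(examples, attribute).values())
    return entropy([y for _, y in examples]) - rest

def split_information(examples, attribute):
    # SplitInformation is the entropy of the attribute's value distribution.
    return entropy([x[attribute] for x, _ in examples])

def gain_ratio(examples, attribute):
    si = split_information(examples, attribute)
    return gain(examples, attribute) / si if si > 0 else 0.0

# e.g. gain_ratio(EXAMPLES, "Outlook") with the dataset sketch from slide 3
```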
41
Another alternative
A distance-based measure:
• define a metric between partitions of the data
• evaluate each attribute by the distance between the partition it creates and the perfect partition
• choose the attribute whose partition is closest to perfect
This has been shown not to be biased towards attributes with large value sets. Many alternatives like this exist in the literature.
42
Missing values
Estimate the missing value from the other examples with known values.
When computing Gain(S, A) and A(x) is unknown:
• assign the most common value of A in S, or
• assign the most common value of A among examples with the same class c(x), or
• assign a probability to each value and distribute fractional counts of x down the corresponding branches
Similar techniques are used at classification time.
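A small sketch of the first two strategies (most common value overall, or most common value within the same class), assuming missing values are represented as None in the (instance, label) encoding used earlier; the function name is illustrative.

```python
from collections import Counter

def fill_missing(examples, attribute, by_class=False):
    """Replace None values of `attribute` with the most common observed value.

    If by_class is True, use the most common value among examples that share
    the same class label as the incomplete example.
    """
    def most_common(pairs):
        return Counter(x[attribute] for x, _ in pairs
                       if x[attribute] is not None).most_common(1)[0][0]

    filled = []
    for x, y in examples:
        if x[attribute] is None:
            pool = [(xi, yi) for xi, yi in examples if yi == y] if by_class else examples
            x = {**x, attribute: most_common(pool)}
        filled.append((x, y))
    return filled
```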
43
Attributes with differing costs
When measuring an attribute costs something:
• prefer cheap attributes where possible; use costly ones only if they give good gain
• introduce a cost term into the selection measure; this gives no guarantee of finding the optimum, but biases the search towards cheaper attributes
• e.g. replace Gain(S, A) by Gain²(S, A) / Cost(A)
Example applications:
• a robot with a sonar: the time required to position the sensor
• medical diagnosis: the cost of a laboratory test
44
Summary
• Decision-tree induction is a popular approach to classification that enables us to interpret the output hypothesis
• Complete hypothesis space; preference bias: shorter trees preferred
• Overfitting is an important issue
• Techniques exist to deal with continuous attributes and missing attribute values