Decision Trees
Outline
- What is a decision tree?
- How to construct a decision tree?
- What are the major steps in decision tree induction?
- How to select the attribute to split the node?
- What are the other issues?
Classification by Decision Tree Induction

A decision tree is a flow-chart-like tree structure:
- Internal nodes denote a test on an attribute
- Branches represent the outcomes of the test
- Leaf nodes represent class labels or class distributions

[Figure: an example tree. The root tests Age? with branches <=30, 31…40, and >40. The <=30 branch leads to Student? (no -> NO, yes -> YES), the 31…40 branch leads directly to YES, and the >40 branch leads to Credit? (fair -> YES, excellent -> NO).]
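To make the structure concrete, here is a minimal illustrative sketch (not from the slides) of how such a tree can be represented and used to classify a tuple; the Node class and attribute keys are assumptions:

```python
# Minimal sketch of the flow-chart-like structure described above.
# Internal nodes test an attribute; branches are outcomes; leaves hold class labels.

class Node:
    def __init__(self, attribute=None, branches=None, label=None):
        self.attribute = attribute      # attribute tested at an internal node
        self.branches = branches or {}  # outcome value -> child Node
        self.label = label              # class label at a leaf (None for internal nodes)

def classify(node, tuple_):
    """Follow the branch matching the tuple's attribute value until a leaf is reached."""
    while node.label is None:
        node = node.branches[tuple_[node.attribute]]
    return node.label

# The example tree for "buys_computer" shown above.
tree = Node("age", {
    "<=30":  Node("student", {"no": Node(label="no"), "yes": Node(label="yes")}),
    "31…40": Node(label="yes"),
    ">40":   Node("credit", {"fair": Node(label="yes"), "excellent": Node(label="no")}),
})

print(classify(tree, {"age": "<=30", "student": "yes", "credit": "fair"}))  # -> yes
```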
Training Dataset

ID   Age     Income   Student   Credit      Buys_computer
P1   <=30    high     no        fair        no
P2   <=30    high     no        excellent   no
P3   31…40   high     no        fair        yes
P4   >40     medium   no        fair        yes
P5   >40     low      yes       fair        yes
P6   >40     low      yes       excellent   no
P7   31…40   low      yes       excellent   yes
P8   <=30    medium   no        fair        no
P9   <=30    low      yes       fair        yes
P10  >40     medium   yes       fair        yes
P11  <=30    medium   yes       excellent   yes
P12  31…40   medium   no        excellent   yes
P13  31…40   high     yes       fair        yes
P14  >40     medium   no        excellent   no
Output: A Decision Tree for “buys_computer”

[Figure: the induced tree. Root: Age? with branches <=30, 31…40, and >40. The <=30 branch leads to Student? (no -> NO, yes -> YES); the 31…40 branch leads to YES; the >40 branch leads to Credit? (fair -> YES, excellent -> NO).]
Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm):
- Attributes are categorical (continuous-valued attributes are discretized in advance)
- The tree is constructed in a top-down, recursive, divide-and-conquer manner
- At the start, all training examples are at the root
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Examples are partitioned recursively based on the selected attributes
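A compact sketch of this greedy, top-down procedure, using information gain (defined later in these slides) as the selection measure; the functions and dict-based tree shape are illustrative, not the textbook's exact pseudocode:

```python
import math
from collections import Counter

def entropy(labels):
    """I(p1,...,pm) = -sum pi*log2(pi) over the class distribution of the labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attribute, target):
    """Gain(A) = I(S) - E(A): parent entropy minus the weighted entropy of the partitions."""
    parts = Counter(ex[attribute] for ex in examples)
    remainder = sum(
        (count / len(examples))
        * entropy([ex[target] for ex in examples if ex[attribute] == value])
        for value, count in parts.items()
    )
    return entropy([ex[target] for ex in examples]) - remainder

def build_tree(examples, attributes, target="buys_computer"):
    """Top-down, recursive, divide-and-conquer induction (greedy)."""
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1 or not attributes:           # pure node, or nothing left to test
        return {"label": Counter(labels).most_common(1)[0][0]}
    best = max(attributes, key=lambda a: info_gain(examples, a, target))
    node = {"attribute": best, "branches": {}}
    for value in {ex[best] for ex in examples}:           # partition on the chosen attribute
        subset = [ex for ex in examples if ex[best] == value]
        node["branches"][value] = build_tree(subset, [a for a in attributes if a != best], target)
    return node
```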
Construction of A Decision Tree for “buys_computer”

Start with all training examples at the root: [P1, ..., P14], Yes: 9, No: 5. No test attribute has been chosen for this node yet.
Construction of A Decision Tree for “buys_computer”

The root [P1, ..., P14] (Yes: 9, No: 5) is split on Age?, with branches <=30, 31…40, and >40.
Construction of A Decision Tree for “buys_computer”

Splitting the root [P1, ..., P14] (Yes: 9, No: 5) on Age? gives three partitions:
- <=30: [P1, P2, P8, P9, P11], Yes: 2, No: 3 (still mixed, needs a further test)
- 31…40: [P3, P7, P12, P13], Yes: 4, No: 0 (pure, becomes leaf YES)
- >40: [P4, P5, P6, P10, P14], Yes: 3, No: 2 (still mixed, needs a further test)
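As a quick check of these counts, here is a small illustrative sketch that partitions the 14 tuples on Age (only the Age and class columns are reproduced) and tallies the class labels per branch:

```python
from collections import Counter

# (Age, Buys_computer) for P1..P14, taken from the training dataset above.
rows = [("<=30", "no"), ("<=30", "no"), ("31…40", "yes"), (">40", "yes"),
        (">40", "yes"), (">40", "no"), ("31…40", "yes"), ("<=30", "no"),
        ("<=30", "yes"), (">40", "yes"), ("<=30", "yes"), ("31…40", "yes"),
        ("31…40", "yes"), (">40", "no")]

for age in ("<=30", "31…40", ">40"):
    counts = Counter(label for a, label in rows if a == age)
    print(age, dict(counts))
# <=30 {'no': 3, 'yes': 2}   31…40 {'yes': 4}   >40 {'yes': 3, 'no': 2}
```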
Construction of A Decision Tree for “buys_computer”

The <=30 partition [P1, P2, P8, P9, P11] (Yes: 2, No: 3) is split on Student?:
- no: [P1, P2, P8], Yes: 0, No: 3 (leaf NO)
- yes: [P9, P11], Yes: 2, No: 0 (leaf YES)
The 31…40 branch is already the leaf YES; the >40 branch [P4, P5, P6, P10, P14] (Yes: 3, No: 2) still needs a test.
Construction of A Decision Tree for “buys_computer”

The >40 partition [P4, P5, P6, P10, P14] (Yes: 3, No: 2) is split on Credit?:
- fair: [P4, P5, P10], Yes: 3, No: 0 (leaf YES)
- excellent: [P6, P14], Yes: 0, No: 2 (leaf NO)
All branches now end in leaves, completing the tree shown earlier.
Which Attribute is the Best?

The best attribute is the one most useful for classifying the examples. Information gain:
- An information-theoretic measure of how well an attribute separates the training examples
- The attribute with the highest information gain is used to split the node
- This tends to minimize the expected number of tests needed to classify a new tuple

How useful is an attribute? How well does it separate the examples? How pure are the resulting partitions? Information gain answers all of these questions.
Choosing an attribute

Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative". In the restaurant example worked out below, Patrons? is a better choice than Type?.
Information theory

If there are n equally probable possible messages, then the probability p of each is 1/n, and the information conveyed by a message is -log2(p) = log2(n). E.g., if there are 16 messages, then log2(16) = 4, and we need 4 bits to identify/send each message.

In general, if we are given a probability distribution P = (p1, p2, ..., pn), then the information conveyed by the distribution (a.k.a. the entropy of P) is:
I(P) = -(p1*log2(p1) + p2*log2(p2) + ... + pn*log2(pn))
Information theory II

Information conveyed by a distribution (a.k.a. the entropy of P):
I(P) = -(p1*log2(p1) + p2*log2(p2) + ... + pn*log2(pn))

Examples:
- If P is (0.5, 0.5), then I(P) is 1
- If P is (0.67, 0.33), then I(P) is about 0.92
- If P is (1, 0), then I(P) is 0

The more uniform the probability distribution, the greater its information: more information is conveyed by a message telling you which event actually occurred. Entropy is the average number of bits per message needed to represent a stream of messages.
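These values can be checked with a short illustrative sketch (logarithms are base 2, and the term for p = 0 is taken as 0):

```python
import math

def information(P):
    """I(P) = -sum(p * log2(p)), with the convention 0*log2(0) = 0."""
    return sum(-p * math.log2(p) for p in P if p > 0)

print(information([0.5, 0.5]))    # 1.0
print(information([2/3, 1/3]))    # ~0.918 (the slide rounds 0.67/0.33 to 0.92)
print(information([1.0, 0.0]))    # 0.0
```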
Information for classification

If a set S of records is partitioned into disjoint, exhaustive classes (C1, C2, ..., Ck) on the basis of the value of the class attribute, then the information needed to identify the class of an element of S is Info(S) = I(P), where P is the probability distribution of the partition (C1, C2, ..., Ck):
P = (|C1|/|S|, |C2|/|S|, ..., |Ck|/|S|)

[Figure: a set spread evenly across C1, C2, and C3 has high information; a set dominated by one class has low information.]
Information, Entropy, and Information Gain

S contains si tuples of class Ci, for i = 1, ..., m. Information measures the "amount of info" required to classify an arbitrary tuple:
I(s1, s2, ..., sm) = -(p1*log2(p1) + p2*log2(p2) + ... + pm*log2(pm))
where pi = si/|S| is the probability that an arbitrary tuple belongs to Ci.

Example: S contains 100 tuples, 25 belong to class C1 and 75 belong to class C2:
I(25, 75) = -(25/100)*log2(25/100) - (75/100)*log2(75/100) = 0.811
Information, Entropy, and Information Gain

Information reflects the "purity" of the data set:
- A low information value indicates high purity
- A high information value indicates high diversity

Example: S contains 100 tuples (with the convention 0*log2(0) = 0).
- If 0 belong to class C1 and 100 belong to class C2:
  I(0, 100) = -(0/100)*log2(0/100) - (100/100)*log2(100/100) = 0
- If 50 belong to class C1 and 50 belong to class C2:
  I(50, 50) = -(50/100)*log2(50/100) - (50/100)*log2(50/100) = 1
Information for classification II

If we partition S with respect to attribute X into sets {T1, T2, ..., Tn}, then the information needed to identify the class of an element of S becomes the weighted average of the information needed to identify the class of an element of Ti, i.e., the weighted average of Info(Ti):
Info(X, S) = sum over i of (|Ti|/|S|) * Info(Ti)

[Figure: a split whose partitions are class-pure has low information; a split whose partitions remain mixed has high information.]
Information gain

Consider the quantity Gain(X, S) defined as
Gain(X, S) = Info(S) - Info(X, S)
This is the difference between the information needed to identify an element of S and the information needed to identify an element of S after the value of attribute X has been obtained; that is, it is the gain in information due to attribute X.

We can use this to rank attributes and to build decision trees in which each node is assigned the attribute with the greatest gain among the attributes not yet considered on the path from the root. The intent of this ordering is:
- To create small decision trees, so that records can be identified after only a few questions
- To match a hoped-for minimality of the process represented by the records being considered (Occam's Razor)
Information, Entropy, and Information Gain

S contains si tuples of class Ci, for i = 1, ..., m. Attribute A has values {a1, a2, ..., av}. Let sij be the number of tuples that belong to class Ci and have value aj for attribute A.

Entropy of attribute A:
E(A) = sum over j = 1..v of ((s1j + ... + smj)/|S|) * I(s1j, ..., smj)

Information gained by branching on attribute A:
Gain(A) = I(s1, s2, ..., sm) - E(A)
Information, Entropy, and Information Gain

Let Tj be the set of tuples having value aj for attribute A. Then s1j + ... + smj = |Tj| and I(s1j, ..., smj) = I(Tj), so the entropy of attribute A can be written as
E(A) = sum over j = 1..v of (|Tj|/|S|) * I(Tj)
where |Tj|/|S| is the proportion of tuples that fall in Tj and I(Tj) is the information of Tj.
Information, Entropy, and Information Gain

Example: S contains 100 tuples, 40 belong to class C1 (red) and 60 belong to class C2 (blue), so I(40, 60) = 0.971. Attribute A has three values:
- A = a1: 20 tuples (10 of C1, 10 of C2), I(10, 10) = 1
- A = a2: 30 tuples (10 of C1, 20 of C2), I(10, 20) = 0.918
- A = a3: 50 tuples (20 of C1, 30 of C2), I(20, 30) = 0.971

E(A) = (20/100)*1 + (30/100)*0.918 + (50/100)*0.971 = 0.961
Gain(A) = I(40, 60) - E(A) = 0.971 - 0.961 = 0.01
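These numbers can be reproduced with a few lines of Python; the helper name `information` is illustrative, not from the slides:

```python
import math

def information(counts):
    """I(s1,...,sm) computed from class counts."""
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)

# Class counts (C1, C2) in each branch of attribute A, and at the parent node.
branches = [(10, 10), (10, 20), (20, 30)]          # A = a1, a2, a3
parent = (40, 60)

E_A = sum(sum(b) / sum(parent) * information(b) for b in branches)
print(round(information(parent), 3))               # 0.971
print(round(E_A, 3))                               # 0.961
print(round(information(parent) - E_A, 3))         # Gain(A) = 0.01
```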
Computing information gain

The restaurant example: 12 training examples, 6 positive (Y) and 6 negative (N). Candidate attributes include Patrons (values Empty, Some, Full) and Type (values French, Italian, Thai, Burger).

I(S) = -(.5 log .5 + .5 log .5) = .5 + .5 = 1
I(Pat, S) = 1/6 * 0 + 1/3 * 0 + 1/2 * (-(2/3 log 2/3 + 1/3 log 1/3)) ≈ 1/2 * (2/3 * 0.585 + 1/3 * 1.585) ≈ 0.46
I(Type, S) = 1/6 * 1 + 1/6 * 1 + 1/3 * 1 + 1/3 * 1 = 1

Gain(Pat, S) = 1 - 0.46 = 0.54
Gain(Type, S) = 1 - 1 = 0

So Patrons is the better attribute to split on.
Regarding the Definition of Entropy…

The textbook uses "entropy" for two related quantities (polymorphism):
- On page 134 (Equ. 3.6), entropy is defined on a set of tuples:
  Entropy(S) = -sum over i = 1..m of pi*log2(pi)
- On page 287 (Equ. 7.2), entropy is defined on an attribute:
  Entropy(A) = sum over j = 1..v of ((s1j + ... + smj)/|S|) * I(s1j, ..., smj)

When entropy is defined on tuples, use Equ. 3.6; when entropy is defined on an attribute, use Equ. 7.2.
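To see both uses of "entropy" on the training data above, here is an illustrative sketch (the helper names `info` and `entropy_of_attr` are not from the textbook) that scores every attribute; Age should come out with the largest gain, which is why it was chosen at the root earlier:

```python
import math
from collections import Counter

# The 14 training tuples: (Age, Income, Student, Credit, Buys_computer).
data = [
    ("<=30", "high", "no", "fair", "no"),        ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),      (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),        (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),       (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),     (">40", "medium", "no", "excellent", "no"),
]
attrs = ["age", "income", "student", "credit"]

def info(labels):            # Equ. 3.6 style: entropy of a set of tuples
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def entropy_of_attr(i):      # Equ. 7.2 style: weighted entropy after splitting on attribute i
    groups = Counter(row[i] for row in data)
    return sum(cnt / len(data) * info([r[-1] for r in data if r[i] == v])
               for v, cnt in groups.items())

info_S = info([row[-1] for row in data])
print(round(info_S, 3))                                    # ~0.940
for i, name in enumerate(attrs):
    print(name, round(info_S - entropy_of_attr(i), 3))     # age has the largest gain (~0.246)
```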
How well does it work?

Many case studies have shown that decision trees are at least as accurate as human experts:
- In a study on diagnosing breast cancer, humans correctly classified the examples 65% of the time; the decision tree classified 72% correctly
- British Petroleum designed a decision tree for gas-oil separation on offshore oil platforms that replaced an earlier rule-based expert system
- Cessna designed an airplane flight controller using 90,000 examples and 20 attributes per example
Extracting Classification Rules from Trees

- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand
Examples of Classification Rules

[Figure: the "buys_computer" tree from earlier. Root: Age? with branches <=30, 31…40, >40; <=30 -> Student? (no -> NO, yes -> YES); 31…40 -> YES; >40 -> Credit? (fair -> YES, excellent -> NO).]

Classification rules:
1. IF age = "<=30" AND student = "no" THEN buys_computer = "no"
2. IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
3. IF age = "31…40" THEN buys_computer = "yes"
4. IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
5. IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
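A small illustrative sketch of this procedure, using a nested-dict representation of the tree above (the dict shape and helper name are assumptions, not from the slides):

```python
def extract_rules(node, conditions=()):
    """One rule per root-to-leaf path; the attribute tests along the path form the conjunction."""
    if "label" in node:
        conds = " AND ".join(f'{a} = "{v}"' for a, v in conditions) or "TRUE"
        return [f'IF {conds} THEN buys_computer = "{node["label"]}"']
    rules = []
    for value, child in node["branches"].items():
        rules += extract_rules(child, conditions + ((node["attribute"], value),))
    return rules

# The tree built earlier, as a nested dict.
tree = {"attribute": "age", "branches": {
    "<=30":  {"attribute": "student", "branches": {"no": {"label": "no"}, "yes": {"label": "yes"}}},
    "31…40": {"label": "yes"},
    ">40":   {"attribute": "credit", "branches": {"excellent": {"label": "no"}, "fair": {"label": "yes"}}},
}}

for rule in extract_rules(tree):
    print(rule)   # prints the five rules listed above
```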
Avoid Over-fitting in Classification

The generated tree may over-fit the training data:
- Too many branches, some of which may reflect anomalies due to noise or outliers
- The result is poor accuracy on unseen samples

Two approaches to avoiding over-fitting:
- Pre-pruning: halt tree construction early; do not split a node if doing so would make the goodness measure fall below a threshold (it is difficult to choose an appropriate threshold)
- Post-pruning: remove branches from a "fully grown" tree, producing a sequence of progressively pruned trees, and use a set of data different from the training data to decide which is the "best pruned tree"
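Selecting the "best pruned tree" can be as simple as scoring each candidate on the held-out data. A minimal sketch, assuming a `classify(tree, example)` helper and a list of candidate pruned trees produced elsewhere (both hypothetical here):

```python
def accuracy(tree, validation, classify):
    """Fraction of held-out tuples the tree labels correctly."""
    hits = sum(1 for ex in validation if classify(tree, ex) == ex["buys_computer"])
    return hits / len(validation)

def best_pruned_tree(candidates, validation, classify):
    """From a sequence of progressively pruned trees, pick the one that
    scores highest on data not used for training (post-pruning)."""
    return max(candidates, key=lambda t: accuracy(t, validation, classify))
```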
Enhancements to basic decision tree induction

- Dynamic discretization for continuous-valued attributes: dynamically define new discrete-valued attributes that partition a continuous attribute's values into a discrete set of intervals
- Handle missing attribute values: assign the most common value of the attribute, or assign a probability to each of the possible values
- Attribute construction: create new attributes based on existing ones that are sparsely represented; this reduces fragmentation (the number of samples at a branch becomes too small to be statistically significant), repetition (an attribute is repeatedly tested along a branch), and replication (duplicate subtrees)
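A minimal sketch of the first missing-value strategy (assign the most common value); the function name and the use of None as the missing-value marker are assumptions:

```python
from collections import Counter

def fill_missing(examples, attribute, missing=None):
    """Replace missing values of an attribute with its most common observed value."""
    observed = [ex[attribute] for ex in examples if ex[attribute] != missing]
    most_common = Counter(observed).most_common(1)[0][0]
    return [{**ex, attribute: most_common} if ex[attribute] == missing else ex
            for ex in examples]

people = [{"student": "yes"}, {"student": None}, {"student": "yes"}, {"student": "no"}]
print(fill_missing(people, "student"))   # the None becomes "yes"
```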
Classification in Large Databases

Classification is a classical problem extensively studied by statisticians and machine learning researchers. Scalability means classifying data sets with millions of examples and hundreds of attributes with reasonable speed.

Why decision tree induction in data mining?
- Relatively fast learning speed (compared with other classification methods)
- Convertible to simple and easy-to-understand classification rules
- Can use SQL queries for accessing databases
- Classification accuracy comparable with other methods
Scalable Decision Tree Induction Methods

- SLIQ (EDBT'96, Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory
- SPRINT (VLDB'96, J. Shafer et al.): constructs an attribute-list data structure
- PUBLIC (VLDB'98, Rastogi & Shim): integrates tree splitting and tree pruning, stopping the growth of the tree earlier
- RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label)
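RainForest's per-attribute AVC structure is essentially a table of class-label counts per attribute value at a node, small enough to fit in memory even when the node's tuples do not. A rough illustrative sketch, not the paper's implementation:

```python
from collections import defaultdict

def avc_set(rows, attr_index, class_index=-1):
    """For one attribute at a node: counts of each class label per attribute value."""
    counts = defaultdict(lambda: defaultdict(int))
    for row in rows:          # in practice, one scan over the node's partition of the data
        counts[row[attr_index]][row[class_index]] += 1
    return {value: dict(c) for value, c in counts.items()}

rows = [("<=30", "no"), ("<=30", "yes"), (">40", "yes"), (">40", "no"), (">40", "yes")]
print(avc_set(rows, 0))   # {'<=30': {'no': 1, 'yes': 1}, '>40': {'yes': 2, 'no': 1}}
```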
Summary

What is a decision tree?
- A flow-chart-like tree: internal nodes, branches, and leaf nodes

How to construct a decision tree? What are the major steps in decision tree induction?
- Test attribute selection
- Sample partitioning

How to select the attribute to split the node?
- Select the attribute with the highest information gain: calculate the information of the node, calculate the entropy of the attribute, and take the difference between the information and the entropy

What are the other issues?
- Rule extraction, over-fitting and pruning, enhancements, and scalability