Decision Tree Learning - WordPress.com...2016/10/03 · Decision Tree Learning Decision Tree Learning is a method for approximating discrete- valued target functions, in which the

Decision Tree Learning

CSE 6003 – Machine Learning and Reasoning

Outline

◘ What is Decision Tree Learning?

◘ What is Decision Tree?

◘ Decision Tree Examples

◘ Decision Trees to Rules

◘ Decision Tree Construction

◘ Decision Tree Algorithms

◘ Decision Tree Overfitting

http://www.newcastle-schools.org.uk/nsn/chemistry/Radioactivity/Contents Page.htm




Paradigms of Machine Learning

Machine Learning

Neural Network

Genetic Algorithms

Decision Trees

Bayesian Learning

Decision Tree technique is one of the machine learning techniques

Learning Types

Learning

Supervised Learning Unsupervised Learning

Classification

Regression

Clustering

Association Analysis

Decision Tree Learning

Bayesian Learning

Nearest Neighbour

Neural Networks

Support Vector Machines

Sequence Analysis

Summarization

Descriptive Statistics

Outlier Analysis

Scoring

Decision Tree Learning is in the supervised learning type.

Decision Tree Learning

◘ Decision Tree Learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree.

◘ Decision Tree Learning is robust to noisy data and capable of learning disjunctive expressions.

◘ It is one of the most widely used methods for inductive inference.

[Figure: example decision tree for house hiring, with tests Salary < 1 M, Job = teacher and Age < 30, and leaves labeled Good or Bad]

Decision Tree Representation

◘ Decision Trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance.

◘ Each node in the tree specifies a test of some attribute of the instance.

◘ Each branch descending from that node corresponds to one of the possible values of this attribute.

Decision Trees

◘ Decision Tree is a tree where

– internal nodes are simple decision rules on one or more attributes

– each branch corresponds to an attribute value

– leaf nodes are predicted class labels

◘ Decision trees are used for deciding between several courses of action

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no

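age?
  <=30    -> student?
               no  -> no
               yes -> yes
  31..40  -> yes
  >40     -> credit rating?
               excellent -> no
               fair      -> yes

Each internal node is an attribute, each branch is a value of that attribute, and each leaf is a classification.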

Decision Tree Applications

[Figure: decision tree whose leaves are labeled class1 to class5]

◘ Has been used for

1. Classification

2. Data Reduction

◘ Initial attribute set: {A1, A2, A3, A4, A5, A6}

◘ Reduced attribute set: {A1, A4, A6}

A4?
  -> A1?
       -> Class 1
       -> Class 2
  -> A6?
       -> Class 1
       -> Class 2

Decision Tree Example

◘ A credit card company receives thousands of applications for new cards. Each application contains information about an applicant:

– age

– marital status

– annual salary

– outstanding debts

– credit rating

– etc.

◘ Problem: decide whether an application should be approved, that is, classify applications into two categories, approved and not approved.

Decision Tree Example (Cont)

Approved or not


Decision nodes and leaf nodes (classes)


◘ Construct a classification model from the data

◘ Use the model to classify future loan applications into

– Yes (approved) and

– No (not approved)

◘ What is the class for following case/instance?

Use the Decision Tree (Cont)

No

Once the tree is trained, a new instance is classified by starting at the root and following the path dictated by the test results for this instance.
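As a minimal sketch of this root-to-leaf traversal, the code below encodes the buys_computer tree from the earlier slide and classifies one new instance. The tuple/dict encoding, the function name classify and the sample instance are assumptions chosen for illustration.

```python
# An internal node is (attribute, {value: subtree}); a plain string is a leaf
# holding the predicted class. This mirrors the buys_computer tree.
tree = ("age", {
    "<=30": ("student", {"no": "no", "yes": "yes"}),
    "31..40": "yes",
    ">40": ("credit_rating", {"fair": "yes", "excellent": "no"}),
})

def classify(node, instance):
    """Follow the branch matching each test result until a leaf is reached."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[instance[attribute]]
    return node

# Hypothetical new instance; income is present but never tested by this tree.
new_instance = {"age": "<=30", "income": "medium",
                "student": "yes", "credit_rating": "fair"}
print(classify(tree, new_instance))
```

Only the attributes on the chosen path are inspected, so classification costs at most one test per tree level.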


◘ Problem: decide whether to wait for a table at a restaurant

◘ Attributes:

1. Alternate: is there an alternative restaurant nearby?

2. Bar: is there a comfortable bar area to wait in?

3. Fri/Sat: is today Friday or Saturday?

4. Hungry: are we hungry?

5. Patrons: number of people in the restaurant (None, Some, Full)

6. Price: price range ($, $$, $$$)

7. Raining: is it raining outside?

8. Reservation: have we made a reservation?

9. Type: kind of restaurant (French, Italian, Thai, Burger)

10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)

Decision Tree Example (Cont.)

◘ Classification of examples is positive (T) or negative (F)


◘ Here is the “true” tree for deciding whether to wait

Decision Trees to Rules


◘ It is easy to derive a rule set from a decision tree

◘ Write a rule for each path in the decision tree from the root to a leaf.

◘ Can be represented as if-then rules

Example:

IF (Outlook = Sunny) AND (Humidity = High)

THEN PlayTennis = No
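The path-to-rule conversion can be sketched as a short recursive walk that collects the tests along each root-to-leaf path. The tree below is the standard PlayTennis tree; its Rain branch (a test on Wind) is assumed from the usual textbook version of the example, and the encoding and function name are illustrative choices.

```python
# An internal node is (attribute, {value: subtree}); a string is a leaf.
# Assumption: the Rain branch tests Wind, as in the usual PlayTennis tree.
tree = ("Outlook", {
    "Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain": ("Wind", {"Strong": "No", "Weak": "Yes"}),
})

def tree_to_rules(node, conditions=()):
    """Return one IF-THEN rule per root-to-leaf path."""
    if isinstance(node, str):  # leaf: emit the rule for the path so far
        body = " AND ".join(f"({a} = {v})" for a, v in conditions)
        return [f"IF {body} THEN PlayTennis = {node}"]
    attribute, branches = node
    rules = []
    for value, child in branches.items():
        rules.extend(tree_to_rules(child, conditions + ((attribute, value),)))
    return rules

rules = tree_to_rules(tree)
for rule in rules:
    print(rule)
```

The tree has five leaves, so the walk yields five rules, one per path.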


Decision Trees Construction

Decision Tree

◘ Each node tests some attribute of the instance

◘ Instances are represented by attribute-value pairs

◘ High information gain attributes close to the root

◘ Root: best attribute for classification

Which attribute is the best classifier?

answer based on information gain

Entropy

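◘ Entropy specifies the minimum number of bits of information needed to encode the classification of an arbitrary member of S

◘ In general, for m class labels, where p_i is the proportion of S belonging to class i:

$$\mathrm{Entropy}(S) = -\sum_{i=1}^{m} p_i \log_2 p_i$$

◘ Example for two class labels:

$$\mathrm{Entropy}(S) = -p_1 \log_2 p_1 - p_2 \log_2 p_2$$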
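A minimal sketch of the entropy formula, computed from per-class example counts. The function name and the convention of passing counts rather than probabilities are assumptions for illustration; terms with zero count are skipped, following the usual convention that 0 log 0 = 0.

```python
import math

def entropy(class_counts):
    """Entropy in bits of a set described by its per-class example counts."""
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count > 0:  # a class with no examples contributes nothing
            p = count / total
            result -= p * math.log2(p)
    return result

print(round(entropy([9, 5]), 3))  # 9 positive and 5 negative examples
print(entropy([7, 7]))            # evenly split set: maximum for two classes
print(entropy([14, 0]))           # pure set: no uncertainty
```

For two classes the value ranges from 0 for a pure set to 1 bit for an even split.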

Entropy

Information Gain

◘ Measures the expected reduction in entropy given the value of some attribute A

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{i \in \mathrm{Values}(A)} \frac{|S_i|}{|S|}\,\mathrm{Entropy}(S_i)$$

Values(A): set of all possible values for attribute A

S_i: subset of S for which attribute A has value i
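To make the gain formula concrete, the sketch below applies it to the buys_computer table shown earlier in these slides. The helper names and the dict-per-row representation are assumptions; the age value is written as 31..40 in plain ASCII.

```python
import math
from collections import Counter

COLUMNS = ["age", "income", "student", "credit_rating", "buys_computer"]
ROWS = [row.split() for row in """\
<=30 high no fair no
<=30 high no excellent no
31..40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31..40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31..40 medium no excellent yes
31..40 high yes fair yes
>40 medium no excellent no""".splitlines()]
DATA = [dict(zip(COLUMNS, row)) for row in ROWS]

def entropy(rows, target):
    """Entropy of the target column over the given rows."""
    counts = Counter(row[target] for row in rows)
    return -sum(n / len(rows) * math.log2(n / len(rows)) for n in counts.values())

def information_gain(rows, attribute, target):
    """Entropy(S) minus the size-weighted entropy of each subset S_i."""
    gain = entropy(rows, target)
    for value in {row[attribute] for row in rows}:
        subset = [row for row in rows if row[attribute] == value]
        gain -= len(subset) / len(rows) * entropy(subset, target)
    return gain

gains = {a: information_gain(DATA, a, "buys_computer") for a in COLUMNS[:-1]}
for attribute, g in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"Gain(S, {attribute}) = {g:.3f}")
```

On this table age has the highest gain, which is why it sits at the root of the buys_computer tree.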


Which attribute first?



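$$\mathrm{Entropy}(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$$

$$\mathrm{Gain}(S, Wind) = \mathrm{Entropy}(S) - \frac{|S_{Weak}|}{|S|}\,\mathrm{Entropy}(S_{Weak}) - \frac{|S_{Strong}|}{|S|}\,\mathrm{Entropy}(S_{Strong})$$

$$= 0.940 - \frac{8}{14}\cdot 0.811 - \frac{6}{14}\cdot 1.0 = 0.048$$

$$\mathrm{Gain}(S, Humidity) = \mathrm{Entropy}(S) - \frac{|S_{High}|}{|S|}\,\mathrm{Entropy}(S_{High}) - \frac{|S_{Normal}|}{|S|}\,\mathrm{Entropy}(S_{Normal})$$

$$= 0.940 - \frac{7}{14}\cdot 0.985 - \frac{7}{14}\cdot 0.592 = 0.151$$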

Gain(S, Outlook) = 0.246

Gain(S, Temperature) = 0.029

Gain(S, Humidity) = 0.151

Gain(S, Wind) = 0.048

Outlook has the highest gain, so it is tested at the root.


Decision Tree Construction

◘ Which attribute is next?

Outlook
  Sunny    -> ?
  Overcast -> Yes
  Rain     -> ?

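$$\mathrm{Gain}(S_{Sunny}, Humidity) = 0.970 - \frac{3}{5}\cdot 0.0 - \frac{2}{5}\cdot 0.0 = 0.970$$

$$\mathrm{Gain}(S_{Sunny}, Temperature) = 0.970 - \frac{2}{5}\cdot 0.0 - \frac{2}{5}\cdot 1.0 - \frac{1}{5}\cdot 0.0 = 0.570$$

$$\mathrm{Gain}(S_{Sunny}, Wind) = 0.970 - \frac{2}{5}\cdot 1.0 - \frac{3}{5}\cdot 0.918 = 0.019$$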


[Figure: final tree with the training examples at each leaf. Overcast: [D3,D7,D12,D13]; remaining leaves: [D9,D11], [D1,D2,D8], [D4,D5,D10], [D6,D14]]

Another Example

At the weekend:

- go shopping,

- watch a movie,

- play tennis or

- just stay in.

What you do depends on three things:

- the weather (windy, rainy or sunny);

- how much money you have (rich or poor)

- whether your parents are visiting.

Another Example (Cont.)

height   hair    eyes    class
short    blond   blue    +
tall     blond   brown   -
tall     red     blue    +
short    dark    blue    -
tall     dark    blue    -
tall     blond   blue    +
tall     dark    brown   -
short    blond   brown   -

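$$I(3+, 5-) = -\frac{3}{8}\log_2\frac{3}{8} - \frac{5}{8}\log_2\frac{5}{8} = 0.954434003$$

Height: short (1+, 2-), tall (2+, 3-)

$$\mathrm{Gain}(height) = 0.954434003 - \frac{3}{8}\,I(1+, 2-) - \frac{5}{8}\,I(2+, 3-)$$

$$= 0.954434003 - \frac{3}{8}\left(-\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3}\right) - \frac{5}{8}\left(-\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5}\right) = 0.003228944$$

Hair: blond (2+, 2-), red (1+, 0-), dark (0+, 3-)

$$\mathrm{Gain}(hair) = 0.954434003 - \frac{4}{8}\left(-\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4}\right) - \frac{1}{8}\left(-\frac{1}{1}\log_2\frac{1}{1} - 0\right) - \frac{3}{8}\left(0 - \frac{3}{3}\log_2\frac{3}{3}\right)$$

$$= 0.954434003 - 0.5 = 0.454434003$$

Eyes: blue (3+, 2-), brown (0+, 3-)

$$\mathrm{Gain}(eyes) = 0.954434003 - \frac{5}{8}\left(-\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5}\right) - \frac{3}{8}\left(0 - \frac{3}{3}\log_2\frac{3}{3}\right)$$

$$= 0.954434003 - 0.606844122 = 0.347589881$$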

“Hair” is the best attribute.

Another Example

Splitting the eight examples on hair gives:

hair = dark:
  short, dark, blue: -
  tall, dark, blue: -
  tall, dark, brown: -

hair = red:
  tall, red, blue: +

hair = blond:
  short, blond, blue: +
  tall, blond, brown: -
  tall, blond, blue: +
  short, blond, brown: -

The dark and red branches are pure; only the blond branch still needs a further test.
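The construction procedure used in these examples (pick the attribute with the highest gain, split on it, then repeat on each impure subset) can be sketched as a small recursive function. This is an ID3-style illustration run on the height/hair/eyes table, not a full implementation: the names and the tuple-based tree encoding are assumptions, and it has no pruning or handling of missing values.

```python
import math
from collections import Counter

ATTRIBUTES = ["height", "hair", "eyes"]
ROWS = [
    ("short", "blond", "blue", "+"), ("tall", "blond", "brown", "-"),
    ("tall", "red", "blue", "+"),    ("short", "dark", "blue", "-"),
    ("tall", "dark", "blue", "-"),   ("tall", "blond", "blue", "+"),
    ("tall", "dark", "brown", "-"),  ("short", "blond", "brown", "-"),
]
# Each example becomes ({attribute: value}, class_label).
EXAMPLES = [(dict(zip(ATTRIBUTES, row[:3])), row[3]) for row in ROWS]

def entropy(examples):
    counts = Counter(label for _, label in examples)
    total = len(examples)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def partition(examples, attribute):
    """Group examples by their value for the given attribute."""
    subsets = {}
    for features, label in examples:
        subsets.setdefault(features[attribute], []).append((features, label))
    return subsets

def gain(examples, attribute):
    remainder = sum(len(s) / len(examples) * entropy(s)
                    for s in partition(examples, attribute).values())
    return entropy(examples) - remainder

def id3(examples, attributes):
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:   # pure subset: make a leaf
        return labels[0]
    if not attributes:          # no tests left: fall back to the majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a))
    remaining = [a for a in attributes if a != best]
    return (best, {value: id3(subset, remaining)
                   for value, subset in partition(examples, best).items()})

tree = id3(EXAMPLES, ATTRIBUTES)
print(tree)
```

On this table the sketch puts hair at the root and resolves the blond branch with a test on eyes, since eyes separates the four blond examples perfectly while height does not.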

Another Example (Cont.)

Decision Trees Algorithms

Decision Tree Algorithms

◘ ID3

– Quinlan (1981)

– Tries to reduce the expected number of comparisons

◘ C4.5

– Quinlan (1993)

– It is an extension of ID3

– Just starting to be used in data mining applications

– Also used for rule induction

◘ CART

– Breiman, Friedman, Olshen, and Stone (1984)

– Classification and Regression Trees

◘ CHAID

– Kass (1980)

– Oldest decision tree algorithm

– Well established in database marketing industry

◘ QUEST

– Loh and Shih (1997)

Frequency Usage

Complexity of Tree Induction

◘ Assume

– m attributes

– n training instances

– tree depth O (log n)

◘ Building a tree: O (m n log n), since each of the O (log n) levels processes all n instances once for each of the m attributes

◘ Total cost: O (m n log n)

Decision Tree Advantages and Disadvantages

Positives (+)

+ Reasonable training time

+ Fast application

+ Easy to interpret

+ Rule extraction from trees (can be re-represented as if-then-else rules)

+ Easy to implement

+ Can handle large number of features

+ Does not require any prior knowledge of data distribution

Negatives (-)

- Cannot handle complicated relationships between features

- Problems with lots of missing data

- Output attribute must be categorical

- Limited to one output attribute

- Difficulty in designing an optimal decision tree

- Overlap, especially when the number of classes is large
