
COMP3740 CR32: Knowledge Management and Adaptive Systems

Supervised ML to learn Classifiers: Decision Trees and Classification Rules

Eric Atwell, School of Computing, University of Leeds
(including re-use of teaching resources from other sources, esp. Knowledge Management by Stuart Roberts, School of Computing, University of Leeds)

Reminder: Objectives of data mining

• Data mining aims to find useful patterns in data.

• For this we need:
  – Data mining techniques, algorithms, tools, eg WEKA
  – A methodological framework to guide us in collecting data and applying the best algorithms: CRISP-DM

• TODAY’S objective: learn how to learn classifiers

• Decision Trees and Classification Rules

• Supervised Machine Learning: training set has the “answer” (class) for each example (instance)

Reminder: Concepts that can be “learnt”

The types of concepts we try to ‘learn’ include:

• Clusters or ‘natural’ partitions
  – Eg we might cluster customers according to their shopping habits.

• Rules for classifying examples into pre-defined classes
  – Eg “Mature students studying information systems with a high grade for General Studies A level are likely to get a 1st class degree”

• General associations
  – Eg “People who buy nappies are in general likely also to buy beer”

• Numerical prediction
  – Eg Salary = a*A-level + b*Age + c*Gender + d*Prog + e*Degree
    (but are Gender, Programme really numbers???)

Output: decision tree

Outlook = sunny:
    Humidity = high   → Play = ‘no’
    Humidity = normal → Play = ‘yes’
Outlook = rainy:
    Windy = true  → Play = ‘no’
    Windy = false → Play = ‘yes’

Decision Tree Analysis

• Example instance set:

  Shares files   Uses scanner   Infected before   Risk
  Yes            Yes            No                High
  Yes            No             No                High
  No             No             Yes               Medium
  Yes            Yes            Yes               Low
  Yes            Yes            No                High
  No             Yes            No                Low
  Yes            No             Yes               High

Can we predict, from the first 3 columns, the risk of getting a virus?

For convenience later: F = ‘shares Files’, S = ‘uses Scanner’, I = ‘Infected before’

Decision tree building method

• Forms a decision tree
  – tries for a small tree covering all or most of the training set
  – internal nodes represent a test on an attribute value
  – branches represent outcomes of the test
• Decides which attribute to test at each node
  – this is based on a measure of ‘entropy’
• Must avoid ‘over-fitting’
  – if the tree is complex enough it might describe the training set exactly, but be no good for prediction
• May leave some ‘exceptions’

Building a decision tree (DT)

The algorithm is recursive; at any step:
  T = set of (remaining) training instances,
  {C1, …, Ck} = set of classes

• If all instances in T belong to a single class Ci, then DT is a leaf node identifying class Ci. (done!)

• If T contains instances belonging to mixed classes, then choose a test based on a single attribute that will partition T into subsets {T1, …, Tn} according to the n outcomes of the test. The DT for T comprises a root node identifying the test and one branch for each outcome of the test.

• The branches are formed by applying the rules above recursively to each of the subsets {T1, …, Tn}.
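A minimal code sketch of this recursion (Python is used here for illustration; the dict-based tree format and the choose_attribute parameter are my own conventions, not the lecture's). It is run on the virus-risk training set introduced on the next slide:

```python
from collections import Counter

def build_tree(instances, attributes, choose_attribute):
    """instances: list of (attribute_value_dict, class_label) pairs."""
    classes = [label for _, label in instances]
    if len(set(classes)) == 1:          # all instances in one class -> leaf (done!)
        return classes[0]
    if not attributes:                  # no attribute left -> majority-class leaf
        return Counter(classes).most_common(1)[0][0]
    attr = choose_attribute(instances, attributes)
    remaining = [a for a in attributes if a != attr]
    tree = {attr: {}}
    for value in set(values[attr] for values, _ in instances):
        subset = [(v, c) for v, c in instances if v[attr] == value]
        tree[attr][value] = build_tree(subset, remaining, choose_attribute)
    return tree

# The virus-risk training set from the next slide (F, S, I -> Risk)
data = [
    ({'F': 'Yes', 'S': 'Yes', 'I': 'No'},  'High'),
    ({'F': 'Yes', 'S': 'No',  'I': 'No'},  'High'),
    ({'F': 'No',  'S': 'No',  'I': 'Yes'}, 'Medium'),
    ({'F': 'Yes', 'S': 'Yes', 'I': 'Yes'}, 'Low'),
    ({'F': 'Yes', 'S': 'Yes', 'I': 'No'},  'High'),
    ({'F': 'No',  'S': 'Yes', 'I': 'No'},  'Low'),
    ({'F': 'Yes', 'S': 'No',  'I': 'Yes'}, 'High'),
]

# Test attributes in the fixed order F, I, S just to make the run deterministic;
# a real learner would instead pick the attribute with the best entropy score.
order = {'F': 0, 'I': 1, 'S': 2}
print(build_tree(data, ['F', 'S', 'I'],
                 lambda inst, attrs: min(attrs, key=order.get)))
```

This run puts F at the root and yields the same small tree as the worked example that follows, apart from an equally valid choice between S and I on the F = No branch.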

Tree Building example

T = the full training set:

  F    S    I    Risk
  Yes  Yes  No   High
  Yes  No   No   High
  No   No   Yes  Medium
  Yes  Yes  Yes  Low
  Yes  Yes  No   High
  No   Yes  No   Low
  Yes  No   Yes  High

Classes = {High, Medium, Low}

Choose a test based on F, number of outcomes n = 2 (Yes or No):

F = yes → T1
F = no  → T2

T1 (F = Yes):
  F    S    I    Risk
  Yes  Yes  No   High
  Yes  No   No   High
  Yes  Yes  Yes  Low
  Yes  Yes  No   High
  Yes  No   Yes  High

T2 (F = No):
  F    S    I    Risk
  No   No   Yes  Medium
  No   Yes  No   Low

Tree Building example

Now split T1. Classes = {High, Medium, Low}

Choose a test based on I, number of outcomes n = 2 (Yes or No):

F = yes → split on I:
    I = yes → T3
    I = no  → T4
F = no  → T2

T3 (I = Yes):
  F    S    I    Risk
  Yes  Yes  Yes  Low
  Yes  No   Yes  High

T4 (I = No):
  F    S    I    Risk
  Yes  Yes  No   High
  Yes  No   No   High
  Yes  Yes  No   High

Tree Building example

Every instance in T4 has Risk = ‘High’, so the I = no branch becomes a leaf:

F = yes → split on I:
    I = yes → T3
    I = no  → Risk = ‘High’
F = no  → T2

T3 (I = Yes):
  F    S    I    Risk
  Yes  Yes  Yes  Low
  Yes  No   Yes  High

Tree Building example

Now split T3. Classes = {High, Medium, Low}

Choose a test based on S, number of outcomes n = 2 (Yes or No):

F = yes → split on I:
    I = yes → split on S:
        S = yes → { Yes Yes Yes Low }
        S = no  → { Yes No  Yes High }
    I = no  → Risk = ‘High’
F = no  → T2

Tree Building example

Both subsets of T3 are pure, so they become leaves:

F = yes → split on I:
    I = yes → split on S:
        S = yes → Risk = ‘Low’
        S = no  → Risk = ‘High’
    I = no  → Risk = ‘High’
F = no  → T2

Tree Building example

Finally split T2. Classes = {High, Medium, Low}

Choose a test based on S, number of outcomes n = 2 (Yes or No):

T2 (F = No):
  F    S    I    Risk
  No   No   Yes  Medium
  No   Yes  No   Low

F = yes → split on I:
    I = yes → split on S:
        S = yes → Risk = ‘Low’
        S = no  → Risk = ‘High’
    I = no  → Risk = ‘High’
F = no  → split on S:
    S = yes → { No Yes No Low }
    S = no  → { No No  Yes Medium }

Tree Building example

Both subsets of T2 are pure, so they become leaves and the tree is complete:

F = yes → split on I:
    I = yes → split on S:
        S = yes → Risk = ‘Low’
        S = no  → Risk = ‘High’
    I = no  → Risk = ‘High’
F = no  → split on S:
    S = yes → Risk = ‘Low’
    S = no  → Risk = ‘Medium’

Example Decision Tree

Shares files?
    no  → Uses scanner?
              no  → medium
              yes → low
    yes → Infected before?
              no  → high
              yes → Uses scanner?
                        no  → high
                        yes → low
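Read as code, the finished tree is just nested tests. A small illustrative sketch (the function name predict_risk is mine, not from the slides), classifying one unseen instance:

```python
def predict_risk(shares_files, uses_scanner, infected_before):
    """Walk the example decision tree above for one instance ('yes'/'no' values)."""
    if shares_files == 'no':
        return 'low' if uses_scanner == 'yes' else 'medium'
    # shares files = yes
    if infected_before == 'no':
        return 'high'
    # infected before = yes: the scanner decides
    return 'low' if uses_scanner == 'yes' else 'high'

# A machine that shares files, runs a scanner, and has never been infected:
print(predict_risk('yes', 'yes', 'no'))   # -> 'high'
```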

Which attribute to test?

• The ROOT could be S or I instead of F – leading to a different Decision Tree
• The best DT is the “smallest”, most concise model
• The search space in general is too large to find the smallest tree by exhaustive search (trying them all)
• Instead we look for the attribute which splits the training set into the most homogeneous sets
• The measure used for ‘homogeneity’ is based on entropy
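As a concrete sketch of such a measure, the snippet below computes the standard information gain used by ID3/C4.5-style learners (the lecture does not spell out the exact formula, so take the details as an assumption): the attribute with the largest gain produces the most homogeneous subsets.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(T) = -sum of p_i * log2(p_i) over the class proportions p_i in T."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(instances, attr):
    """Reduction in entropy when instances (a list of (values_dict, label)
    pairs) are partitioned by the values of attribute attr."""
    labels = [label for _, label in instances]
    remainder = 0.0
    for value in set(v[attr] for v, _ in instances):
        subset = [label for v, label in instances if v[attr] == value]
        remainder += len(subset) / len(instances) * entropy(subset)
    return entropy(labels) - remainder
```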

Tree Building example (modified)

T = the same instances, but with a two-class target, High Risk?:

  F    S    I    High Risk?
  Yes  Yes  No   Yes
  Yes  No   No   Yes
  No   No   Yes  No
  Yes  Yes  Yes  No
  Yes  Yes  No   Yes
  No   Yes  No   No
  Yes  No   Yes  Yes

Classes = {Yes, No}

Choose a test based on F, number of outcomes n = 2 (Yes or No):

F = Yes subset:
  F    S    I    High Risk?
  Yes  Yes  No   Yes
  Yes  No   No   Yes
  Yes  Yes  Yes  No
  Yes  Yes  No   Yes
  Yes  No   Yes  Yes

F = No subset:
  F    S    I    High Risk?
  No   No   Yes  No
  No   Yes  No   No

Majority labels: F = Yes → High Risk = ‘yes’ (5, 1); F = No → High Risk = ‘no’ (2, 0)
(the two numbers are the coverage and the number of errors at each branch)

Tree Building example (modified)

Alternatively, choose a test based on S, number of outcomes n = 2 (Yes or No):

S = Yes subset:
  F    S    I    High Risk?
  Yes  Yes  No   Yes
  Yes  Yes  Yes  No
  Yes  Yes  No   Yes
  No   Yes  No   No

S = No subset:
  F    S    I    High Risk?
  Yes  No   No   Yes
  No   No   Yes  No
  Yes  No   Yes  Yes

Majority labels: S = Yes → High Risk = ‘no’ (4, 2); S = No → High Risk = ‘yes’ (3, 1)

Tree Building example (modified)

Or choose a test based on I, number of outcomes n = 2 (Yes or No):

I = Yes subset:
  F    S    I    High Risk?
  No   No   Yes  No
  Yes  Yes  Yes  No
  Yes  No   Yes  Yes

I = No subset:
  F    S    I    High Risk?
  Yes  Yes  No   Yes
  Yes  No   No   Yes
  Yes  Yes  No   Yes
  No   Yes  No   No

Majority labels: I = Yes → High Risk = ‘no’ (3, 1); I = No → High Risk = ‘yes’ (4, 1)
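The slides show only the class counts at each branch, but the entropy-based gains can be worked out explicitly. The computation below is my own illustrative arithmetic (not from the lecture) and shows why F is the preferred root test:

```python
from collections import Counter
from math import log2

def entropy(labels):
    t = len(labels)
    return -sum(n / t * log2(n / t) for n in Counter(labels).values())

# Modified training set: (F, S, I, High Risk?)
rows = [('Yes', 'Yes', 'No',  'Yes'), ('Yes', 'No',  'No',  'Yes'),
        ('No',  'No',  'Yes', 'No'),  ('Yes', 'Yes', 'Yes', 'No'),
        ('Yes', 'Yes', 'No',  'Yes'), ('No',  'Yes', 'No',  'No'),
        ('Yes', 'No',  'Yes', 'Yes')]
target = [r[3] for r in rows]            # 4 Yes, 3 No -> H(T) is about 0.985

for i, attr in enumerate(['F', 'S', 'I']):
    # weighted entropy of the two subsets produced by testing this attribute
    remainder = sum(
        sum(1 for r in rows if r[i] == value) / len(rows)
        * entropy([r[3] for r in rows if r[i] == value])
        for value in ('Yes', 'No'))
    print(attr, round(entropy(target) - remainder, 3))

# Prints gains of roughly F 0.47, S 0.02, I 0.13, so a test on F gives by far
# the most homogeneous split and is the best choice for the root.
```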

Decision tree building algorithm

• For each decision point:
  – If the remaining examples are all +ve or all -ve, stop.
  – Else if there are some +ve and some -ve examples left and some attributes left, pick the remaining attribute with the largest information gain.
  – Else if there are no examples left, no such example has been observed; return a default.
  – Else if there are no attributes left, examples with the same description have different classifications: noise, insufficient attributes, or a nondeterministic domain.

Evaluation of decision trees

• At the leaf nodes two numbers are given:
  – N: the coverage for that node: how many instances it covers
  – E: the error rate: how many wrongly classified instances
• The whole tree can be evaluated in terms of its size (number of nodes) and overall error rate, expressed as the number and percentage of cases wrongly classified.
• We seek small trees that have low error rates.

Evaluation of decision trees

• The error rate for the whole tree can also be displayed in terms of a confusion matrix:

  (A)  (B)  (C)   <-- classified as
   35    2    1   Class (A) = high
    4   41    5   Class (B) = medium
    2    5   68   Class (C) = low
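Here the columns are what the tree predicted ("classified as") and the rows are the true classes, so the diagonal holds the correctly classified cases. A short sketch of reading the overall error rate off this matrix:

```python
# Confusion matrix from the slide: rows = actual class, columns = "classified as",
# in the order (A) high, (B) medium, (C) low.
matrix = [[35,  2,  1],
          [ 4, 41,  5],
          [ 2,  5, 68]]

total   = sum(sum(row) for row in matrix)        # 163 instances in all
correct = sum(matrix[i][i] for i in range(3))    # 144 on the diagonal
errors  = total - correct                        # 19 misclassified
print(f"error rate = {errors}/{total} = {errors / total:.1%}")   # about 11.7%
```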

Evaluation of decision trees

• The error rates mentioned on the previous slides are normally computed using
  a. the training set of instances, or
  b. a test set of instances – some different examples!
• If the decision tree algorithm has ‘over-fitted’ the data, then the error rate based on the training set will be far less than that based on the test set.

Evaluation of decision trees

• 10-fold cross-validation can be used when the training set is limited in size:
  – Divide the available data randomly into 10 subsets.
  – Build a tree from 9 of the subsets and test it using the 10th.
  – Repeat the experiment 9 more times, using a different test subset each time.
  – The overall error rate is the average over the 10 experiments.
• 10-fold cross-validation will lead to up to 10 different decision trees being built. The method for selecting or constructing the best tree is not clear.
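A sketch of the procedure (build_classifier and classify are placeholder names for whatever learner is being evaluated, e.g. a decision tree builder; the shuffling detail is an assumption):

```python
import random

def cross_validate(instances, build_classifier, k=10, seed=0):
    """Average error rate over k train/test splits of the available data.
    instances: list of (features, label); build_classifier(train) must return
    a function mapping features -> predicted label."""
    data = instances[:]
    random.Random(seed).shuffle(data)            # random division into folds
    folds = [data[i::k] for i in range(k)]       # k roughly equal subsets
    error_rates = []
    for i in range(k):
        test = folds[i]                          # hold out the i-th subset
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        classify = build_classifier(train)       # build a tree from the other 9
        errors = sum(1 for features, label in test if classify(features) != label)
        error_rates.append(errors / len(test))
    return sum(error_rates) / k                  # overall error rate = the average
```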

From decision trees to rules

• Decision trees may not be easy to interpret:
  – tests associated with lower nodes have to be read in the context of tests further up the tree
  – ‘sub-concepts’ may sometimes be split up and distributed to different parts of the tree (see next slide)
  – Computer Scientists may prefer “if … then …” rules!

DT for “F = G = 1 or J = K = 1”

F = 0:
    J = 0 → no
    J = 1:
        K = 0 → no
        K = 1 → yes
F = 1:
    G = 1 → yes
    G = 0:
        J = 0 → no
        J = 1:
            K = 0 → no
            K = 1 → yes

J = K = 1 is split across two subtrees.

Converting DT to rules

• Step 1: Every path from root to leaf represents a rule:

F = 0:
    J = 0 → no
    J = 1:
        K = 0 → no
        K = 1 → yes
F = 1:
    G = 1 → yes
    G = 0:
        J = 0 → no
        J = 1:
            K = 0 → no
            K = 1 → yes

If F = 0 and J = 0 then class no
If F = 0 and J = 1 and K = 0 then class no
If F = 0 and J = 1 and K = 1 then class yes
….
If F = 1 and G = 0 and J = 1 and K = 1 then class yes
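Step 1 amounts to a simple tree walk. A sketch, assuming the nested-dict tree representation used in the earlier builder sketch:

```python
def tree_to_rules(tree, conditions=()):
    """Turn every root-to-leaf path into one 'If ... then class ...' rule.
    tree: either a class label (leaf) or {attribute: {value: subtree}}."""
    if not isinstance(tree, dict):                              # leaf reached
        body = " and ".join(f"{a} = {v}" for a, v in conditions)
        return [f"If {body} then class {tree}"]
    (attr, branches), = tree.items()
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, conditions + ((attr, value),))
    return rules

# The tree for "F = G = 1 or J = K = 1" from the previous slide:
tree = {'F': {0: {'J': {0: 'no', 1: {'K': {0: 'no', 1: 'yes'}}}},
              1: {'G': {1: 'yes',
                        0: {'J': {0: 'no', 1: {'K': {0: 'no', 1: 'yes'}}}}}}}}
for rule in tree_to_rules(tree):
    print(rule)        # If F = 0 and J = 0 then class no, ... (7 rules in total)
```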

Generalising rules

If F = 0 and J = 1 and K = 1 then class yes
If F = 1 and G = 0 and J = 1 and K = 1 then class yes

Dropping conditions generalises the rule set, e.g. to:

If G = 1 then class yes
If J = 1 and K = 1 then class yes

Tidying up rule sets

• Generalisation leads to 2 problems:
• Rules are no longer mutually exclusive
  – Order the rules and use the first matching rule as the operative rule.
  – Ordering is based on how many false positive errors the rule makes.
• The rule set is no longer exhaustive
  – Choose a default value for the class when no rule applies.
  – The default class is the one that contains the most training cases not covered by any rule.
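A sketch of how a tidied rule set could then be applied (the rule representation, the exact false-positive ordering and the fallback to the overall majority class when every training case is covered are my assumptions about the details):

```python
def tidy_rule_set(rules, training):
    """rules: list of (matches, cls) pairs, where matches(features) -> bool.
    training: list of (features, label) pairs."""
    def false_positives(rule):
        matches, cls = rule
        return sum(1 for f, label in training if matches(f) and label != cls)

    # Order rules so that the ones making fewest false-positive errors fire first
    ordered = sorted(rules, key=false_positives)

    # Default class: majority among training cases not covered by any rule
    # (falling back to the overall majority if every case is covered -- an assumption)
    uncovered = [label for f, label in training
                 if not any(matches(f) for matches, _ in rules)]
    pool = uncovered or [label for _, label in training]
    default = max(set(pool), key=pool.count)

    def classify(features):
        for matches, cls in ordered:      # first matching rule is the operative one
            if matches(features):
                return cls
        return default                    # no rule applies
    return classify
```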

Decision Tree - Revision

The decision tree builder algorithm discovers rules for classifying instances.

At each step, it needs to decide which attribute to test at that point in the tree; a measure of ‘information gain’ can be used.

The output is a decision tree based on the ‘training’ instances, evaluated with separate “test” instances.

Leaf nodes which have a small coverage may be pruned if the error rate is small for the pruned tree.

Pruning example (from W & F)

Health plan contribution?
    none → 4 bad, 2 good
    half → 1 bad, 1 good
    full → 4 bad, 2 good

We replace the subtree with the single leaf: Bad (14, 5)
    (14 = number of instances, 5 = number of errors)
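The replacement can be checked arithmetically: labelling each of the three leaves with its majority class costs 2 + 1 + 2 = 5 errors, and a single 'Bad' leaf over all 14 instances also costs 5, so the smaller tree is no worse on this data. A sketch of that raw-count comparison (note that C4.5-style pruning actually uses a pessimistic error estimate rather than raw training counts):

```python
from collections import Counter

def errors_as_single_leaf(class_counts):
    """Errors made if this set is covered by one leaf labelled with its majority class."""
    return sum(class_counts.values()) - max(class_counts.values())

# Leaf class counts under the 'Health plan contribution' test (from W & F):
branches = {'none': Counter(bad=4, good=2),
            'half': Counter(bad=1, good=1),
            'full': Counter(bad=4, good=2)}

subtree_errors = sum(errors_as_single_leaf(c) for c in branches.values())  # 2 + 1 + 2 = 5
merged = sum(branches.values(), Counter())                                 # 9 bad, 5 good
leaf_errors = errors_as_single_leaf(merged)                                # 5 errors / 14 instances

# Replacing the subtree with the leaf 'Bad' does not increase the error count,
# which matches the "14, 5" annotation on the slide.
print(subtree_errors, leaf_errors)   # 5 5
```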

Decision trees v classification rules

• Decision trees can be used for prediction or interpretation.
  – Prediction: compare an unclassified instance against the tree and predict what class it is in (with an error estimate)
  – Interpretation: examine the tree and try to understand why instances end up in the class they are in.
• Rule sets are often better for interpretation.
  – ‘Small’, accurate rules can be examined, even if the overall accuracy of the rule set is poor.

Self Check

• You should be able to:
  – Describe how decision trees are built from a set of instances.
  – Build a decision tree based on a given attribute.
  – Explain what the ‘training’ and ‘test’ sets are for.
  – Explain what “Supervised” means, and why classification is an example of supervised ML.
