Classification
Jan 14, 2016
Supervised vs. Unsupervised Learning
Supervised learning (classification)
• Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
• New data is classified based on the training set
Unsupervised learning (clustering)
• The class labels of the training data are unknown
• Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Prediction Problems: Classification vs. Numeric Prediction
Classification
• predicts categorical class labels (discrete or nominal)
• classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
Numeric prediction
• models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications
• Credit/loan approval
• Medical diagnosis: is a tumor cancerous or benign?
• Fraud detection: is a transaction fraudulent?
• Web page categorization: which category does a page belong to?
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
• The set of tuples used for model construction is the training set
• The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: classifying future or unknown objects
• Estimate the accuracy of the model
• The known label of each test sample is compared with the classified result from the model
• Accuracy rate is the percentage of test set samples that are correctly classified by the model
• The test set is independent of the training set (otherwise overfitting)
• If the accuracy is acceptable, use the model to classify new data
Note: if the test set is used to select models, it is called a validation (test) set
Process (1): Model Construction
Training data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

A classification algorithm produces the classifier (model):

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
Process (2): Using the Model in Prediction
The classifier is applied to the testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Then to unseen data: (Jeff, Professor, 4). Tenured?
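As a minimal sketch of this two-step process (Python, with the rule from the slide hard-coded as the "model"; the function and variable names are illustrative, not from the slides):

def predict_tenured(rank, years):
    # Model learned in Process (1):
    # IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank == "Professor" or years > 6 else "no"

# Process (2): estimate accuracy on the independent test set
test_set = [("Tom", "Assistant Prof", 2, "no"),
            ("Merlisa", "Associate Prof", 7, "no"),
            ("George", "Professor", 5, "yes"),
            ("Joseph", "Assistant Prof", 7, "yes")]
correct = sum(predict_tenured(rank, years) == label
              for _, rank, years, label in test_set)
print(correct / len(test_set))        # 0.75: Merlisa is misclassified

# If the accuracy is acceptable, classify unseen data
print(predict_tenured("Professor", 4))  # Jeff -> 'yes'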
Decision Tree Induction: An Example
Training data set: Buys_computer. The data set follows an example of Quinlan's ID3.

age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31...40  high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31...40  low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31...40  medium  no       excellent      yes
31...40  high    yes      fair           yes
>40      medium  no       excellent      no

Resulting tree:

age <= 30: student?
|   student = no: no
|   student = yes: yes
age 31..40: yes
age > 40: credit rating?
|   credit_rating = excellent: no
|   credit_rating = fair: yes
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
• The tree is constructed in a top-down, recursive, divide-and-conquer manner
• At the start, all the training examples are at the root
• Attributes are categorical (continuous-valued attributes are discretized in advance)
• Examples are partitioned recursively based on selected attributes
• Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
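A minimal sketch of this greedy procedure in Python, assuming examples are (attribute-dict, label) pairs; the attribute-selection heuristic is passed in rather than fixed, since any measure such as information gain can be plugged in:

from collections import Counter

def build_tree(examples, attributes, choose_attribute):
    classes = [label for _, label in examples]
    # Stop: all examples share a class, or no attributes remain to test
    if len(set(classes)) == 1 or not attributes:
        return Counter(classes).most_common(1)[0][0]   # leaf = majority class
    best = choose_attribute(examples, attributes)      # heuristic selection
    rest = [a for a in attributes if a != best]
    # Partition the examples recursively on the selected attribute
    tree = {best: {}}
    for value in {attrs[best] for attrs, _ in examples}:
        subset = [(a, l) for a, l in examples if a[best] == value]
        tree[best][value] = build_tree(subset, rest, choose_attribute)
    return tree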
Classification Example: MBA Application
Plan
Discussion on classification using a data mining tool (it's suggested to work with your final project teammates).
Decisions on Admissions
Download the data mining tool (Weka.zip) and an admission dataset (ABC.csv) from the Course Page.
Work as a team to make decisions on admissions to the MBA program at ABC University. Remember to save your file.
• Assuming that each record (row) represents a student, use the provided information (GPA, GMAT, years of work experience) to make admission decisions.
• If you decide to admit the student, fill in Yes in the Admission column; otherwise, fill in No to reject the application.
Decisions on Admissions (cont.)
Before using Weka, write down your criteria for judging the applications.
Decision Tree
A machine learning (data mining) technique for:
• making decisions
• generating interpretable rules that can be re-used for future predictions
• automatic and faster responses (decisions)
Using Weka
• Extract Weka.zip to the desktop
• Run "RunWeka3-4-6.bat" from the Weka folder on your desktop
• Open the ABC.csv with your decisions from Weka
  • Set filetype to CSV
Using Weka (cont.)
• Click the Classify tab
• Choose the J48 classifier (below trees)
• Set the Test options to "Use training set"
• Enable "Output predictions" in More options
• Click Start to run
Using Weka (cont.)
Read the classification output report from Classifier output
Using Weka (cont.)
Use the decision tree generated from the tool for your write-up (from J48 pruned tree)
Using Weka (cont.)
The generated tree for my example is:

GMAT <= 390: No (9.0)
GMAT > 390
|   Exp <= 2
|   |   GMAT <= 450: No (2.0)
|   |   GMAT > 450: Yes (6.0/1.0)
|   Exp > 2: Yes (13.0)
(Diagram: the same tree drawn graphically, with Yes/No branches from the decision nodes GMAT > 390, Exp > 2, and GMAT > 450 leading to Admitted and Rejected leaves.)
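Because the tree is just nested tests, it can be re-used directly for future predictions; a sketch in Python (attribute names gmat and exp taken from the example above):

def admit(gmat, exp):
    # J48 pruned tree from the example, encoded as nested tests
    if gmat <= 390:
        return "No"                       # rejected
    if exp > 2:
        return "Yes"                      # admitted
    return "Yes" if gmat > 450 else "No"  # Exp <= 2: decide on GMAT again

print(admit(620, 4))  # -> Yes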
Using Weka (cont.)
(Screenshot: the performance section of the classifier output, with one incorrect prediction highlighted.)
Using Weka (cont.)
Accuracy rate for decision tree prediction
= (number of correct predictions) / (total observations)
= 29 / 30
= 96.6667%
Using Weka (cont.)
Share your criteria (results) with other teams.
Simple algorithms often work very well!
There are many kinds of simple structure, e.g.:
• One attribute does all the work
• Attributes contribute equally and independently
• A decision tree that tests a few attributes
• Calculate distance from training instances
• Result depends on a linear combination of attributes
Success of a method depends on the domain
• Data mining is an experimental science
OneR: One attribute does all the work
Learn a 1-level "decision tree", i.e., rules that all test one particular attribute.
Basic version:
• One branch for each value
• Each branch assigns the most frequent class
• Error rate: proportion of instances that don't belong to the majority class of their corresponding branch
• Choose the attribute with the smallest error rate
For each attribute,
    For each value of the attribute, make a rule as follows:
        count how often each class appears
        find the most frequent class
        make the rule assign that class to this attribute-value
    Calculate the error rate of this attribute's rules
Choose the attribute with the smallest error rate
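A minimal Python sketch of this pseudocode, assuming examples are (attribute-dict, class) pairs; names are illustrative:

from collections import Counter, defaultdict

def one_r(examples, attributes):
    best = None
    for attr in attributes:
        counts = defaultdict(Counter)            # value -> class frequencies
        for attrs, label in examples:
            counts[attrs[attr]][label] += 1
        # Rule: each value predicts its most frequent class
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        # Errors: instances not in the majority class of their branch
        errors = sum(sum(c.values()) - max(c.values())
                     for c in counts.values())
        if best is None or errors < best[0]:
            best = (errors, attr, rules)
    return best  # (errors, attribute, rules) with the smallest error rate

On the weather data below, this would pick outlook or humidity (both make 4/14 errors; ties are broken by attribute order).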
Weather data:

Outlook   Temp  Humidity  Wind   Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No
Attribute  Rules            Errors  Total errors
Outlook    Sunny -> No      2/5     4/14
           Overcast -> Yes  0/4
           Rainy -> Yes     2/5
Temp       Hot -> No*       2/4     5/14
           Mild -> Yes      2/6
           Cool -> Yes      1/4
Humidity   High -> No       3/7     4/14
           Normal -> Yes    1/7
Wind       False -> Yes     2/8     5/14
           True -> No*      3/6

* indicates a tie
Use OneR
• Open file weather.nominal.arff
• Choose the OneR rule learner (rules>OneR)
• Look at the rule
OneR: One attribute does all the work
An incredibly simple method, described in 1993: "Very Simple Classification Rules Perform Well on Most Commonly Used Datasets" (Rob Holte, Alberta, Canada)
• Experimental evaluation on 16 datasets
• Used cross-validation
• Simple rules often outperformed far more complex methods
How can it work so well?
• Some datasets really are simple
• Some are so small/noisy/complex that nothing can be learned from them!
Any machine learning method may "overfit" the training data by producing a classifier that fits the training data too tightly.
Such a classifier works well on the training data but not on independent test data.
Remember the "User classifier"? Imagine tediously putting a tiny circle around every single training data point.
Overfitting is a general problem; we illustrate it with OneR.
Numeric attributes
Weather data with numeric temperature and humidity:

Outlook   Temp  Humidity  Wind   Play
Sunny     85    85        False  No
Sunny     80    90        True   No
Overcast  83    86        False  Yes
Rainy     75    80        False  Yes
...       ...   ...       ...    ...

OneR's rules for the numeric Temp attribute branch on every distinct value:

Attribute  Rules      Errors  Total errors
Temp       85 -> No   0/1     0/14
           80 -> Yes  0/1
           83 -> Yes  0/1
           75 -> No   0/1
           ...        ...

Zero training error, but the rule just memorizes the data. OneR has a parameter that limits the complexity of such rules.
Experiment with OneR
• Open file weather.numeric.arff
• Choose the OneR rule learner (rules>OneR)
• The resulting rule is based on the outlook attribute, so remove outlook
• Now the rule is based on the humidity attribute:
      humidity:  < 82.5  -> yes
                 >= 82.5 -> no
  (10/14 instances correct)
Experiment with diabetes dataset
• Open file diabetes.arff
• Choose the ZeroR rule learner (rules>ZeroR); use cross-validation: 65.1%
• Choose the OneR rule learner (rules>OneR); use cross-validation: 72.1%
• Look at the rule (plas = plasma glucose concentration)
• Change the minBucketSize parameter to 1: 54.9%
• Evaluate on the training set instead: 86.6%; look at the rule again
Overfitting is a general phenomenon that damages all ML methods.
It is one reason why you must never evaluate on the training set.
Overfitting can occur more generally: e.g., if you try many ML methods and choose the best one for your data, you cannot expect to get the same performance on new test data.
Divide the data into training, test, and validation sets? (A sketch follows.)
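A minimal sketch of such a three-way split; the 60/20/20 proportions and function name are illustrative assumptions, not from the slides:

import random

def three_way_split(data, train=0.6, validation=0.2, seed=0):
    data = list(data)
    random.Random(seed).shuffle(data)           # shuffle reproducibly
    n_train = int(len(data) * train)
    n_val = int(len(data) * validation)
    return (data[:n_train],                     # fit models here
            data[n_train:n_train + n_val],      # choose between models here
            data[n_train + n_val:])             # report final accuracy here

Select among methods on the validation set; touch the test set only once, for the final estimate.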
(OneR: One attribute does all the work)
Opposite strategy: use all the attributes. This is the "Naïve Bayes" method.
Two assumptions: attributes are
• equally important a priori
• statistically independent (given the class value), i.e., knowing the value of one attribute says nothing about the value of another (if the class is known)
The independence assumption is never correct!
But ... it often works well in practice.
Probability of event H given evidence E (here H is the class, E is the instance):

Pr[H | E] = Pr[E | H] × Pr[H] / Pr[E]

Pr[H] is the a priori probability of H
• the probability of the event before evidence is seen
Pr[H | E] is the a posteriori probability of H
• the probability of the event after evidence is seen
"Naïve" assumption: the evidence splits into parts that are independent:

Pr[H | E] = Pr[E1 | H] × Pr[E2 | H] × ... × Pr[En | H] × Pr[H] / Pr[E]

Thomas Bayes, British mathematician, 1702–1761
Weather data: class counts per attribute value, and the corresponding fractions.

              Counts          Fractions
              Yes   No        Yes    No
Outlook
  Sunny        2     3        2/9    3/5
  Overcast     4     0        4/9    0/5
  Rainy        3     2        3/9    2/5
Temperature
  Hot          2     2        2/9    2/5
  Mild         4     2        4/9    2/5
  Cool         3     1        3/9    1/5
Humidity
  High         3     4        3/9    4/5
  Normal       6     1        6/9    1/5
Wind
  False        6     2        6/9    2/5
  True         3     3        3/9    3/5
Play           9     5        9/14   5/14
A new day:

Outlook  Temp  Humidity  Wind  Play
Sunny    Cool  High      True  ?

Likelihood of the two classes:
For "yes" = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
For "no"  = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206

Conversion into a probability by normalization:
P("yes") = 0.0053 / (0.0053 + 0.0206) = 0.205
P("no")  = 0.0206 / (0.0053 + 0.0206) = 0.795
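The same arithmetic as a short Python check (fractions read off the counts table above):

p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)  # sunny, cool, high, true | yes
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)  # sunny, cool, high, true | no
total = p_yes + p_no                            # stands in for Pr[E]
print(round(p_yes, 4), round(p_no, 4))          # 0.0053 0.0206
print(round(p_yes / total, 3), round(p_no / total, 3))  # 0.205 0.795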
Evidence E
E = (Outlook = Sunny, Temperature = Cool, Humidity = High, Windy = True)

Probability of class "yes":

Pr[yes | E] = Pr[Outlook = Sunny | yes]
              × Pr[Temperature = Cool | yes]
              × Pr[Humidity = High | yes]
              × Pr[Windy = True | yes]
              × Pr[yes] / Pr[E]
            = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr[E]
Use Naïve Bayes
• Open file weather.nominal.arff
• Choose the Naïve Bayes method (bayes>NaiveBayes)
• Look at the classifier
• Avoid zero frequencies: start all counts at 1
“Naïve Bayes”: all attributes contribute equally and independently
Works surprisingly well, even if the independence assumption is clearly violated.
Why? Classification doesn't need accurate probability estimates, so long as the greatest probability is assigned to the correct class.
Adding redundant attributes causes problems (e.g., identical attributes): use attribute selection.
Top-down: recursive divide-and-conquer
• Select an attribute for the root node
  – Create a branch for each possible attribute value
• Split instances into subsets
  – One for each branch extending from the node
• Repeat recursively for each branch
  – using only instances that reach the branch
• Stop
  – if all instances have the same class
Which attribute to select?
Which is the best attribute?
Aim: to get the smallest tree.
Heuristic: choose the attribute that produces the "purest" nodes, i.e., the greatest information gain.
Information theory: measure information in bits (Claude Shannon, American mathematician and scientist, 1916–2001):

entropy(p1, p2, ..., pn) = -p1 log p1 - p2 log p2 - ... - pn log pn

Information gain: the amount of information gained by knowing the value of the attribute
= (entropy of distribution before the split) - (entropy of distribution after it)
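A minimal sketch of both measures in Python (log base 2 gives bits; class counts are used instead of probabilities for convenience):

from math import log2

def entropy(counts):
    # entropy(p1, ..., pn) = -p1 log p1 - ... - pn log pn, in bits
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

def information_gain(before, branches):
    # (entropy before the split) - (weighted entropy after it)
    n = sum(before)
    after = sum(sum(b) / n * entropy(b) for b in branches)
    return entropy(before) - after

# Weather data, split on outlook: [yes, no] counts per branch
print(round(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # 0.247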
Which attribute to select?
gain(outlook) = 0.247 bits, gain(humidity) = 0.152 bits, gain(windy) = 0.048 bits, gain(temperature) = 0.029 bits
Continue to split …
gain(temperature) = 0.571 bits
gain(humidity)    = 0.971 bits
gain(windy)       = 0.020 bits
Use J48 on the weather data
• Open file weather.nominal.arff
• Choose the J48 decision tree learner (trees>J48)
• Look at the tree
• Use the right-click menu to visualize the tree
J48: "top-down induction of decision trees"
• Soundly based in information theory
• Produces a tree that people can understand
• Many different criteria for attribute selection rarely make a large difference
• Needs further modification to be useful in practice
Pruning Decision Trees
Highly branching attributes — Extreme case: ID code
How to prune?
• Don't continue splitting if the nodes get very small (J48 minNumObj parameter, default value 2)
• Build the full tree and then work back from the leaves, applying a statistical test at each stage (confidenceFactor parameter, default value 0.25)
• Sometimes it's good to prune an interior node, raising the subtree beneath it up one level (subtreeRaising parameter, default true)
Messy ... complicated ... not particularly illuminating.
Overfitting (again!)
Sometimes simplifying a decision tree gives better results.
• Open file diabetes.arff
• Choose the J48 decision tree learner (trees>J48)
• Prunes by default: 73.8% accuracy; the tree has 20 leaves, 39 nodes
• Turn off pruning: 72.7%; 22 leaves, 43 nodes
Extreme example: breast-cancer.arff
• Default (pruned): 75.5% accuracy; the tree has 4 leaves, 6 nodes
• Unpruned: 69.6%; 152 leaves, 179 nodes
C4.5/J48 is a popular early machine learning method
• Many different pruning methods, which mainly change the size of the pruned tree
• Pruning is a general technique that can apply to structures other than trees (e.g., decision rules)
• Univariate vs. multivariate decision trees: single vs. compound tests at the nodes
• From C4.5 to J48 (Ross Quinlan, Australian computer scientist)
“Rote learning”: simplest form of learning
To classify a new instance, search the training set for the one that's "most like" it
• the instances themselves represent the "knowledge"
• lazy learning: do nothing until you have to make predictions
"Instance-based" learning = "nearest-neighbor" learning
Search training set for one that’s “most like” it
Need a similarity function:
• Regular ("Euclidean") distance? (sum of squares of differences)
• Manhattan ("city block") distance? (sum of absolute differences)
• Nominal attributes? Distance = 1 if different, 0 if same
• Normalize the attributes to lie between 0 and 1?
What about noisy instances?
• Use k nearest neighbors: choose the majority class among several neighbors (k of them)
• In Weka: lazy>IBk (instance-based learning)
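A minimal k-nearest-neighbor sketch in Python, using Euclidean distance and majority vote; attributes are assumed numeric and already normalized to [0, 1]:

from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(training, instance, k=5):
    # Majority class among the k training instances nearest to `instance`
    nearest = sorted(training, key=lambda pair: dist(pair[0], instance))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Example: training = [((0.2, 0.7), "yes"), ((0.9, 0.1), "no"), ...]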
Investigate effect of changing k
Glass dataset; lazy>IBk with k = 1, 5, 20; 10-fold cross-validation:

k = 1: 70.6%    k = 5: 67.8%    k = 20: 65.4%
Often very accurate ... but slow:
• scans the entire training data to make each prediction
• sophisticated data structures can make this faster
Assumes all attributes are equally important
• Remedy: attribute selection or weights
Remedies against noisy instances:
• Majority vote over the k nearest neighbors
• Weight instances according to prediction accuracy
• Identify reliable "prototypes" for each class
Statisticians have used k-NN since the 1950s
• If the training set size n → ∞ and k → ∞ with k/n → 0, the error approaches the minimum