Classification
Jan 14, 2016
Supervised vs. Unsupervised Learning
Supervised learning (classification)
• Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
• New data is classified based on the training set
Unsupervised learning (clustering)
• The class labels of the training data are unknown
• Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Prediction Problems: Classification vs. Numeric Prediction
Classification
• predicts categorical class labels (discrete or nominal)
• classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
Numeric prediction
• models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications
• Credit/loan approval
• Medical diagnosis: is a tumor cancerous or benign?
• Fraud detection: is a transaction fraudulent?
• Web page categorization: which category does a page belong to?
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
• The set of tuples used for model construction is the training set
• The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: classifying future or unknown objects
• Estimate the accuracy of the model
• The known label of each test sample is compared with the classified result from the model
• Accuracy rate is the percentage of test set samples that are correctly classified by the model
• The test set is independent of the training set (otherwise overfitting)
• If the accuracy is acceptable, use the model to classify new data
Note: if the test set is used to select models, it is called a validation (test) set
Process (1): Model Construction
Training data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

A classification algorithm produces the classifier (model):

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
Process (2): Using the Model in Prediction
The classifier is applied to the testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Then to unseen data: (Jeff, Professor, 4). Tenured?
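As a minimal sketch of this two-step process (Python, with the rule from the slide hard-coded as the "model"; the function and variable names are illustrative, not from the slides):

def predict_tenured(rank, years):
    # Model learned in Process (1):
    # IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank == "Professor" or years > 6 else "no"

# Process (2): estimate accuracy on the independent test set
test_set = [("Tom", "Assistant Prof", 2, "no"),
            ("Merlisa", "Associate Prof", 7, "no"),
            ("George", "Professor", 5, "yes"),
            ("Joseph", "Assistant Prof", 7, "yes")]
correct = sum(predict_tenured(rank, years) == label
              for _, rank, years, label in test_set)
print(correct / len(test_set))        # 0.75: Merlisa is misclassified

# If the accuracy is acceptable, classify unseen data
print(predict_tenured("Professor", 4))  # Jeff -> 'yes'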
Decision Tree Induction: An Example
Training data set: Buys_computer. The data set follows an example of Quinlan's ID3.

age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31...40  high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31...40  low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31...40  medium  no       excellent      yes
31...40  high    yes      fair           yes
>40      medium  no       excellent      no

Resulting tree:

age <= 30: student?
|   student = no: no
|   student = yes: yes
age 31..40: yes
age > 40: credit rating?
|   credit_rating = excellent: no
|   credit_rating = fair: yes
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
• The tree is constructed in a top-down, recursive, divide-and-conquer manner
• At the start, all the training examples are at the root
• Attributes are categorical (continuous-valued attributes are discretized in advance)
• Examples are partitioned recursively based on selected attributes
• Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
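A minimal sketch of this greedy procedure in Python, assuming examples are (attribute-dict, label) pairs; the attribute-selection heuristic is passed in rather than fixed, since any measure such as information gain can be plugged in:

from collections import Counter

def build_tree(examples, attributes, choose_attribute):
    classes = [label for _, label in examples]
    # Stop: all examples share a class, or no attributes remain to test
    if len(set(classes)) == 1 or not attributes:
        return Counter(classes).most_common(1)[0][0]   # leaf = majority class
    best = choose_attribute(examples, attributes)      # heuristic selection
    rest = [a for a in attributes if a != best]
    # Partition the examples recursively on the selected attribute
    tree = {best: {}}
    for value in {attrs[best] for attrs, _ in examples}:
        subset = [(a, l) for a, l in examples if a[best] == value]
        tree[best][value] = build_tree(subset, rest, choose_attribute)
    return tree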
Classification Example: MBA Application
Plan
Discussion on classification using a data mining tool (it's suggested to work with your final project teammates).
Decisions on Admissions
Download the data mining tool (Weka.zip) and an admission dataset (ABC.csv) from the Course Page.
Work as a team to make decisions on admissions to the MBA program at ABC University. Remember to save your file.
• Assuming that each record (row) represents a student, use the provided information (GPA, GMAT, years of work experience) to make admission decisions.
• If you decide to admit the student, fill in Yes in the Admission column; otherwise, fill in No to reject the application.
Decisions on Admissions (cont.)
Before using Weka, write down your criteria for judging the applications.
Decision Tree
A machine learning (data mining) technique for:
• making decisions
• generating interpretable rules that can be re-used for future predictions
• automatic and faster responses (decisions)
Using Weka
• Extract Weka.zip to the desktop
• Run "RunWeka3-4-6.bat" from the Weka folder on your desktop
• Open the ABC.csv with your decisions from Weka
  • Set filetype to CSV
Using Weka (cont.)
• Click the Classify tab
• Choose the J48 classifier (below trees)
• Set the Test options to "Use training set"
• Enable "Output predictions" in More options
• Click Start to run
Using Weka (cont.)
Read the classification output report from Classifier output
Using Weka (cont.)
Use the decision tree generated from the tool for your write-up (from J48 pruned tree)
Using Weka (cont.)
The generated tree for my example is:

GMAT <= 390: No (9.0)
GMAT > 390
|   Exp <= 2
|   |   GMAT <= 450: No (2.0)
|   |   GMAT > 450: Yes (6.0/1.0)
|   Exp > 2: Yes (13.0)
(Diagram: the same tree drawn graphically, with Yes/No branches from the decision nodes GMAT > 390, Exp > 2, and GMAT > 450 leading to Admitted and Rejected leaves.)
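Because the tree is just nested tests, it can be re-used directly for future predictions; a sketch in Python (attribute names gmat and exp taken from the example above):

def admit(gmat, exp):
    # J48 pruned tree from the example, encoded as nested tests
    if gmat <= 390:
        return "No"                       # rejected
    if exp > 2:
        return "Yes"                      # admitted
    return "Yes" if gmat > 450 else "No"  # Exp <= 2: decide on GMAT again

print(admit(620, 4))  # -> Yes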
Using Weka (cont.)
(Screenshot: the performance section of the classifier output, with one incorrect prediction highlighted.)
Using Weka (cont.)
Accuracy rate for decision tree prediction
= (number of correct predictions) / (total observations)
= 29 / 30
= 96.6667%
Using Weka (cont.)
Share your criteria (results) with other teams.
Simple algorithms often work very well!
There are many kinds of simple structure, e.g.:
• One attribute does all the work
• Attributes contribute equally and independently
• A decision tree that tests a few attributes
• Calculate distance from training instances
• Result depends on a linear combination of attributes
Success of a method depends on the domain
• Data mining is an experimental science
OneR: One attribute does all the work
Learn a 1-level "decision tree", i.e., rules that all test one particular attribute.
Basic version:
• One branch for each value
• Each branch assigns the most frequent class
• Error rate: proportion of instances that don't belong to the majority class of their corresponding branch
• Choose the attribute with the smallest error rate
For each attribute,
    For each value of the attribute, make a rule as follows:
        count how often each class appears
        find the most frequent class
        make the rule assign that class to this attribute-value
    Calculate the error rate of this attribute's rules
Choose the attribute with the smallest error rate
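A minimal Python sketch of this pseudocode, assuming examples are (attribute-dict, class) pairs; names are illustrative:

from collections import Counter, defaultdict

def one_r(examples, attributes):
    best = None
    for attr in attributes:
        counts = defaultdict(Counter)            # value -> class frequencies
        for attrs, label in examples:
            counts[attrs[attr]][label] += 1
        # Rule: each value predicts its most frequent class
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        # Errors: instances not in the majority class of their branch
        errors = sum(sum(c.values()) - max(c.values())
                     for c in counts.values())
        if best is None or errors < best[0]:
            best = (errors, attr, rules)
    return best  # (errors, attribute, rules) with the smallest error rate

On the weather data below, this would pick outlook or humidity (both make 4/14 errors; ties are broken by attribute order).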
Weather data:

Outlook   Temp  Humidity  Wind   Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No
Attribute  Rules            Errors  Total errors
Outlook    Sunny -> No      2/5     4/14
           Overcast -> Yes  0/4
           Rainy -> Yes     2/5
Temp       Hot -> No*       2/4     5/14
           Mild -> Yes      2/6
           Cool -> Yes      1/4
Humidity   High -> No       3/7     4/14
           Normal -> Yes    1/7
Wind       False -> Yes     2/8     5/14
           True -> No*      3/6

* indicates a tie
Use OneR
• Open file weather.nominal.arff
• Choose the OneR rule learner (rules>OneR)
• Look at the rule
OneR: One attribute does all the work
An incredibly simple method, described in 1993: "Very Simple Classification Rules Perform Well on Most Commonly Used Datasets" (Rob Holte, Alberta, Canada)
• Experimental evaluation on 16 datasets
• Used cross-validation
• Simple rules often outperformed far more complex methods
How can it work so well?
• Some datasets really are simple
• Some are so small/noisy/complex that nothing can be learned from them!
Any machine learning method may "overfit" the training data by producing a classifier that fits the training data too tightly.
Such a classifier works well on the training data but not on independent test data.
Remember the "User classifier"? Imagine tediously putting a tiny circle around every single training data point.
Overfitting is a general problem; we illustrate it with OneR.
Numeric attributes
Weather data with numeric temperature and humidity:

Outlook   Temp  Humidity  Wind   Play
Sunny     85    85        False  No
Sunny     80    90        True   No
Overcast  83    86        False  Yes
Rainy     75    80        False  Yes
...       ...   ...       ...    ...

OneR's rules for the numeric Temp attribute branch on every distinct value:

Attribute  Rules      Errors  Total errors
Temp       85 -> No   0/1     0/14
           80 -> Yes  0/1
           83 -> Yes  0/1
           75 -> No   0/1
           ...        ...

Zero training error, but the rule just memorizes the data. OneR has a parameter that limits the complexity of such rules.
Experiment with OneR
• Open file weather.numeric.arff
• Choose the OneR rule learner (rules>OneR)
• The resulting rule is based on the outlook attribute, so remove outlook
• Now the rule is based on the humidity attribute:
      humidity:  < 82.5  -> yes
                 >= 82.5 -> no
  (10/14 instances correct)
Experiment with diabetes dataset
• Open file diabetes.arff
• Choose the ZeroR rule learner (rules>ZeroR); use cross-validation: 65.1%
• Choose the OneR rule learner (rules>OneR); use cross-validation: 72.1%
• Look at the rule (plas = plasma glucose concentration)
• Change the minBucketSize parameter to 1: 54.9%
• Evaluate on the training set instead: 86.6%; look at the rule again
Overfitting is a general phenomenon that damages all ML methods.
It is one reason why you must never evaluate on the training set.
Overfitting can occur more generally: e.g., if you try many ML methods and choose the best one for your data, you cannot expect to get the same performance on new test data.
Divide the data into training, test, and validation sets? (A sketch follows.)
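A minimal sketch of such a three-way split; the 60/20/20 proportions and function name are illustrative assumptions, not from the slides:

import random

def three_way_split(data, train=0.6, validation=0.2, seed=0):
    data = list(data)
    random.Random(seed).shuffle(data)           # shuffle reproducibly
    n_train = int(len(data) * train)
    n_val = int(len(data) * validation)
    return (data[:n_train],                     # fit models here
            data[n_train:n_train + n_val],      # choose between models here
            data[n_train + n_val:])             # report final accuracy here

Select among methods on the validation set; touch the test set only once, for the final estimate.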
(OneR: One attribute does all the work)
Opposite strategy: use all the attributes. This is the "Naïve Bayes" method.
Two assumptions: attributes are
• equally important a priori
• statistically independent (given the class value), i.e., knowing the value of one attribute says nothing about the value of another (if the class is known)
The independence assumption is never correct!
But ... it often works well in practice.
Probability of event H given evidence E (here H is the class, E is the instance):

Pr[H | E] = Pr[E | H] × Pr[H] / Pr[E]

Pr[H] is the a priori probability of H
• the probability of the event before evidence is seen
Pr[H | E] is the a posteriori probability of H
• the probability of the event after evidence is seen
"Naïve" assumption: the evidence splits into parts that are independent:

Pr[H | E] = Pr[E1 | H] × Pr[E2 | H] × ... × Pr[En | H] × Pr[H] / Pr[E]

Thomas Bayes, British mathematician, 1702–1761
Weather data: class counts per attribute value, and the corresponding fractions.

              Counts          Fractions
              Yes   No        Yes    No
Outlook
  Sunny        2     3        2/9    3/5
  Overcast     4     0        4/9    0/5
  Rainy        3     2        3/9    2/5
Temperature
  Hot          2     2        2/9    2/5
  Mild         4     2        4/9    2/5
  Cool         3     1        3/9    1/5
Humidity
  High         3     4        3/9    4/5
  Normal       6     1        6/9    1/5
Wind
  False        6     2        6/9    2/5
  True         3     3        3/9    3/5
Play           9     5        9/14   5/14
A new day:

Outlook  Temp  Humidity  Wind  Play
Sunny    Cool  High      True  ?

Likelihood of the two classes:
For "yes" = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
For "no"  = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206

Conversion into a probability by normalization:
P("yes") = 0.0053 / (0.0053 + 0.0206) = 0.205
P("no")  = 0.0206 / (0.0053 + 0.0206) = 0.795
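The same arithmetic as a short Python check (fractions read off the counts table above):

p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)  # sunny, cool, high, true | yes
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)  # sunny, cool, high, true | no
total = p_yes + p_no                            # stands in for Pr[E]
print(round(p_yes, 4), round(p_no, 4))          # 0.0053 0.0206
print(round(p_yes / total, 3), round(p_no / total, 3))  # 0.205 0.795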
Evidence E
E = (Outlook = Sunny, Temperature = Cool, Humidity = High, Windy = True)

Probability of class "yes":

Pr[yes | E] = Pr[Outlook = Sunny | yes]
              × Pr[Temperature = Cool | yes]
              × Pr[Humidity = High | yes]
              × Pr[Windy = True | yes]
              × Pr[yes] / Pr[E]
            = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr[E]
Use Naïve Bayes
• Open file weather.nominal.arff
• Choose the Naïve Bayes method (bayes>NaiveBayes)
• Look at the classifier
• Avoid zero frequencies: start all counts at 1
“Naïve Bayes”: all attributes contribute equally and independently
Works surprisingly well, even if the independence assumption is clearly violated.
Why? Classification doesn't need accurate probability estimates, so long as the greatest probability is assigned to the correct class.
Adding redundant attributes causes problems (e.g., identical attributes): use attribute selection.
Top-down: recursive divide-and-conquer
• Select an attribute for the root node
  – Create a branch for each possible attribute value
• Split instances into subsets
  – One for each branch extending from the node
• Repeat recursively for each branch
  – using only instances that reach the branch
• Stop
  – if all instances have the same class
Which attribute to select?
Which is the best attribute?
Aim: to get the smallest tree.
Heuristic: choose the attribute that produces the "purest" nodes, i.e., the greatest information gain.
Information theory: measure information in bits (Claude Shannon, American mathematician and scientist, 1916–2001):

entropy(p1, p2, ..., pn) = -p1 log p1 - p2 log p2 - ... - pn log pn

Information gain: the amount of information gained by knowing the value of the attribute
= (entropy of distribution before the split) - (entropy of distribution after it)
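A minimal sketch of both measures in Python (log base 2 gives bits; class counts are used instead of probabilities for convenience):

from math import log2

def entropy(counts):
    # entropy(p1, ..., pn) = -p1 log p1 - ... - pn log pn, in bits
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

def information_gain(before, branches):
    # (entropy before the split) - (weighted entropy after it)
    n = sum(before)
    after = sum(sum(b) / n * entropy(b) for b in branches)
    return entropy(before) - after

# Weather data, split on outlook: [yes, no] counts per branch
print(round(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # 0.247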
Which attribute to select?
gain(outlook) = 0.247 bits, gain(humidity) = 0.152 bits, gain(windy) = 0.048 bits, gain(temperature) = 0.029 bits
Continue to split …
gain(temperature) = 0.571 bits
gain(humidity)    = 0.971 bits
gain(windy)       = 0.020 bits
Use J48 on the weather data
• Open file weather.nominal.arff
• Choose the J48 decision tree learner (trees>J48)
• Look at the tree
• Use the right-click menu to visualize the tree
J48: "top-down induction of decision trees"
• Soundly based in information theory
• Produces a tree that people can understand
• Many different criteria for attribute selection rarely make a large difference
• Needs further modification to be useful in practice
Pruning Decision Trees
Highly branching attributes — Extreme case: ID code
How to prune?
• Don't continue splitting if the nodes get very small (J48 minNumObj parameter, default value 2)
• Build the full tree and then work back from the leaves, applying a statistical test at each stage (confidenceFactor parameter, default value 0.25)
• Sometimes it's good to prune an interior node, raising the subtree beneath it up one level (subtreeRaising parameter, default true)
Messy ... complicated ... not particularly illuminating.
Overfitting (again!)
Sometimes simplifying a decision tree gives better results.
• Open file diabetes.arff
• Choose the J48 decision tree learner (trees>J48)
• Prunes by default: 73.8% accuracy; the tree has 20 leaves, 39 nodes
• Turn off pruning: 72.7%; 22 leaves, 43 nodes
Extreme example: breast-cancer.arff
• Default (pruned): 75.5% accuracy; the tree has 4 leaves, 6 nodes
• Unpruned: 69.6%; 152 leaves, 179 nodes
C4.5/J48 is a popular early machine learning method
• Many different pruning methods, which mainly change the size of the pruned tree
• Pruning is a general technique that can apply to structures other than trees (e.g., decision rules)
• Univariate vs. multivariate decision trees: single vs. compound tests at the nodes
• From C4.5 to J48 (Ross Quinlan, Australian computer scientist)
“Rote learning”: simplest form of learning
To classify a new instance, search the training set for the one that's "most like" it
• the instances themselves represent the "knowledge"
• lazy learning: do nothing until you have to make predictions
"Instance-based" learning = "nearest-neighbor" learning
Search training set for one that’s “most like” it
Need a similarity function:
• Regular ("Euclidean") distance? (sum of squares of differences)
• Manhattan ("city block") distance? (sum of absolute differences)
• Nominal attributes? Distance = 1 if different, 0 if same
• Normalize the attributes to lie between 0 and 1?
What about noisy instances?
• Use k nearest neighbors: choose the majority class among several neighbors (k of them)
• In Weka: lazy>IBk (instance-based learning)
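A minimal k-nearest-neighbor sketch in Python, using Euclidean distance and majority vote; attributes are assumed numeric and already normalized to [0, 1]:

from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(training, instance, k=5):
    # Majority class among the k training instances nearest to `instance`
    nearest = sorted(training, key=lambda pair: dist(pair[0], instance))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Example: training = [((0.2, 0.7), "yes"), ((0.9, 0.1), "no"), ...]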
Investigate effect of changing k
Glass dataset; lazy>IBk with k = 1, 5, 20; 10-fold cross-validation:

k = 1: 70.6%    k = 5: 67.8%    k = 20: 65.4%
Often very accurate ... but slow:
• scans the entire training data to make each prediction
• sophisticated data structures can make this faster
Assumes all attributes are equally important
• Remedy: attribute selection or weights
Remedies against noisy instances:
• Majority vote over the k nearest neighbors
• Weight instances according to prediction accuracy
• Identify reliable "prototypes" for each class
Statisticians have used k-NN since the 1950s
• If the training set size n → ∞ and k → ∞ with k/n → 0, the error approaches the minimum