Classification: Alternative Techniques
Dr. Hui Xiong, Rutgers University
Classification: Alternative Techniques
Rule-based Classifier
Rule-Based Classifier
Classify records by using a collection of “if…then…” rules
Rule: (Condition) → y
– where Condition is a conjunction of attribute tests and y is the class label
– LHS: rule antecedent or condition
– RHS: rule consequent
– Examples of classification rules:
  (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
  (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No
Rule-based Classifier (Example)
Application of Rule-Based Classifier
A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule
The rule r1 covers a hawk => Bird
The rule r3 covers the grizzly bear => Mammal
Rule Coverage and Accuracy
Coverage of a rule:
– Fraction of records that satisfy the antecedent of a rule
Accuracy of a rule:
– Fraction of records that satisfy both the antecedent and consequent of a rule

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(Status=Single) → No
Coverage = 40%, Accuracy = 50%
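For concreteness, a small Python sketch (not from the original slides) that computes coverage and accuracy of (Status=Single) → No on the ten records above:

  # Sketch: computing rule coverage and accuracy over the training records.
  records = [
      {"Refund": "Yes", "Status": "Single",   "Income": 125, "Class": "No"},
      {"Refund": "No",  "Status": "Married",  "Income": 100, "Class": "No"},
      {"Refund": "No",  "Status": "Single",   "Income": 70,  "Class": "No"},
      {"Refund": "Yes", "Status": "Married",  "Income": 120, "Class": "No"},
      {"Refund": "No",  "Status": "Divorced", "Income": 95,  "Class": "Yes"},
      {"Refund": "No",  "Status": "Married",  "Income": 60,  "Class": "No"},
      {"Refund": "Yes", "Status": "Divorced", "Income": 220, "Class": "No"},
      {"Refund": "No",  "Status": "Single",   "Income": 85,  "Class": "Yes"},
      {"Refund": "No",  "Status": "Married",  "Income": 75,  "Class": "No"},
      {"Refund": "No",  "Status": "Single",   "Income": 90,  "Class": "Yes"},
  ]

  def covers(condition, record):
      # A rule covers a record if every test in its antecedent is satisfied.
      return all(record[attr] == value for attr, value in condition.items())

  def coverage_and_accuracy(condition, consequent, data):
      covered = [r for r in data if covers(condition, r)]
      coverage = len(covered) / len(data)
      accuracy = sum(r["Class"] == consequent for r in covered) / len(covered)
      return coverage, accuracy

  # (Status=Single) -> No: coverage = 4/10 = 40%, accuracy = 2/4 = 50%
  print(coverage_and_accuracy({"Status": "Single"}, "No", records))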
How does Rule-based Classifier Work?
A lemur triggers rule r3, so it is classified as a mammal
A turtle triggers both r4 and r5
A dogfish shark triggers none of the rules
Characteristics of Rule-Based Classifier
Mutually exclusive rules
– Classifier contains mutually exclusive rules if the rules are independent of each other
– Every record is covered by at most one rule
Exhaustive rules
– Classifier has exhaustive coverage if it accounts for every possible combination of attribute values
– Each record is covered by at least one rule
From Decision Trees To Rules
(figure: decision tree splitting on Refund {Yes, No}, then Marital Status {Single, Divorced} vs. {Married}, then Taxable Income < 80K vs. > 80K)

Classification Rules:
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No

Rules are mutually exclusive and exhaustive
Rule set contains as much information as the tree
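A minimal sketch of this tree-to-rules conversion; the nested-tuple tree encoding is a hypothetical representation chosen for illustration:

  # Sketch: extracting one classification rule per root-to-leaf path.
  tree = ("Refund", {
      "Yes": "No",
      "No": ("Marital Status", {
          "Single,Divorced": ("Taxable Income < 80K", {"True": "No", "False": "Yes"}),
          "Married": "No",
      }),
  })

  def tree_to_rules(node, conjuncts=()):
      if isinstance(node, str):                 # leaf: emit one rule for this path
          return [(conjuncts, node)]
      attribute, branches = node
      rules = []
      for value, child in branches.items():
          rules += tree_to_rules(child, conjuncts + ((attribute, value),))
      return rules

  for antecedent, label in tree_to_rules(tree):
      print(" AND ".join(f"{a}={v}" for a, v in antecedent), "==>", label)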
Rules Can Be Simplified
(figure: the same decision tree and the same 10-record training data table as on the previous slides)
Initial Rule: (Refund=No) ∧ (Status=Married) → No
Simplified Rule: (Status=Married) → No
Effect of Rule Simplification
Rules are no longer mutually exclusive
– A record may trigger more than one rule
– Solution?
  Ordered rule set
  Unordered rule set – use voting schemes
Rules are no longer exhaustive
– A record may not trigger any rules
– Solution?
  Use a default class
Ordered Rule Set
Rules are rank-ordered according to their priority
– An ordered rule set is known as a decision list
When a test record is presented to the classifier
– It is assigned to the class label of the highest-ranked rule it has triggered
– If none of the rules fired, it is assigned to the default class
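A minimal sketch of decision-list classification; the rules and default class here are hypothetical, for illustration only:

  # Sketch: first-match classification over an ordered rule set (decision list).
  rules = [
      ({"Refund": "Yes"}, "No"),                      # highest priority
      ({"Refund": "No", "Status": "Married"}, "No"),
  ]
  DEFAULT_CLASS = "Yes"

  def classify(record):
      for condition, label in rules:
          if all(record.get(a) == v for a, v in condition.items()):
              return label                            # highest-ranked rule fired
      return DEFAULT_CLASS                            # no rule fired

  print(classify({"Refund": "No", "Status": "Single"}))  # -> default class "Yes"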
Rule Ordering Schemes
Rule-based ordering
– Individual rules are ranked based on their quality/priority
Class-based ordering
– Rules that belong to the same class appear together
Building Classification Rules
Direct Method:
– Extract rules directly from data
– e.g.: RIPPER, CN2, 1R, and AQ
Indirect Method:
– Extract rules from other classification models (e.g., decision trees, neural networks, SVM, etc.)
– e.g.: C4.5rules
Direct Method: Sequential Covering
Example of Sequential Covering
(figure: sequential covering, (i) original data and (ii) Step 1)
Example of Sequential Covering…
(figure: (iii) Step 2, rule R1; (iv) Step 3, rules R1 and R2)
Aspects of Sequential Covering
Rule Growing
Rule Evaluation
Instance Elimination
Stopping Criterion
Rule Pruning
Rule Growing
Two common strategies
(figure: (a) General-to-specific rule growing: conjuncts such as Refund=No, Status=Single, Status=Divorced, Status=Married, and Income>80K are added to the empty rule { } (Yes: 3, No: 4) and evaluated by the class counts of the records they cover)
Rule Evaluation
Evaluation metric determines which conjunct should be added during rule growing
– Accuracy = n_c / n
– Laplace = (n_c + 1) / (n + k)
– M-estimate = (n_c + k·p) / (n + k)
where
n : number of instances covered by rule
n_c : number of instances of class c covered by rule
k : number of classes
p : prior probability of class c
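These metrics are simple to compute; a sketch with hypothetical counts:

  # Sketch: the three evaluation metrics above.
  def accuracy(nc, n):
      return nc / n

  def laplace(nc, n, k):
      return (nc + 1) / (n + k)            # k = number of classes

  def m_estimate(nc, n, k, p):
      return (nc + k * p) / (n + k)        # p = prior probability of the class

  # e.g., a rule covering n = 5 records, nc = 4 of the predicted class, k = 2
  print(accuracy(4, 5), laplace(4, 5, 2), m_estimate(4, 5, 2, 0.5))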
Rule Growing (Examples)
CN2 Algorithm:
– Start from an empty conjunct: {}
– Add conjuncts that minimize the entropy measure: {A}, {A,B}, …
– Determine the rule consequent by taking the majority class of instances covered by the rule

RIPPER Algorithm:
– Start from an empty rule: {} => class
– Add conjuncts that maximize FOIL’s information gain measure:
  R0: {} => class (initial rule)
  R1: {A} => class (rule after adding conjunct)
  Gain(R0, R1) = t [ log (p1/(p1+n1)) - log (p0/(p0+n0)) ]
  where t: number of positive instances covered by both R0 and R1
  p0: number of positive instances covered by R0
  n0: number of negative instances covered by R0
  p1: number of positive instances covered by R1
  n1: number of negative instances covered by R1
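A sketch of FOIL's information gain as defined above (base-2 logs assumed; since R1 is R0 plus one conjunct, R1 covers a subset of R0, so t = p1):

  import math

  # Sketch: FOIL's information gain for growing rule R0 into R1.
  def foil_gain(p0, n0, p1, n1):
      t = p1
      return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

  # hypothetical counts: R0 covers 10 pos / 10 neg, R1 covers 8 pos / 2 neg
  print(foil_gain(10, 10, 8, 2))  # positive gain: the conjunct helps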
Instance Elimination
Why do we need to eliminate instances?
– Otherwise, the next rule is identical to the previous rule
Why do we remove positive instances?
– Ensure that the next rule is different
Why do we remove negative instances?
– Prevent underestimating the accuracy of the rule
– Compare rules R2 and R3 in the diagram
Stopping Criterion and Rule Pruning
Examples of stopping criterion:
– If rule does not improve significantly after adding conjunct
– If rule starts covering examples from another class
Rule Pruning
– Similar to post-pruning of decision trees
– Example: using validation set (reduced error pruning)
  Remove one of the conjuncts in the rule
  Compare error rate on validation set before and after pruning
  If error improves, prune the conjunct
Summary of Direct Method
Initial rule set is empty
Repeat
– Grow a single rule
– Remove instances covered by the rule
– Prune the rule (if necessary)
– Add rule to the current rule set
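A sketch of this loop; learn_one_rule and covers are placeholders standing in for the grow/prune steps described on the previous slides:

  # Sketch of the sequential-covering loop summarized above.
  def sequential_covering(data, learn_one_rule, covers, max_rules=10):
      rule_set = []
      remaining = list(data)
      while remaining and len(rule_set) < max_rules:
          rule = learn_one_rule(remaining)   # grow (and possibly prune) one rule
          if rule is None:                   # stopping criterion met
              break
          rule_set.append(rule)              # add rule to the current rule set
          remaining = [r for r in remaining if not covers(rule, r)]  # remove covered
      return rule_set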
Direct Method: RIPPER
For 2-class problem, choose one of the classes as positive class, and the other as negative class
– Learn the rules for the positive class
– Use negative class as default
For multi-class problem
– Order the classes according to increasing class prevalence (fraction of instances that belong to a particular class)
– Learn the rule set for the smallest class first, treat the rest as negative class
– Repeat with next smallest class as positive class
Direct Method: RIPPER
Rule growing:
– Start from an empty rule: {} → +
– Add conjuncts as long as they improve FOIL’s information gain
– Stop when rule no longer covers negative examples
– Prune the rule immediately using incremental reduced error pruning
– Measure for pruning: v = (p - n)/(p + n)
  p: number of positive examples covered by the rule in the validation set
  n: number of negative examples covered by the rule in the validation set
– Pruning method: delete any final sequence of conditions that maximizes v
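A sketch of the pruning step, assuming rules are lists of (attribute, value) tests; illustrative, not the actual RIPPER implementation:

  # Sketch: delete the final sequence of conditions that maximizes v = (p-n)/(p+n).
  def covers(conjuncts, record):
      return all(record.get(a) == v for a, v in conjuncts)

  def v_metric(conjuncts, validation, label):
      covered = [r for r in validation if covers(conjuncts, r)]
      if not covered:
          return float("-inf")
      p = sum(r["Class"] == label for r in covered)
      n = len(covered) - p
      return (p - n) / (p + n)

  def prune_rule(conjuncts, validation, label):
      best = conjuncts
      for cut in range(len(conjuncts) - 1, 0, -1):   # drop final conditions
          candidate = conjuncts[:cut]
          if v_metric(candidate, validation, label) > v_metric(best, validation, label):
              best = candidate
      return best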
Direct Method: RIPPER
Building a Rule Set:
– Use sequential covering algorithm
  Grow a rule to cover the current set of positive examples
  Eliminate both positive and negative examples covered by the rule
– Each time a rule is added to the rule set, compute the new description length
  Stop adding new rules when the new description length is d bits longer than the smallest description length obtained so far
Indirect Methods
Indirect Method: C4.5rules
Extract rules for every path from root to leaf nodes
For each rule, r: A → y,
– consider an alternative rule r′: A′ → y where A′ is obtained by removing one of the conjuncts in A
– Compare the pessimistic error rate for r against all r′s
– Prune if one of the r′s has lower pessimistic error rate
– Repeat until pessimistic error rate can no longer be improved
Indirect Method: C4.5rules
Use class-based ordering
– Rules that predict the same class are grouped together into the same subset
– Compute total description length for each class
– Classes are ordered in increasing order of their total description length
Example
C4.5rules:
(Give Birth=No, Can Fly=Yes) → Birds
(Give Birth=No, Live in Water=Yes) → Fishes
(Give Birth=Yes) → Mammals
(Give Birth=No, Can Fly=No, Live in Water=No) → Reptiles
( ) → Amphibians

(figure: decision tree: Give Birth? Yes → Mammals; No → Live In Water? Yes → Fishes, Sometimes → Amphibians, No → Can Fly? Yes → Birds, No → Reptiles)
Characteristics of Rule-Based Classifiers
As highly expressive as decision trees
Easy to interpret
Easy to generate
Can classify new instances rapidly
Performance comparable to decision trees
Classification: Alternative Techniques
Instance-Based Classifiers
Instance-Based Classifiers
(figure: a set of stored cases with attributes Atr1, …, AtrN and class labels A, B, B, C, …, and an unseen case with the same attributes)
• Store the training records
• Use training records to predict the class label of unseen cases
Instance Based Classifiers
Examples:
– Rote-learner
  Memorizes entire training data and performs classification only if attributes of record match one of the training examples exactly
– Nearest neighbor
  Uses k “closest” points (nearest neighbors) for performing classification
Nearest Neighbor Classifiers
Basic idea:
– If it walks like a duck, quacks like a duck, then it’s probably a duck

(figure: compute the distance from a test record to the training records, then choose k of the “nearest” records)
Nearest-Neighbor Classifiers
Requires three things
– The set of stored records
– Distance metric to compute distance between records
– The value of k, the number of nearest neighbors to retrieve

To classify an unknown record:
– Compute distance to other training records
– Identify k nearest neighbors
– Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by taking majority vote)
Definition of Nearest Neighbor
(figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor neighborhoods around a test point x)

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x
1-nearest neighbor
Voronoi Diagram
Nearest Neighbor Classification
Compute distance between two points:
– Example: Euclidean distance
  d(p, q) = sqrt( Σ_i (p_i - q_i)² )
Determine the class from the nearest neighbor list
– Take the majority vote of class labels among the k-nearest neighbors
– Weigh the vote according to distance
  weight factor, w = 1/d²
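A self-contained sketch of the procedure above, with a toy training set:

  import math
  from collections import Counter

  # Sketch: k-NN with Euclidean distance and (optionally weighted) majority vote.
  def euclidean(p, q):
      return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

  def knn_classify(train, x, k=3, weighted=False):
      # train: list of (point, label); x: the record to classify
      neighbors = sorted(train, key=lambda pl: euclidean(pl[0], x))[:k]
      votes = Counter()
      for point, label in neighbors:
          d = euclidean(point, x)
          votes[label] += 1.0 / d ** 2 if (weighted and d > 0) else 1.0
      return votes.most_common(1)[0][0]

  train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((1.0, 1.0), "B"), ((0.9, 1.1), "B")]
  print(knn_classify(train, (0.2, 0.1), k=3))  # -> "A"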
Nearest Neighbor Classification…
Choosing the value of k:
– If k is too small, sensitive to noise points
– If k is too large, neighborhood may include points from other classes
Nearest Neighbor Classification…
Scaling issues
– Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
– Example:
  height of a person may vary from 1.5m to 1.8m
  weight of a person may vary from 90lb to 300lb
  income of a person may vary from $10K to $1M
Nearest Neighbor Classification…
Problem with Euclidean measure:
– High dimensional data
  curse of dimensionality
– Can produce counter-intuitive results:

  1 1 1 1 1 1 1 1 1 1 1 0        1 0 0 0 0 0 0 0 0 0 0 0
  0 1 1 1 1 1 1 1 1 1 1 1   vs   0 0 0 0 0 0 0 0 0 0 0 1
      d = 1.4142                     d = 1.4142

Solution: Normalize the vectors to unit length
Nearest neighbor Classification…
k-NN classifiers are lazy learners
– They do not build models explicitly
– Unlike eager learners such as decision tree induction and rule-based systems
– Classifying unknown records is relatively expensive
Example: PEBLS
PEBLS: Parallel Exemplar-Based Learning System (Cost & Salzberg)
– Works with both continuous and nominal features
  For nominal features, distance between two nominal values is computed using the modified value difference metric (MVDM)
– Each record is assigned a weight factor
– Number of nearest neighbors, k = 1
Example: PEBLS
Distance between nominal attribute values:
d(V1, V2) = Σ_i | n1i/n1 - n2i/n2 |
where n1i is the number of instances with value V1 that belong to class i, and n1 is the total number of instances with value V1 (similarly for n2i and n2)

d(Single, Married)   = | 2/4 - 0/4 | + | 2/4 - 4/4 | = 1
d(Single, Divorced)  = | 2/4 - 1/2 | + | 2/4 - 1/2 | = 0
d(Married, Divorced) = | 0/4 - 1/2 | + | 4/4 - 1/2 | = 1
d(Refund=Yes, Refund=No) = | 0/3 - 3/7 | + | 3/3 - 4/7 | = 6/7

(training data: the same 10-record Tid/Refund/Marital Status/Taxable Income/Cheat table as before)

Class counts by Marital Status:
Class  Single  Married  Divorced
Yes    2       0        1
No     2       4        1

Class counts by Refund:
Class  Yes  No
Yes    0    3
No     3    4
Example: PEBLS
Tid  Refund  Marital Status  Taxable Income  Cheat
X    Yes     Single          125K            No
Y    No      Married         100K            No

Distance between record X and record Y:
Δ(X, Y) = w_X · w_Y · Σ_{i=1..d} d(X_i, Y_i)²
where: w_X = (number of times X is used for prediction) / (number of times X predicts correctly)
w_X ≅ 1 if X makes accurate predictions most of the time
w_X > 1 if X is not reliable for making predictions
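A sketch of MVDM and the record distance, using the class counts above; illustrative, not the original PEBLS implementation:

  # Sketch: PEBLS-style nominal distance (MVDM) and record distance.
  def mvdm(counts, v1, v2):
      # counts[value][cls] = number of training records with that value and class
      n1, n2 = sum(counts[v1].values()), sum(counts[v2].values())
      classes = set(counts[v1]) | set(counts[v2])
      return sum(abs(counts[v1].get(c, 0) / n1 - counts[v2].get(c, 0) / n2)
                 for c in classes)

  status_counts = {
      "Single":   {"Yes": 2, "No": 2},
      "Married":  {"Yes": 0, "No": 4},
      "Divorced": {"Yes": 1, "No": 1},
  }
  print(mvdm(status_counts, "Single", "Married"))   # 1.0
  print(mvdm(status_counts, "Single", "Divorced"))  # 0.0

  def record_distance(x, y, counts_per_attr, w_x=1.0, w_y=1.0):
      # Delta(X, Y) = w_X * w_Y * sum_i d(X_i, Y_i)^2
      return w_x * w_y * sum(mvdm(counts_per_attr[a], x[a], y[a]) ** 2
                             for a in counts_per_attr)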
Classification: Alternative Techniques
Bayesian Classifiers
Classification: Alternative Techniques
Ensemble Methods
Ensemble Methods
Construct a set of classifiers from the training data
Predict class label of test records by combining the predictions made by multiple classifiers
Why Ensemble Methods work?
Suppose there are 25 base classifiers
– Each classifier has error rate, ε = 0.35
– Assume errors made by classifiers are uncorrelated
– Probability that the ensemble classifier makes a wrong prediction (at least 13 of the 25 classifiers wrong):

  P(X ≥ 13) = Σ_{i=13..25} C(25, i) ε^i (1 - ε)^{25-i} ≈ 0.06
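The 0.06 figure can be checked directly:

  from math import comb

  # Check: probability that at least 13 of 25 independent base classifiers,
  # each with error rate 0.35, are wrong.
  eps, n = 0.35, 25
  p_wrong = sum(comb(n, i) * eps**i * (1 - eps) ** (n - i) for i in range(13, n + 1))
  print(round(p_wrong, 3))  # ~0.06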
General Approach
Types of Ensemble Methods
Bayesian ensemble
– Example: Mixture of Gaussians
Manipulate data distribution
– Example: Resampling method
Manipulate input features
– Example: Feature subset selection
Manipulate class labels
– Example: Error-correcting output coding
Introduce randomness into learning algorithm
– Example: Random forests
Bagging
Sampling with replacement
Build classifier on each bootstrap sample
Original Data      1  2  3   4  5  6  7   8   9  10
Bagging (Round 1)  7  8  10  8  2  5  10  10  5  9
Bagging (Round 2)  1  4  9   1  2  3  2   7   3  2
Bagging (Round 3)  1  8  5   10 5  5  9   6   3  7
Each record has probability 1 - (1 - 1/n)^n ≈ 0.632 of appearing in a given bootstrap sample (and probability (1 - 1/n)^n ≈ 0.368 of being left out)
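A sketch of the bootstrap step (seeded here for reproducibility); a base classifier is then built on each sample and their predictions combined by majority vote:

  import random

  # Sketch: bootstrap sampling for bagging; each round draws n records
  # uniformly with replacement from the original data.
  def bootstrap(data, rng):
      return [rng.choice(data) for _ in data]

  data = list(range(1, 11))
  rng = random.Random(0)
  for round_no in range(1, 4):
      print(f"Bagging (Round {round_no}):", bootstrap(data, rng))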
Bagging Algorithm
Bagging Example
Consider 1-dimensional data set:

x  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
y  1    1    1    -1   -1   -1   -1   1    1    1

Classifier is a decision stump
– Decision rule: x ≤ k versus x > k
– Split point k is chosen based on entropy

(figure: decision stump; if x ≤ k is true, predict y_left, otherwise predict y_right)
Bagging Example
Bagging Round 1:
x  0.1  0.2  0.2  0.3  0.4  0.4  0.5  0.6  0.9  0.9
y  1    1    1    1    -1   -1   -1   -1   1    1
Bagging Round 2:
x  0.1  0.2  0.3  0.4  0.5  0.5  0.9  1    1    1
y  1    1    1    -1   -1   -1   1    1    1    1
Bagging Round 3:
x  0.1  0.2  0.3  0.4  0.4  0.5  0.7  0.7  0.8  0.9
y  1    1    1    -1   -1   -1   -1   -1   1    1
Bagging Round 4:
x  0.1  0.1  0.2  0.4  0.4  0.5  0.5  0.7  0.8  0.9
y  1    1    1    -1   -1   -1   -1   -1   1    1
Bagging Round 5:
x  0.1  0.1  0.2  0.5  0.6  0.6  0.6  1    1    1
y  1    1    1    -1   -1   -1   -1   1    1    1
Bagging Example
Bagging Round 6:
x  0.2  0.4  0.5  0.6  0.7  0.7  0.7  0.8  0.9  1
y  1    -1   -1   -1   -1   -1   -1   1    1    1
Bagging Round 7:
x  0.1  0.4  0.4  0.6  0.7  0.8  0.9  0.9  0.9  1
y  1    -1   -1   -1   -1   1    1    1    1    1
Bagging Round 8:
x  0.1  0.2  0.5  0.5  0.5  0.7  0.7  0.8  0.9  1
y  1    1    -1   -1   -1   -1   -1   1    1    1
Bagging Round 9:
x  0.1  0.3  0.4  0.4  0.6  0.7  0.7  0.8  1    1
y  1    1    -1   -1   -1   -1   -1   1    1    1
Bagging Round 10:
x  0.1  0.1  0.1  0.1  0.3  0.3  0.8  0.8  0.9  0.9
y  1    1    1    1    1    1    1    1    1    1
Bagging Example
Summary of Training sets:
Round  Split Point  Left Class  Right Class
1      0.35         1           -1
2      0.7          1           1
3      0.35         1           -1
4      0.3          1           -1
5      0.35         1           -1
6      0.75         -1          1
7      0.75         -1          1
8      0.75         -1          1
9      0.75         -1          1
10     0.05         1           1
Bagging Example
Assume test set is the same as the original data
Use majority vote to determine the class of the ensemble classifier

Round  x=0.1  x=0.2  x=0.3  x=0.4  x=0.5  x=0.6  x=0.7  x=0.8  x=0.9  x=1.0
1      1      1      1      -1     -1     -1     -1     -1     -1     -1
2      1      1      1      1      1      1      1      1      1      1
3      1      1      1      -1     -1     -1     -1     -1     -1     -1
4      1      1      1      -1     -1     -1     -1     -1     -1     -1
5      1      1      1      -1     -1     -1     -1     -1     -1     -1
6      -1     -1     -1     -1     -1     -1     -1     1      1      1
7      -1     -1     -1     -1     -1     -1     -1     1      1      1
8      -1     -1     -1     -1     -1     -1     -1     1      1      1
9      -1     -1     -1     -1     -1     -1     -1     1      1      1
10     1      1      1      1      1      1      1      1      1      1
Sum    2      2      2      -6     -6     -6     -6     2      2      2
Sign   1      1      1      -1     -1     -1     -1     1      1      1

Predicted Class: the sign of the sum (majority vote)
Boosting
An iterative procedure to adaptively change distribution of training data by focusing more on previously misclassified records
– Initially, all N records are assigned equal weights
– Unlike bagging, weights may change at the end of each boosting round
Boosting
Records that are wrongly classified will have their weights increased
Records that are classified correctly will have their weights decreased

Original Data       1  2  3  4   5  6  7  8   9  10
Boosting (Round 1)  7  3  2  8   7  9  4  10  6  3
Boosting (Round 2)  5  4  9  4   2  5  1  7   4  2
Boosting (Round 3)  4  4  8  10  4  5  4  6   3  4

• Example 4 is hard to classify
• Its weight is increased, therefore it is more likely to be chosen again in subsequent rounds
AdaBoost
Base classifiers: C1, C2, …, CT
Error rate:
ε_i = (1/N) Σ_{j=1..N} w_j δ( C_i(x_j) ≠ y_j )

Importance of a classifier:
α_i = (1/2) ln( (1 - ε_i) / ε_i )
AdaBoost Algorithm
Weight update:
w_i^(j+1) = ( w_i^(j) / Z_j ) × exp(-α_j)   if C_j(x_i) = y_i
w_i^(j+1) = ( w_i^(j) / Z_j ) × exp(α_j)    if C_j(x_i) ≠ y_i
where Z_j is the normalization factor

If any intermediate round produces an error rate higher than 50%, the weights are reverted back to 1/n and the resampling procedure is repeated

Classification:
C*(x) = argmax_y Σ_{j=1..T} α_j δ( C_j(x) = y )
AdaBoost Algorithm
AdaBoost Example
Consider 1-dimensional data set:

x  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
y  1    1    1    -1   -1   -1   -1   1    1    1

Classifier is a decision stump
– Decision rule: x ≤ k versus x > k
– Split point k is chosen based on entropy

(figure: decision stump; if x ≤ k is true, predict y_left, otherwise predict y_right)
AdaBoost Example
Training sets for the first 3 boosting rounds:

Boosting Round 1:
x  0.1  0.4  0.5  0.6  0.6  0.7  0.7  0.7  0.8  1
y  1    -1   -1   -1   -1   -1   -1   -1   1    1
Boosting Round 2:
x  0.1  0.1  0.2  0.2  0.2  0.2  0.3  0.3  0.3  0.3
y  1    1    1    1    1    1    1    1    1    1
Boosting Round 3:
x  0.2  0.2  0.4  0.4  0.4  0.4  0.5  0.6  0.6  0.7
y  1    1    -1   -1   -1   -1   -1   -1   -1   -1

Summary:
Round  Split Point  Left Class  Right Class  alpha
1      0.75         -1          1            1.738
2      0.05         1           1            2.7784
3      0.3          1           -1           4.1195
AdaBoost Example
Weights:
Round  x=0.1  x=0.2  x=0.3  x=0.4  x=0.5  x=0.6  x=0.7  x=0.8  x=0.9  x=1.0
1      0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1
2      0.311  0.311  0.311  0.01   0.01   0.01   0.01   0.01   0.01   0.01
3      0.029  0.029  0.029  0.228  0.228  0.228  0.228  0.009  0.009  0.009

Classification:
Round  x=0.1  x=0.2  x=0.3  x=0.4  x=0.5  x=0.6  x=0.7  x=0.8  x=0.9  x=1.0
1      -1     -1     -1     -1     -1     -1     -1     1      1      1
2      1      1      1      1      1      1      1      1      1      1
3      1      1      1      -1     -1     -1     -1     -1     -1     -1
Sum    5.16   5.16   5.16   -3.08  -3.08  -3.08  -3.08  0.397  0.397  0.397
Sign   1      1      1      -1     -1     -1     -1     1      1      1

Predicted Class: the sign of the alpha-weighted sum
Classification: Alternative Techniques
Imbalanced Class Problem
Class Imbalance Problem
Lots of classification problems where the classes are skewed (more records from one class than another)
– Credit card fraud
– Intrusion detection
– Defective products in manufacturing assembly line
Challenges
Evaluation measures such as accuracy are not well-suited for imbalanced classes
Detecting the rare class is like finding a needle in a haystack
Confusion Matrix
Confusion Matrix:

                    PREDICTED CLASS
                    Class=Yes  Class=No
ACTUAL  Class=Yes   a          b
CLASS   Class=No    c          d

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
Accuracy
                    PREDICTED CLASS
                    Class=Yes  Class=No
ACTUAL  Class=Yes   a (TP)     b (FN)
CLASS   Class=No    c (FP)     d (TN)

Most widely-used metric:
Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Problem with Accuracy
Consider a 2-class problem
– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10
If a model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
– This is misleading because the model does not detect any class 1 example
– Detecting the rare class is usually more interesting (e.g., frauds, intrusions, defects, etc.)
Alternative Measures
                    PREDICTED CLASS
                    Class=Yes  Class=No
ACTUAL  Class=Yes   a (TP)     b (FN)
CLASS   Class=No    c (FP)     d (TN)

Precision (p) = a / (a + c)
Recall (r) = a / (a + b)
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
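A sketch of these measures computed from the counts a, b, c:

  # Sketch: precision, recall, and F-measure from confusion-matrix counts.
  def precision(a, c):
      return a / (a + c)

  def recall(a, b):
      return a / (a + b)

  def f_measure(a, b, c):
      return 2 * a / (2 * a + b + c)

  # hypothetical counts: TP = 40, FN = 10, FP = 10
  print(precision(40, 10), recall(40, 10), f_measure(40, 10, 10))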
ROC (Receiver Operating Characteristic)
A graphical approach for displaying trade-off between detection rate and false alarm rate
Developed in 1950s for signal detection theory to analyze noisy signals
ROC curve plots TPR against FPR
– TPR = TP/(TP+FN), FPR = FP/(TN+FP)
– Performance of a model represented as a point in an ROC curve
– Changing the threshold parameter of classifier changes the location of the point
ROC Curve
(TPR, FPR):
(0,0): declare everything to be negative class
(1,1): declare everything to be positive class
(1,0): ideal

Diagonal line:
– Random guessing
– Below diagonal line: prediction is opposite of the true class
ROC (Receiver Operating Characteristic)
To draw ROC curve, classifier must produce continuous-valued output
– Outputs are used to rank test records, from the most likely positive class record to the least likely positive class record
Many classifiers produce only discrete outputs (i.e., predicted class)
– How to get continuous-valued outputs?
  Decision trees, rule-based classifiers, neural networks, Bayesian classifiers, k-nearest neighbors, SVM
Example: Decision Trees
Decision Tree
Continuous-valued outputs
ROC Curve Example
Using ROC for Model Comparison
No model consistently outperforms the other
– M1 is better for small FPR
– M2 is better for large FPR

Area Under the ROC curve (AUC)
– Ideal: Area = 1
– Random guess: Area = 0.5
How to Construct an ROC curve
• Use a classifier that produces a continuous-valued output for each test instance, score(+|A)
• Sort the instances according to score(+|A) in decreasing order
• Apply a threshold at each unique value of score(+|A)
• Count the number of TP, FP, TN, FN at each threshold
• TPR = TP/(TP+FN)
• FPR = FP/(FP+TN)

Instance  score(+|A)  True Class
1         0.95        +
2         0.93        +
3         0.87        -
4         0.85        -
5         0.85        -
6         0.85        +
7         0.76        -
8         0.53        +
9         0.43        -
10        0.25        +
How to construct an ROC curve
Class          +     -     +     -     -     -     +     -     +     +
Threshold >=   0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP             5     4     4     3     3     3     3     2     2     1     0
FP             5     5     4     4     3     2     1     1     0     0     0
TN             0     0     1     1     2     3     4     4     5     5     5
FN             0     1     1     2     2     2     2     3     3     4     5
TPR            1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR            1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

ROC Curve: (figure: the ROC curve traced by the (FPR, TPR) pairs above)
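A sketch that reproduces the TPR/FPR columns above from the scores and true classes:

  # Sketch: building ROC points by thresholding at each unique score.
  scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
  labels = ['+', '+', '-', '-', '-', '+', '-', '+', '-', '+']

  P = labels.count('+')
  N = labels.count('-')
  for t in sorted(set(scores) | {1.00}, reverse=True):
      tp = sum(s >= t and c == '+' for s, c in zip(scores, labels))
      fp = sum(s >= t and c == '-' for s, c in zip(scores, labels))
      print(f"threshold >= {t:.2f}: TPR = {tp / P:.1f}, FPR = {fp / N:.1f}")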
Handling Class Imbalanced Problem
Class-based ordering (e.g., RIPPER)
– Rules for rare class have higher priority
Cost-sensitive classification
– Misclassifying rare class as majority class is more expensive than misclassifying majority as rare class
Sampling-based approaches
Cost Matrix
                    PREDICTED CLASS
                    Class=Yes    Class=No
ACTUAL  Class=Yes   f(Yes, Yes)  f(Yes, No)
CLASS   Class=No    f(No, Yes)   f(No, No)

Cost Matrix:
                    PREDICTED CLASS
        C(i, j)     Class=Yes    Class=No
ACTUAL  Class=Yes   C(Yes, Yes)  C(Yes, No)
CLASS   Class=No    C(No, Yes)   C(No, No)

C(i, j): cost of misclassifying a class i example as class j

Cost = Σ_{i,j} C(i, j) × f(i, j)
Computing Cost of Classification
Cost Matrix:
                 PREDICTED CLASS
        C(i, j)  +     -
ACTUAL  +        -1    100
CLASS   -        1     0

Model M1:
                 PREDICTED CLASS
                 +     -
ACTUAL  +        150   40
CLASS   -        60    250

Accuracy = 80%
Cost = 3910

Model M2:
                 PREDICTED CLASS
                 +     -
ACTUAL  +        250   45
CLASS   -        5     200

Accuracy = 90%
Cost = 4255
Cost Sensitive Classification
Example: Bayesian classifier
– Given a test record x:
  Compute p(i|x) for each class i
  Decision rule: classify x as class k if k = argmax_i p(i|x)
– For 2-class, classify x as + if p(+|x) > p(-|x)
  This decision rule implicitly assumes that C(+,+) = C(-,-) = 0 and C(+,-) = C(-,+)
Cost Sensitive Classification
General decision rule:
– Classify test record x as class k if k = argmin_j Σ_i p(i|x) × C(i, j)
2-class:
– Cost(+) = p(+|x) C(+,+) + p(-|x) C(-,+)
– Cost(-) = p(+|x) C(+,-) + p(-|x) C(-,-)
– Decision rule: classify x as + if Cost(+) < Cost(-)
  if C(+,+) = C(-,-) = 0:
  classify x as + if p(+|x) > C(-,+) / ( C(-,+) + C(+,-) )
Sampling-based Approaches
Modify the distribution of training data so that the rare class is well-represented in the training set
– Undersample the majority class
– Oversample the rare class
Advantages and disadvantages