Classification: Alternative Techniques
Dr. Hui Xiong, Rutgers University
Classification: Alternative Techniques
Rule-based Classifier
Rule-Based Classifier
Classify records by using a collection of “if…then…” rules
Rule: (Condition) → y
– where Condition is a conjunction of attribute tests and y is the class label
– LHS: rule antecedent or condition
– RHS: rule consequent
– Examples of classification rules:
  (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
  (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No
Rule-based Classifier (Example)
Application of Rule-Based Classifier
A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule
The rule r1 covers a hawk => Bird
The rule r3 covers the grizzly bear => Mammal
Rule Coverage and Accuracy
Coverage of a rule:
– Fraction of records that satisfy the antecedent of a rule
Accuracy of a rule:
– Fraction of records that satisfy both the antecedent and consequent of a rule

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(Status=Single) → No
Coverage = 40%, Accuracy = 50%
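For concreteness, a small Python sketch (not from the original slides) that computes coverage and accuracy of (Status=Single) → No on the ten records above:

  # Sketch: computing rule coverage and accuracy over the training records.
  records = [
      {"Refund": "Yes", "Status": "Single",   "Income": 125, "Class": "No"},
      {"Refund": "No",  "Status": "Married",  "Income": 100, "Class": "No"},
      {"Refund": "No",  "Status": "Single",   "Income": 70,  "Class": "No"},
      {"Refund": "Yes", "Status": "Married",  "Income": 120, "Class": "No"},
      {"Refund": "No",  "Status": "Divorced", "Income": 95,  "Class": "Yes"},
      {"Refund": "No",  "Status": "Married",  "Income": 60,  "Class": "No"},
      {"Refund": "Yes", "Status": "Divorced", "Income": 220, "Class": "No"},
      {"Refund": "No",  "Status": "Single",   "Income": 85,  "Class": "Yes"},
      {"Refund": "No",  "Status": "Married",  "Income": 75,  "Class": "No"},
      {"Refund": "No",  "Status": "Single",   "Income": 90,  "Class": "Yes"},
  ]

  def covers(condition, record):
      # A rule covers a record if every test in its antecedent is satisfied.
      return all(record[attr] == value for attr, value in condition.items())

  def coverage_and_accuracy(condition, consequent, data):
      covered = [r for r in data if covers(condition, r)]
      coverage = len(covered) / len(data)
      accuracy = sum(r["Class"] == consequent for r in covered) / len(covered)
      return coverage, accuracy

  # (Status=Single) -> No: coverage = 4/10 = 40%, accuracy = 2/4 = 50%
  print(coverage_and_accuracy({"Status": "Single"}, "No", records))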
How does Rule-based Classifier Work?
A lemur triggers rule r3, so it is classified as a mammal
A turtle triggers both r4 and r5
A dogfish shark triggers none of the rules
Characteristics of Rule-Based Classifier
Mutually exclusive rules
– Classifier contains mutually exclusive rules if the rules are independent of each other
– Every record is covered by at most one rule
Exhaustive rules
– Classifier has exhaustive coverage if it accounts for every possible combination of attribute values
– Each record is covered by at least one rule
From Decision Trees To Rules
(figure: decision tree splitting on Refund {Yes, No}, then Marital Status {Single, Divorced} vs. {Married}, then Taxable Income < 80K vs. > 80K)

Classification Rules:
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No

Rules are mutually exclusive and exhaustive
Rule set contains as much information as the tree
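A minimal sketch of this tree-to-rules conversion; the nested-tuple tree encoding is a hypothetical representation chosen for illustration:

  # Sketch: extracting one classification rule per root-to-leaf path.
  tree = ("Refund", {
      "Yes": "No",
      "No": ("Marital Status", {
          "Single,Divorced": ("Taxable Income < 80K", {"True": "No", "False": "Yes"}),
          "Married": "No",
      }),
  })

  def tree_to_rules(node, conjuncts=()):
      if isinstance(node, str):                 # leaf: emit one rule for this path
          return [(conjuncts, node)]
      attribute, branches = node
      rules = []
      for value, child in branches.items():
          rules += tree_to_rules(child, conjuncts + ((attribute, value),))
      return rules

  for antecedent, label in tree_to_rules(tree):
      print(" AND ".join(f"{a}={v}" for a, v in antecedent), "==>", label)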
Rules Can Be Simplified
(figure: the same decision tree and the same 10-record training data table as on the previous slides)
Initial Rule: (Refund=No) ∧ (Status=Married) → No
Simplified Rule: (Status=Married) → No
Effect of Rule Simplification
Rules are no longer mutually exclusive
– A record may trigger more than one rule
– Solution?
  Ordered rule set
  Unordered rule set – use voting schemes
Rules are no longer exhaustive
– A record may not trigger any rules
– Solution?
  Use a default class
Ordered Rule Set
Rules are rank-ordered according to their priority
– An ordered rule set is known as a decision list
When a test record is presented to the classifier
– It is assigned to the class label of the highest-ranked rule it has triggered
– If none of the rules fired, it is assigned to the default class
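A minimal sketch of decision-list classification; the rules and default class here are hypothetical, for illustration only:

  # Sketch: first-match classification over an ordered rule set (decision list).
  rules = [
      ({"Refund": "Yes"}, "No"),                      # highest priority
      ({"Refund": "No", "Status": "Married"}, "No"),
  ]
  DEFAULT_CLASS = "Yes"

  def classify(record):
      for condition, label in rules:
          if all(record.get(a) == v for a, v in condition.items()):
              return label                            # highest-ranked rule fired
      return DEFAULT_CLASS                            # no rule fired

  print(classify({"Refund": "No", "Status": "Single"}))  # -> default class "Yes"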
Rule Ordering Schemes
Rule-based ordering
– Individual rules are ranked based on their quality/priority
Class-based ordering
– Rules that belong to the same class appear together
Building Classification Rules
Direct Method:
– Extract rules directly from data
– e.g.: RIPPER, CN2, 1R, and AQ
Indirect Method:
– Extract rules from other classification models (e.g., decision trees, neural networks, SVM, etc.)
– e.g.: C4.5rules
Direct Method: Sequential Covering
Example of Sequential Covering
(figure: sequential covering, (i) original data and (ii) Step 1)
Example of Sequential Covering…
(figure: (iii) Step 2, rule R1; (iv) Step 3, rules R1 and R2)
Aspects of Sequential Covering
Rule Growing
Rule Evaluation
Instance Elimination
Stopping Criterion
Rule Pruning
Rule Growing
Two common strategies
(figure: (a) General-to-specific rule growing: conjuncts such as Refund=No, Status=Single, Status=Divorced, Status=Married, and Income>80K are added to the empty rule { } (Yes: 3, No: 4) and evaluated by the class counts of the records they cover)
Rule Evaluation
Evaluation metric determines which conjunct should be added during rule growing
– Accuracy = n_c / n
– Laplace = (n_c + 1) / (n + k)
– M-estimate = (n_c + k·p) / (n + k)
where
n : number of instances covered by rule
n_c : number of instances of class c covered by rule
k : number of classes
p : prior probability of class c
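These metrics are simple to compute; a sketch with hypothetical counts:

  # Sketch: the three evaluation metrics above.
  def accuracy(nc, n):
      return nc / n

  def laplace(nc, n, k):
      return (nc + 1) / (n + k)            # k = number of classes

  def m_estimate(nc, n, k, p):
      return (nc + k * p) / (n + k)        # p = prior probability of the class

  # e.g., a rule covering n = 5 records, nc = 4 of the predicted class, k = 2
  print(accuracy(4, 5), laplace(4, 5, 2), m_estimate(4, 5, 2, 0.5))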
Rule Growing (Examples)
CN2 Algorithm:
– Start from an empty conjunct: {}
– Add conjuncts that minimize the entropy measure: {A}, {A,B}, …
– Determine the rule consequent by taking the majority class of instances covered by the rule

RIPPER Algorithm:
– Start from an empty rule: {} => class
– Add conjuncts that maximize FOIL’s information gain measure:
  R0: {} => class (initial rule)
  R1: {A} => class (rule after adding conjunct)
  Gain(R0, R1) = t [ log (p1/(p1+n1)) - log (p0/(p0+n0)) ]
  where t: number of positive instances covered by both R0 and R1
  p0: number of positive instances covered by R0
  n0: number of negative instances covered by R0
  p1: number of positive instances covered by R1
  n1: number of negative instances covered by R1
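A sketch of FOIL's information gain as defined above (base-2 logs assumed; since R1 is R0 plus one conjunct, R1 covers a subset of R0, so t = p1):

  import math

  # Sketch: FOIL's information gain for growing rule R0 into R1.
  def foil_gain(p0, n0, p1, n1):
      t = p1
      return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

  # hypothetical counts: R0 covers 10 pos / 10 neg, R1 covers 8 pos / 2 neg
  print(foil_gain(10, 10, 8, 2))  # positive gain: the conjunct helps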
Instance Elimination
Why do we need to eliminate instances?
– Otherwise, the next rule is identical to the previous rule
Why do we remove positive instances?
– Ensure that the next rule is different
Why do we remove negative instances?
– Prevent underestimating the accuracy of the rule
– Compare rules R2 and R3 in the diagram
Stopping Criterion and Rule Pruning
Examples of stopping criterion:
– If rule does not improve significantly after adding conjunct
– If rule starts covering examples from another class
Rule Pruning
– Similar to post-pruning of decision trees
– Example: using validation set (reduced error pruning)
  Remove one of the conjuncts in the rule
  Compare error rate on validation set before and after pruning
  If error improves, prune the conjunct
Summary of Direct Method
Initial rule set is empty
Repeat
– Grow a single rule
– Remove instances covered by the rule
– Prune the rule (if necessary)
– Add rule to the current rule set
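A sketch of this loop; learn_one_rule and covers are placeholders standing in for the grow/prune steps described on the previous slides:

  # Sketch of the sequential-covering loop summarized above.
  def sequential_covering(data, learn_one_rule, covers, max_rules=10):
      rule_set = []
      remaining = list(data)
      while remaining and len(rule_set) < max_rules:
          rule = learn_one_rule(remaining)   # grow (and possibly prune) one rule
          if rule is None:                   # stopping criterion met
              break
          rule_set.append(rule)              # add rule to the current rule set
          remaining = [r for r in remaining if not covers(rule, r)]  # remove covered
      return rule_set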
Direct Method: RIPPER
For 2-class problem, choose one of the classes as positive class, and the other as negative class
– Learn the rules for the positive class
– Use negative class as default
For multi-class problem
– Order the classes according to increasing class prevalence (fraction of instances that belong to a particular class)
– Learn the rule set for the smallest class first, treat the rest as negative class
– Repeat with next smallest class as positive class
Direct Method: RIPPER
Rule growing:
– Start from an empty rule: {} → +
– Add conjuncts as long as they improve FOIL’s information gain
– Stop when rule no longer covers negative examples
– Prune the rule immediately using incremental reduced error pruning
– Measure for pruning: v = (p - n)/(p + n)
  p: number of positive examples covered by the rule in the validation set
  n: number of negative examples covered by the rule in the validation set
– Pruning method: delete any final sequence of conditions that maximizes v
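A sketch of the pruning step, assuming rules are lists of (attribute, value) tests; illustrative, not the actual RIPPER implementation:

  # Sketch: delete the final sequence of conditions that maximizes v = (p-n)/(p+n).
  def covers(conjuncts, record):
      return all(record.get(a) == v for a, v in conjuncts)

  def v_metric(conjuncts, validation, label):
      covered = [r for r in validation if covers(conjuncts, r)]
      if not covered:
          return float("-inf")
      p = sum(r["Class"] == label for r in covered)
      n = len(covered) - p
      return (p - n) / (p + n)

  def prune_rule(conjuncts, validation, label):
      best = conjuncts
      for cut in range(len(conjuncts) - 1, 0, -1):   # drop final conditions
          candidate = conjuncts[:cut]
          if v_metric(candidate, validation, label) > v_metric(best, validation, label):
              best = candidate
      return best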
Direct Method: RIPPER
Building a Rule Set:
– Use sequential covering algorithm
  Grow a rule to cover the current set of positive examples
  Eliminate both positive and negative examples covered by the rule
– Each time a rule is added to the rule set, compute the new description length
  Stop adding new rules when the new description length is d bits longer than the smallest description length obtained so far
Indirect Methods
Indirect Method: C4.5rules
Extract rules for every path from root to leaf nodes
For each rule, r: A → y,
– consider an alternative rule r′: A′ → y where A′ is obtained by removing one of the conjuncts in A
– Compare the pessimistic error rate for r against all r′s
– Prune if one of the r′s has lower pessimistic error rate
– Repeat until pessimistic error rate can no longer be improved
Indirect Method: C4.5rules
Use class-based ordering
– Rules that predict the same class are grouped together into the same subset
– Compute total description length for each class
– Classes are ordered in increasing order of their total description length
Example
C4.5rules:
(Give Birth=No, Can Fly=Yes) → Birds
(Give Birth=No, Live in Water=Yes) → Fishes
(Give Birth=Yes) → Mammals
(Give Birth=No, Can Fly=No, Live in Water=No) → Reptiles
( ) → Amphibians

(figure: decision tree: Give Birth? Yes → Mammals; No → Live In Water? Yes → Fishes, Sometimes → Amphibians, No → Can Fly? Yes → Birds, No → Reptiles)
Characteristics of Rule-Based Classifiers
As highly expressive as decision trees
Easy to interpret
Easy to generate
Can classify new instances rapidly
Performance comparable to decision trees
Classification: Alternative Techniques
Instance-Based Classifiers
Instance-Based Classifiers
(figure: a set of stored cases with attributes Atr1, …, AtrN and class labels A, B, B, C, …, and an unseen case with the same attributes)
• Store the training records
• Use training records to predict the class label of unseen cases
Instance Based Classifiers
Examples:
– Rote-learner
  Memorizes entire training data and performs classification only if attributes of record match one of the training examples exactly
– Nearest neighbor
  Uses k “closest” points (nearest neighbors) for performing classification
Nearest Neighbor Classifiers
Basic idea:
– If it walks like a duck, quacks like a duck, then it’s probably a duck

(figure: compute the distance from a test record to the training records, then choose k of the “nearest” records)
Nearest-Neighbor Classifiers
Requires three things
– The set of stored records
– Distance metric to compute distance between records
– The value of k, the number of nearest neighbors to retrieve

To classify an unknown record:
– Compute distance to other training records
– Identify k nearest neighbors
– Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by taking majority vote)
Definition of Nearest Neighbor
(figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor neighborhoods around a test point x)

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x
1-nearest neighbor
Voronoi Diagram
Nearest Neighbor Classification
Compute distance between two points:
– Example: Euclidean distance
  d(p, q) = sqrt( Σ_i (p_i - q_i)² )
Determine the class from the nearest neighbor list
– Take the majority vote of class labels among the k-nearest neighbors
– Weigh the vote according to distance
  weight factor, w = 1/d²
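A self-contained sketch of the procedure above, with a toy training set:

  import math
  from collections import Counter

  # Sketch: k-NN with Euclidean distance and (optionally weighted) majority vote.
  def euclidean(p, q):
      return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

  def knn_classify(train, x, k=3, weighted=False):
      # train: list of (point, label); x: the record to classify
      neighbors = sorted(train, key=lambda pl: euclidean(pl[0], x))[:k]
      votes = Counter()
      for point, label in neighbors:
          d = euclidean(point, x)
          votes[label] += 1.0 / d ** 2 if (weighted and d > 0) else 1.0
      return votes.most_common(1)[0][0]

  train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((1.0, 1.0), "B"), ((0.9, 1.1), "B")]
  print(knn_classify(train, (0.2, 0.1), k=3))  # -> "A"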
Nearest Neighbor Classification…
Choosing the value of k:
– If k is too small, sensitive to noise points
– If k is too large, neighborhood may include points from other classes
Nearest Neighbor Classification…
Scaling issues
– Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
– Example:
  height of a person may vary from 1.5m to 1.8m
  weight of a person may vary from 90lb to 300lb
  income of a person may vary from $10K to $1M
Nearest Neighbor Classification…
Problem with Euclidean measure:
– High dimensional data
  curse of dimensionality
– Can produce counter-intuitive results:

  1 1 1 1 1 1 1 1 1 1 1 0        1 0 0 0 0 0 0 0 0 0 0 0
  0 1 1 1 1 1 1 1 1 1 1 1   vs   0 0 0 0 0 0 0 0 0 0 0 1
      d = 1.4142                     d = 1.4142

Solution: Normalize the vectors to unit length
Nearest neighbor Classification…
k-NN classifiers are lazy learners
– They do not build models explicitly
– Unlike eager learners such as decision tree induction and rule-based systems
– Classifying unknown records is relatively expensive
Example: PEBLS
PEBLS: Parallel Exemplar-Based Learning System (Cost & Salzberg)
– Works with both continuous and nominal features
  For nominal features, distance between two nominal values is computed using the modified value difference metric (MVDM)
– Each record is assigned a weight factor
– Number of nearest neighbors, k = 1
Example: PEBLS
Distance between nominal attribute values:
d(V1, V2) = Σ_i | n1i/n1 - n2i/n2 |
where n1i is the number of instances with value V1 that belong to class i, and n1 is the total number of instances with value V1 (similarly for n2i and n2)

d(Single, Married)   = | 2/4 - 0/4 | + | 2/4 - 4/4 | = 1
d(Single, Divorced)  = | 2/4 - 1/2 | + | 2/4 - 1/2 | = 0
d(Married, Divorced) = | 0/4 - 1/2 | + | 4/4 - 1/2 | = 1
d(Refund=Yes, Refund=No) = | 0/3 - 3/7 | + | 3/3 - 4/7 | = 6/7

(training data: the same 10-record Tid/Refund/Marital Status/Taxable Income/Cheat table as before)

Class counts by Marital Status:
Class  Single  Married  Divorced
Yes    2       0        1
No     2       4        1

Class counts by Refund:
Class  Yes  No
Yes    0    3
No     3    4
Example: PEBLS
Tid  Refund  Marital Status  Taxable Income  Cheat
X    Yes     Single          125K            No
Y    No      Married         100K            No

Distance between record X and record Y:
Δ(X, Y) = w_X · w_Y · Σ_{i=1..d} d(X_i, Y_i)²
where: w_X = (number of times X is used for prediction) / (number of times X predicts correctly)
w_X ≅ 1 if X makes accurate predictions most of the time
w_X > 1 if X is not reliable for making predictions
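A sketch of MVDM and the record distance, using the class counts above; illustrative, not the original PEBLS implementation:

  # Sketch: PEBLS-style nominal distance (MVDM) and record distance.
  def mvdm(counts, v1, v2):
      # counts[value][cls] = number of training records with that value and class
      n1, n2 = sum(counts[v1].values()), sum(counts[v2].values())
      classes = set(counts[v1]) | set(counts[v2])
      return sum(abs(counts[v1].get(c, 0) / n1 - counts[v2].get(c, 0) / n2)
                 for c in classes)

  status_counts = {
      "Single":   {"Yes": 2, "No": 2},
      "Married":  {"Yes": 0, "No": 4},
      "Divorced": {"Yes": 1, "No": 1},
  }
  print(mvdm(status_counts, "Single", "Married"))   # 1.0
  print(mvdm(status_counts, "Single", "Divorced"))  # 0.0

  def record_distance(x, y, counts_per_attr, w_x=1.0, w_y=1.0):
      # Delta(X, Y) = w_X * w_Y * sum_i d(X_i, Y_i)^2
      return w_x * w_y * sum(mvdm(counts_per_attr[a], x[a], y[a]) ** 2
                             for a in counts_per_attr)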
Classification: Alternative Techniques
Bayesian Classifiers
Classification: Alternative Techniques
Ensemble Methods
Ensemble Methods
Construct a set of classifiers from the training data
Predict class label of test records by combining the predictions made by multiple classifiers
Why Ensemble Methods work?
Suppose there are 25 base classifiers
– Each classifier has error rate, ε = 0.35
– Assume errors made by classifiers are uncorrelated
– Probability that the ensemble classifier makes a wrong prediction (at least 13 of the 25 classifiers wrong):

  P(X ≥ 13) = Σ_{i=13..25} C(25, i) ε^i (1 - ε)^{25-i} ≈ 0.06
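The 0.06 figure can be checked directly:

  from math import comb

  # Check: probability that at least 13 of 25 independent base classifiers,
  # each with error rate 0.35, are wrong.
  eps, n = 0.35, 25
  p_wrong = sum(comb(n, i) * eps**i * (1 - eps) ** (n - i) for i in range(13, n + 1))
  print(round(p_wrong, 3))  # ~0.06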
General Approach
Types of Ensemble Methods
Bayesian ensemble
– Example: Mixture of Gaussians
Manipulate data distribution
– Example: Resampling method
Manipulate input features
– Example: Feature subset selection
Manipulate class labels
– Example: Error-correcting output coding
Introduce randomness into learning algorithm
– Example: Random forests
Bagging
Sampling with replacement
Build classifier on each bootstrap sample
Original Data      1  2  3   4  5  6  7   8   9  10
Bagging (Round 1)  7  8  10  8  2  5  10  10  5  9
Bagging (Round 2)  1  4  9   1  2  3  2   7   3  2
Bagging (Round 3)  1  8  5   10 5  5  9   6   3  7
Each record has probability 1 - (1 - 1/n)^n ≈ 0.632 of appearing in a given bootstrap sample (and probability (1 - 1/n)^n ≈ 0.368 of being left out)
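A sketch of the bootstrap step (seeded here for reproducibility); a base classifier is then built on each sample and their predictions combined by majority vote:

  import random

  # Sketch: bootstrap sampling for bagging; each round draws n records
  # uniformly with replacement from the original data.
  def bootstrap(data, rng):
      return [rng.choice(data) for _ in data]

  data = list(range(1, 11))
  rng = random.Random(0)
  for round_no in range(1, 4):
      print(f"Bagging (Round {round_no}):", bootstrap(data, rng))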
Bagging Algorithm
Bagging Example
Consider 1-dimensional data set:

x  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
y  1    1    1    -1   -1   -1   -1   1    1    1

Classifier is a decision stump
– Decision rule: x ≤ k versus x > k
– Split point k is chosen based on entropy

(figure: decision stump; if x ≤ k is true, predict y_left, otherwise predict y_right)
Bagging Example
Bagging Round 1:
x  0.1  0.2  0.2  0.3  0.4  0.4  0.5  0.6  0.9  0.9
y  1    1    1    1    -1   -1   -1   -1   1    1
Bagging Round 2:
x  0.1  0.2  0.3  0.4  0.5  0.5  0.9  1    1    1
y  1    1    1    -1   -1   -1   1    1    1    1
Bagging Round 3:
x  0.1  0.2  0.3  0.4  0.4  0.5  0.7  0.7  0.8  0.9
y  1    1    1    -1   -1   -1   -1   -1   1    1
Bagging Round 4:
x  0.1  0.1  0.2  0.4  0.4  0.5  0.5  0.7  0.8  0.9
y  1    1    1    -1   -1   -1   -1   -1   1    1
Bagging Round 5:
x  0.1  0.1  0.2  0.5  0.6  0.6  0.6  1    1    1
y  1    1    1    -1   -1   -1   -1   1    1    1
Bagging Example
Bagging Round 6:
x  0.2  0.4  0.5  0.6  0.7  0.7  0.7  0.8  0.9  1
y  1    -1   -1   -1   -1   -1   -1   1    1    1
Bagging Round 7:
x  0.1  0.4  0.4  0.6  0.7  0.8  0.9  0.9  0.9  1
y  1    -1   -1   -1   -1   1    1    1    1    1
Bagging Round 8:
x  0.1  0.2  0.5  0.5  0.5  0.7  0.7  0.8  0.9  1
y  1    1    -1   -1   -1   -1   -1   1    1    1
Bagging Round 9:
x  0.1  0.3  0.4  0.4  0.6  0.7  0.7  0.8  1    1
y  1    1    -1   -1   -1   -1   -1   1    1    1
Bagging Round 10:
x  0.1  0.1  0.1  0.1  0.3  0.3  0.8  0.8  0.9  0.9
y  1    1    1    1    1    1    1    1    1    1
Bagging Example
Summary of Training sets:
Round  Split Point  Left Class  Right Class
1      0.35         1           -1
2      0.7          1           1
3      0.35         1           -1
4      0.3          1           -1
5      0.35         1           -1
6      0.75         -1          1
7      0.75         -1          1
8      0.75         -1          1
9      0.75         -1          1
10     0.05         1           1
Bagging Example
Assume test set is the same as the original data
Use majority vote to determine the class of the ensemble classifier

Round  x=0.1  x=0.2  x=0.3  x=0.4  x=0.5  x=0.6  x=0.7  x=0.8  x=0.9  x=1.0
1      1      1      1      -1     -1     -1     -1     -1     -1     -1
2      1      1      1      1      1      1      1      1      1      1
3      1      1      1      -1     -1     -1     -1     -1     -1     -1
4      1      1      1      -1     -1     -1     -1     -1     -1     -1
5      1      1      1      -1     -1     -1     -1     -1     -1     -1
6      -1     -1     -1     -1     -1     -1     -1     1      1      1
7      -1     -1     -1     -1     -1     -1     -1     1      1      1
8      -1     -1     -1     -1     -1     -1     -1     1      1      1
9      -1     -1     -1     -1     -1     -1     -1     1      1      1
10     1      1      1      1      1      1      1      1      1      1
Sum    2      2      2      -6     -6     -6     -6     2      2      2
Sign   1      1      1      -1     -1     -1     -1     1      1      1

Predicted Class: the sign of the sum (majority vote)
Boosting
An iterative procedure to adaptively change distribution of training data by focusing more on previously misclassified records
– Initially, all N records are assigned equal weights
– Unlike bagging, weights may change at the end of each boosting round
Boosting
Records that are wrongly classified will have their weights increased
Records that are classified correctly will have their weights decreased

Original Data       1  2  3  4   5  6  7  8   9  10
Boosting (Round 1)  7  3  2  8   7  9  4  10  6  3
Boosting (Round 2)  5  4  9  4   2  5  1  7   4  2
Boosting (Round 3)  4  4  8  10  4  5  4  6   3  4

• Example 4 is hard to classify
• Its weight is increased, therefore it is more likely to be chosen again in subsequent rounds
AdaBoost
Base classifiers: C1, C2, …, CT
Error rate:
ε_i = (1/N) Σ_{j=1..N} w_j δ( C_i(x_j) ≠ y_j )

Importance of a classifier:
α_i = (1/2) ln( (1 - ε_i) / ε_i )
AdaBoost Algorithm
Weight update:
w_i^(j+1) = ( w_i^(j) / Z_j ) × exp(-α_j)   if C_j(x_i) = y_i
w_i^(j+1) = ( w_i^(j) / Z_j ) × exp(α_j)    if C_j(x_i) ≠ y_i
where Z_j is the normalization factor

If any intermediate round produces an error rate higher than 50%, the weights are reverted back to 1/n and the resampling procedure is repeated

Classification:
C*(x) = argmax_y Σ_{j=1..T} α_j δ( C_j(x) = y )
AdaBoost Algorithm
AdaBoost Example
Consider 1-dimensional data set:

x  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
y  1    1    1    -1   -1   -1   -1   1    1    1

Classifier is a decision stump
– Decision rule: x ≤ k versus x > k
– Split point k is chosen based on entropy

(figure: decision stump; if x ≤ k is true, predict y_left, otherwise predict y_right)
AdaBoost Example
Training sets for the first 3 boosting rounds:

Boosting Round 1:
x  0.1  0.4  0.5  0.6  0.6  0.7  0.7  0.7  0.8  1
y  1    -1   -1   -1   -1   -1   -1   -1   1    1
Boosting Round 2:
x  0.1  0.1  0.2  0.2  0.2  0.2  0.3  0.3  0.3  0.3
y  1    1    1    1    1    1    1    1    1    1
Boosting Round 3:
x  0.2  0.2  0.4  0.4  0.4  0.4  0.5  0.6  0.6  0.7
y  1    1    -1   -1   -1   -1   -1   -1   -1   -1

Summary:
Round  Split Point  Left Class  Right Class  alpha
1      0.75         -1          1            1.738
2      0.05         1           1            2.7784
3      0.3          1           -1           4.1195
AdaBoost Example
Weights:
Round  x=0.1  x=0.2  x=0.3  x=0.4  x=0.5  x=0.6  x=0.7  x=0.8  x=0.9  x=1.0
1      0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1
2      0.311  0.311  0.311  0.01   0.01   0.01   0.01   0.01   0.01   0.01
3      0.029  0.029  0.029  0.228  0.228  0.228  0.228  0.009  0.009  0.009

Classification:
Round  x=0.1  x=0.2  x=0.3  x=0.4  x=0.5  x=0.6  x=0.7  x=0.8  x=0.9  x=1.0
1      -1     -1     -1     -1     -1     -1     -1     1      1      1
2      1      1      1      1      1      1      1      1      1      1
3      1      1      1      -1     -1     -1     -1     -1     -1     -1
Sum    5.16   5.16   5.16   -3.08  -3.08  -3.08  -3.08  0.397  0.397  0.397
Sign   1      1      1      -1     -1     -1     -1     1      1      1

Predicted Class: the sign of the alpha-weighted sum
Classification: Alternative Techniques
Imbalanced Class Problem
Class Imbalance Problem
Lots of classification problems where the classes are skewed (more records from one class than another)
– Credit card fraud
– Intrusion detection
– Defective products in manufacturing assembly line
Challenges
Evaluation measures such as accuracy are not well-suited for imbalanced classes
Detecting the rare class is like finding a needle in a haystack
Confusion Matrix
Confusion Matrix:

                    PREDICTED CLASS
                    Class=Yes  Class=No
ACTUAL  Class=Yes   a          b
CLASS   Class=No    c          d

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
Accuracy
                    PREDICTED CLASS
                    Class=Yes  Class=No
ACTUAL  Class=Yes   a (TP)     b (FN)
CLASS   Class=No    c (FP)     d (TN)

Most widely-used metric:
Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Problem with Accuracy
Consider a 2-class problem
– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10
If a model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
– This is misleading because the model does not detect any class 1 example
– Detecting the rare class is usually more interesting (e.g., frauds, intrusions, defects, etc.)
Alternative Measures
                    PREDICTED CLASS
                    Class=Yes  Class=No
ACTUAL  Class=Yes   a (TP)     b (FN)
CLASS   Class=No    c (FP)     d (TN)

Precision (p) = a / (a + c)
Recall (r) = a / (a + b)
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
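A sketch of these measures computed from the counts a, b, c:

  # Sketch: precision, recall, and F-measure from confusion-matrix counts.
  def precision(a, c):
      return a / (a + c)

  def recall(a, b):
      return a / (a + b)

  def f_measure(a, b, c):
      return 2 * a / (2 * a + b + c)

  # hypothetical counts: TP = 40, FN = 10, FP = 10
  print(precision(40, 10), recall(40, 10), f_measure(40, 10, 10))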
ROC (Receiver Operating Characteristic)
A graphical approach for displaying trade-off between detection rate and false alarm rate
Developed in 1950s for signal detection theory to analyze noisy signals
ROC curve plots TPR against FPR
– TPR = TP/(TP+FN), FPR = FP/(TN+FP)
– Performance of a model represented as a point in an ROC curve
– Changing the threshold parameter of classifier changes the location of the point
ROC Curve
(TPR, FPR):
(0,0): declare everything to be negative class
(1,1): declare everything to be positive class
(1,0): ideal

Diagonal line:
– Random guessing
– Below diagonal line: prediction is opposite of the true class
ROC (Receiver Operating Characteristic)
To draw ROC curve, classifier must produce continuous-valued output
– Outputs are used to rank test records, from the most likely positive class record to the least likely positive class record
Many classifiers produce only discrete outputs (i.e., predicted class)
– How to get continuous-valued outputs?
  Decision trees, rule-based classifiers, neural networks, Bayesian classifiers, k-nearest neighbors, SVM
Example: Decision Trees
Decision Tree
Continuous-valued outputs
ROC Curve Example
Using ROC for Model Comparison
No model consistently outperforms the other
– M1 is better for small FPR
– M2 is better for large FPR

Area Under the ROC curve (AUC)
– Ideal: Area = 1
– Random guess: Area = 0.5
How to Construct an ROC curve
• Use a classifier that produces a continuous-valued output for each test instance, score(+|A)
• Sort the instances according to score(+|A) in decreasing order
• Apply a threshold at each unique value of score(+|A)
• Count the number of TP, FP, TN, FN at each threshold
• TPR = TP/(TP+FN)
• FPR = FP/(FP+TN)

Instance  score(+|A)  True Class
1         0.95        +
2         0.93        +
3         0.87        -
4         0.85        -
5         0.85        -
6         0.85        +
7         0.76        -
8         0.53        +
9         0.43        -
10        0.25        +
How to construct an ROC curve
Class          +     -     +     -     -     -     +     -     +     +
Threshold >=   0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP             5     4     4     3     3     3     3     2     2     1     0
FP             5     5     4     4     3     2     1     1     0     0     0
TN             0     0     1     1     2     3     4     4     5     5     5
FN             0     1     1     2     2     2     2     3     3     4     5
TPR            1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR            1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

ROC Curve: (figure: the ROC curve traced by the (FPR, TPR) pairs above)
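A sketch that reproduces the TPR/FPR columns above from the scores and true classes:

  # Sketch: building ROC points by thresholding at each unique score.
  scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
  labels = ['+', '+', '-', '-', '-', '+', '-', '+', '-', '+']

  P = labels.count('+')
  N = labels.count('-')
  for t in sorted(set(scores) | {1.00}, reverse=True):
      tp = sum(s >= t and c == '+' for s, c in zip(scores, labels))
      fp = sum(s >= t and c == '-' for s, c in zip(scores, labels))
      print(f"threshold >= {t:.2f}: TPR = {tp / P:.1f}, FPR = {fp / N:.1f}")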
Handling Class Imbalanced Problem
Class-based ordering (e.g., RIPPER)
– Rules for rare class have higher priority
Cost-sensitive classification
– Misclassifying rare class as majority class is more expensive than misclassifying majority as rare class
Sampling-based approaches
Cost Matrix
                    PREDICTED CLASS
                    Class=Yes    Class=No
ACTUAL  Class=Yes   f(Yes, Yes)  f(Yes, No)
CLASS   Class=No    f(No, Yes)   f(No, No)

Cost Matrix:
                    PREDICTED CLASS
        C(i, j)     Class=Yes    Class=No
ACTUAL  Class=Yes   C(Yes, Yes)  C(Yes, No)
CLASS   Class=No    C(No, Yes)   C(No, No)

C(i, j): cost of misclassifying a class i example as class j

Cost = Σ_{i,j} C(i, j) × f(i, j)
Computing Cost of Classification
Cost Matrix:
                 PREDICTED CLASS
        C(i, j)  +     -
ACTUAL  +        -1    100
CLASS   -        1     0

Model M1:
                 PREDICTED CLASS
                 +     -
ACTUAL  +        150   40
CLASS   -        60    250

Accuracy = 80%
Cost = 3910

Model M2:
                 PREDICTED CLASS
                 +     -
ACTUAL  +        250   45
CLASS   -        5     200

Accuracy = 90%
Cost = 4255
Cost Sensitive Classification
Example: Bayesian classifier
– Given a test record x:
  Compute p(i|x) for each class i
  Decision rule: classify x as class k if k = argmax_i p(i|x)
– For 2-class, classify x as + if p(+|x) > p(-|x)
  This decision rule implicitly assumes that C(+,+) = C(-,-) = 0 and C(+,-) = C(-,+)
Cost Sensitive Classification
General decision rule:
– Classify test record x as class k if k = argmin_j Σ_i p(i|x) × C(i, j)
2-class:
– Cost(+) = p(+|x) C(+,+) + p(-|x) C(-,+)
– Cost(-) = p(+|x) C(+,-) + p(-|x) C(-,-)
– Decision rule: classify x as + if Cost(+) < Cost(-)
  if C(+,+) = C(-,-) = 0:
  classify x as + if p(+|x) > C(-,+) / ( C(-,+) + C(+,-) )
Sampling-based Approaches
Modify the distribution of training data so that the rare class is well-represented in the training set
– Undersample the majority class
– Oversample the rare class
Advantages and disadvantages