Classification: Alternative Techniques

Dr. Hui Xiong, Rutgers University

Classification: Alternative Techniques

Rule-based Classifier


Rule-Based Classifier

Classify records by using a collection of “if…then…” rules

Rule: (Condition) → y
– where Condition is a conjunction of attribute tests and y is the class label
– LHS: rule antecedent or condition
– RHS: rule consequent
– Examples of classification rules:
    (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
    (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No

Rule-based Classifier (Example)


Application of Rule-Based Classifier

A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule

The rule r1 covers a hawk => Bird
The rule r3 covers the grizzly bear => Mammal

Rule Coverage and Accuracy

Coverage of a rule:
– Fraction of records that satisfy the antecedent of the rule

Accuracy of a rule:
– Fraction of the records that satisfy the antecedent that also satisfy the consequent of the rule

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(Status=Single) → No

Coverage = 40%, Accuracy = 50%
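To make the coverage and accuracy computation concrete, here is a small Python sketch (the dictionary encoding and helper names are mine, not from the slides) that evaluates the rule (Status=Single) → No against the ten training records above.

```python
# Illustrative sketch: coverage and accuracy of the rule (Status=Single) -> No
records = [
    {"Tid": 1,  "Refund": "Yes", "Status": "Single",   "Income": 125, "Class": "No"},
    {"Tid": 2,  "Refund": "No",  "Status": "Married",  "Income": 100, "Class": "No"},
    {"Tid": 3,  "Refund": "No",  "Status": "Single",   "Income": 70,  "Class": "No"},
    {"Tid": 4,  "Refund": "Yes", "Status": "Married",  "Income": 120, "Class": "No"},
    {"Tid": 5,  "Refund": "No",  "Status": "Divorced", "Income": 95,  "Class": "Yes"},
    {"Tid": 6,  "Refund": "No",  "Status": "Married",  "Income": 60,  "Class": "No"},
    {"Tid": 7,  "Refund": "Yes", "Status": "Divorced", "Income": 220, "Class": "No"},
    {"Tid": 8,  "Refund": "No",  "Status": "Single",   "Income": 85,  "Class": "Yes"},
    {"Tid": 9,  "Refund": "No",  "Status": "Married",  "Income": 75,  "Class": "No"},
    {"Tid": 10, "Refund": "No",  "Status": "Single",   "Income": 90,  "Class": "Yes"},
]

antecedent = lambda r: r["Status"] == "Single"   # rule condition (LHS)
consequent = "No"                                # rule class (RHS)

covered = [r for r in records if antecedent(r)]
coverage = len(covered) / len(records)                                    # 4/10
accuracy = sum(r["Class"] == consequent for r in covered) / len(covered)  # 2/4
print(f"Coverage = {coverage:.0%}, Accuracy = {accuracy:.0%}")
```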


How does Rule-based Classifier Work?

A lemur triggers rule r3, so it is classified as a mammal
A turtle triggers both r4 and r5
A dogfish shark triggers none of the rules

Characteristics of Rule-Based Classifier

Mutually exclusive rules
– Classifier contains mutually exclusive rules if the rules are independent of each other
– Every record is covered by at most one rule

Exhaustive rules
– Classifier has exhaustive coverage if it accounts for every possible combination of attribute values
– Each record is covered by at least one rule

From Decision Trees To Rules

(Decision tree: Refund? Yes → NO; No → Marital Status? {Married} → NO; {Single, Divorced} → Taxable Income? < 80K → NO, > 80K → YES)

Classification Rules:
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No

Rules are mutually exclusive and exhaustive
Rule set contains as much information as the tree

Rules Can Be Simplified

(Same decision tree and the same ten training records as above; the class column is labeled "Cheat".)

Initial Rule: (Refund=No) ∧ (Status=Married) → No
Simplified Rule: (Status=Married) → No

Effect of Rule Simplification

Rules are no longer mutually exclusive
– A record may trigger more than one rule
– Solution?
    Ordered rule set
    Unordered rule set – use voting schemes

Rules are no longer exhaustive
– A record may not trigger any rules
– Solution?
    Use a default class

Ordered Rule Set

Rules are rank ordered according to their priority
– An ordered rule set is known as a decision list

When a test record is presented to the classifier
– It is assigned to the class label of the highest-ranked rule it has triggered
– If none of the rules fire, it is assigned to the default class
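A minimal sketch of how an ordered rule set (decision list) is applied, assuming each rule is encoded as a (condition, class) pair; the function and the two sample rules below are illustrative, not from the slides.

```python
# Decision-list sketch: rules are tried in priority order; the first rule whose
# condition fires assigns the class, otherwise the default class is returned.
def classify(record, ordered_rules, default_class):
    for condition, label in ordered_rules:   # rules already sorted by priority
        if condition(record):
            return label
    return default_class

# Hypothetical rules in the spirit of the vertebrate example used in these slides.
rules = [
    (lambda r: r["Give Birth"] == "no" and r["Can Fly"] == "yes", "Birds"),
    (lambda r: r["Give Birth"] == "yes", "Mammals"),
]
print(classify({"Give Birth": "yes", "Can Fly": "no"}, rules, "Reptiles"))  # -> Mammals
```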

Rule Ordering Schemes

Rule-based ordering
– Individual rules are ranked based on their quality/priority

Class-based ordering
– Rules that belong to the same class appear together

Building Classification Rules

Direct Method:
– Extract rules directly from data
– e.g.: RIPPER, CN2, 1R, and AQ

Indirect Method:
– Extract rules from other classification models (e.g., decision trees, neural networks, SVM, etc.)
– e.g.: C4.5rules

Direct Method: Sequential Covering

Example of Sequential Covering

(Figure: (ii) Step 1 — rule R1; (iii) Step 2 — rules R1 and R2; (iv) Step 3)

Aspects of Sequential Covering

Rule Growing
– Rule evaluation

Instance Elimination

Stopping Criterion

Rule Pruning

Rule Growing

Two common strategies

(Figure (a) General-to-specific: start from an empty rule { } and add candidate conjuncts such as Refund=No, Status=Single, Status=Divorced, Status=Married, or Income>80K, comparing the Yes/No class counts of the records each candidate covers)

Rule Evaluation

Evaluation metric determines which conjunct should be added during rule growing

– Accuracy = nc / n

– Laplace = (nc + 1) / (n + k)

– M-estimate = (nc + k·p) / (n + k)

n : Number of instances covered by rule
nc : Number of instances of class c covered by rule
k : Number of classes
p : Prior probability
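The three evaluation metrics can be transcribed directly from the formulas above; the function names in this sketch are my own, and the example counts are made up.

```python
# Rule-evaluation metrics: n = instances covered by the rule, nc = covered
# instances of the predicted class c, k = number of classes, p = prior of c.
def rule_accuracy(nc, n):
    return nc / n

def laplace(nc, n, k):
    return (nc + 1) / (n + k)

def m_estimate(nc, n, k, p):
    # With a uniform prior p = 1/k this reduces to the Laplace estimate.
    return (nc + k * p) / (n + k)

# Example: a rule covering 5 records, 4 of the predicted class, 2 classes, prior 0.5
print(rule_accuracy(4, 5), laplace(4, 5, 2), m_estimate(4, 5, 2, 0.5))
# 0.8  0.714...  0.714...
```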


Rule Growing (Examples)

CN2 Algorithm:
– Start from an empty conjunct: {}
– Add conjuncts that minimize the entropy measure: {A}, {A,B}, …
– Determine the rule consequent by taking the majority class of instances covered by the rule

RIPPER Algorithm:
– Start from an empty rule: {} => class
– Add conjuncts that maximize FOIL's information gain measure:
    R0: {} => class (initial rule)
    R1: {A} => class (rule after adding conjunct)
    Gain(R0, R1) = t [ log (p1/(p1+n1)) – log (p0/(p0+n0)) ]
    where t: number of positive instances covered by both R0 and R1
          p0: number of positive instances covered by R0
          n0: number of negative instances covered by R0
          p1: number of positive instances covered by R1
          n1: number of negative instances covered by R1
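A direct transcription of FOIL's information gain; the slide writes "log" without a base, so base 2 is an assumption here, and the example counts are made up.

```python
import math

# FOIL's information gain when growing rule R0 into R1.
# p0, n0: positive / negative examples covered by R0
# p1, n1: positive / negative examples covered by R1
# t: positive examples covered by both R0 and R1
def foil_gain(t, p0, n0, p1, n1):
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# Example: R0 covers 100+/400-, adding a conjunct leaves 30+/10-,
# and those 30 positives were also covered by R0.
print(foil_gain(t=30, p0=100, n0=400, p1=30, n1=10))
```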

Instance Elimination

Why do we need to eliminate instances?

– Otherwise, the next rule is identical to the previous rule

Why do we remove positive instances?
– Ensure that the next rule is different

Why do we remove negative instances?
– Prevent underestimating the accuracy of the rule
– Compare rules R2 and R3 in the diagram

Stopping Criterion and Rule Pruning

Examples of stopping criteria:
– If the rule does not improve significantly after adding a conjunct
– If the rule starts covering examples from another class

Rule Pruning
– Similar to post-pruning of decision trees
– Example: using a validation set (reduced error pruning)
    Remove one of the conjuncts in the rule
    Compare the error rate on the validation set before and after pruning
    If the error improves, prune the conjunct

Summary of Direct Method

Initial rule set is empty

Repeat
– Grow a single rule
– Remove instances covered by the rule
– Prune the rule (if necessary)
– Add the rule to the current rule set
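The loop above can be written as a short skeleton; learn_one_rule, prune, and covers are assumed placeholder helpers for the rule-growing, pruning, and coverage steps described on the previous slides.

```python
# Skeleton of the sequential covering loop summarized above.
# Assumptions: records are dicts with a "Class" key; each rule is whatever
# object learn_one_rule returns; covers(rule, record) tests the antecedent.
def sequential_covering(records, target_class, learn_one_rule, prune, covers):
    rule_set = []                          # initial rule set is empty
    remaining = list(records)
    while any(r["Class"] == target_class for r in remaining):
        rule = learn_one_rule(remaining, target_class)   # grow a single rule
        if rule is None:                                  # stopping criterion
            break
        rule = prune(rule, remaining)                     # prune if necessary
        remaining = [r for r in remaining if not covers(rule, r)]  # remove covered
        rule_set.append(rule)                             # add rule to rule set
    return rule_set
```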

Direct Method: RIPPER

For a 2-class problem, choose one of the classes as the positive class and the other as the negative class
– Learn the rules for the positive class
– Use the negative class as the default

For a multi-class problem
– Order the classes according to increasing class prevalence (fraction of instances that belong to a particular class)
– Learn the rule set for the smallest class first, treating the rest as the negative class
– Repeat with the next smallest class as the positive class

Direct Method: RIPPER

Rule growing:
– Start from an empty rule: {} → +
– Add conjuncts as long as they improve FOIL's information gain
– Stop when the rule no longer covers negative examples
– Prune the rule immediately using incremental reduced error pruning
– Measure for pruning: v = (p − n) / (p + n)
    p: number of positive examples covered by the rule in the validation set
    n: number of negative examples covered by the rule in the validation set
– Pruning method: delete any final sequence of conditions that maximizes v


Direct Method: RIPPER

Building a Rule Set:
– Use the sequential covering algorithm
    Grow a rule to cover the current set of positive examples
    Eliminate both positive and negative examples covered by the rule
– Each time a rule is added to the rule set, compute the new description length
    Stop adding new rules when the new description length is d bits longer than the smallest description length obtained so far

Indirect Methods


Indirect Method: C4.5rules

Extract rules for every path from the root to the leaf nodes
For each rule, r: A → y
– consider an alternative rule r': A' → y, where A' is obtained by removing one of the conjuncts in A
– Compare the pessimistic error rate for r against all r's
– Prune if one of the r's has a lower pessimistic error rate
– Repeat until the pessimistic error rate can no longer be improved

Indirect Method: C4.5rules

Use class-based ordering
– Rules that predict the same class are grouped together into the same subset
– Compute the total description length for each class
– Classes are ordered in increasing order of their total description length

Example

(Decision tree: Give Birth? Yes → Mammals; No → Live In Water? Yes → Fishes, Sometimes → Amphibians, No → Can Fly? Yes → Birds, No → Reptiles)

C4.5rules:
(Give Birth=No, Can Fly=Yes) → Birds
(Give Birth=No, Live in Water=Yes) → Fishes
(Give Birth=Yes) → Mammals
(Give Birth=No, Can Fly=No, Live in Water=No) → Reptiles
( ) → Amphibians

Characteristics of Rule-Based Classifiers

As highly expressive as decision trees
Easy to interpret
Easy to generate
Can classify new instances rapidly
Performance comparable to decision trees

Classification: Alternative Techniques

Instance-Based Classifiers


Instance-Based Classifiers

(Figure: a Set of Stored Cases — records with attributes Atr1, …, AtrN and a class label — and an Unseen Case with the same attributes but an unknown class)

• Store the training records
• Use training records to predict the class label of unseen cases

Instance Based Classifiers

Examples:
– Rote-learner
    Memorizes the entire training data and performs classification only if the attributes of a record match one of the training examples exactly
– Nearest neighbor
    Uses k "closest" points (nearest neighbors) to perform classification

Nearest Neighbor Classifiers

Basic idea:
– If it walks like a duck, quacks like a duck, then it's probably a duck

(Figure: compute the distance between the test record and the training records, then choose k of the "nearest" records)

Nearest-Neighbor Classifiers

Requires three things
– The set of stored records
– A distance metric to compute the distance between records
– The value of k, the number of nearest neighbors to retrieve

To classify an unknown record:
– Compute its distance to the other training records
– Identify the k nearest neighbors
– Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)

Definition of Nearest Neighbor

(Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbors, (c) 3-nearest neighbors of a record x)

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x

1-nearest neighbor

Voronoi Diagram

Nearest Neighbor Classification

Compute the distance between two points:
– Example: Euclidean distance

    d(p, q) = sqrt( Σ_i (p_i − q_i)² )

Determine the class from the nearest-neighbor list
– Take the majority vote of the class labels among the k nearest neighbors
– Weigh the vote according to distance
    weight factor, w = 1/d²
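A compact sketch of the procedure just described — Euclidean distance, the k closest training records, and a plain or distance-weighted (w = 1/d²) majority vote; all names and the toy data are illustrative.

```python
import math
from collections import defaultdict

def euclidean(p, q):
    # d(p, q) = sqrt( sum_i (p_i - q_i)^2 )
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train, test_point, k=3, weighted=False):
    # train: list of (feature_vector, class_label) pairs
    neighbors = sorted(train, key=lambda xy: euclidean(xy[0], test_point))[:k]
    votes = defaultdict(float)
    for x, y in neighbors:
        d = euclidean(x, test_point)
        votes[y] += 1.0 / (d * d + 1e-12) if weighted else 1.0   # w = 1/d^2
    return max(votes, key=votes.get)

train = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((3.0, 3.2), "-"), ((2.9, 3.0), "-")]
print(knn_predict(train, (1.1, 1.0), k=3))   # -> "+"
```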


Nearest Neighbor Classification…

Choosing the value of k:
– If k is too small, sensitive to noise points
– If k is too large, the neighborhood may include points from other classes

Nearest Neighbor Classification…

Scaling issues
– Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
– Example:
    height of a person may vary from 1.5m to 1.8m
    weight of a person may vary from 90lb to 300lb
    income of a person may vary from $10K to $1M
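One common way to do this is min–max normalization, which rescales each attribute to [0, 1] before distances are computed; the specific scaling method is my assumption (the slide does not prescribe one), and the three rows below just echo the height/weight/income example.

```python
# Min-max scaling: rescale each attribute to [0, 1] so that no single attribute
# (e.g., income in dollars) dominates the Euclidean distance.
def min_max_scale(rows):
    cols = list(zip(*rows))
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    return [tuple((v - l) / (h - l) if h > l else 0.0
                  for v, l, h in zip(row, lo, hi)) for row in rows]

people = [(1.5, 90, 10_000), (1.8, 300, 1_000_000), (1.7, 180, 60_000)]  # height, weight, income
print(min_max_scale(people))
```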


Nearest Neighbor Classification…

Problem with Euclidean measure:
– High dimensional data
    curse of dimensionality
– Can produce counter-intuitive results

    1 1 1 1 1 1 1 1 1 1 1 0        1 0 0 0 0 0 0 0 0 0 0 0
    0 1 1 1 1 1 1 1 1 1 1 1   vs   0 0 0 0 0 0 0 0 0 0 0 1
    d = 1.4142                      d = 1.4142

Solution: Normalize the vectors to unit length
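The counter-intuitive result above (both pairs are at distance 1.4142, even though the first pair is nearly identical and the second pair shares nothing) and the unit-length fix can be reproduced in a few lines; the helper names are mine.

```python
import math

def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def unit(v):
    # scale a vector to unit length
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v] if n else v

a1 = [1] * 11 + [0];  a2 = [0] + [1] * 11   # nearly identical binary vectors
b1 = [1] + [0] * 11;  b2 = [0] * 11 + [1]   # completely different binary vectors
print(euclid(a1, a2), euclid(b1, b2))                           # both ~1.4142
print(euclid(unit(a1), unit(a2)), euclid(unit(b1), unit(b2)))   # ~0.43 vs ~1.41
```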

Nearest Neighbor Classification…

k-NN classifiers are lazy learners
– They do not build models explicitly
– Unlike eager learners such as decision tree induction and rule-based systems
– Classifying unknown records is relatively expensive

Example: PEBLS

PEBLS: Parallel Exemplar-Based Learning System (Cost & Salzberg)
– Works with both continuous and nominal features
    For nominal features, the distance between two nominal values is computed using the modified value difference metric (MVDM)
– Each record is assigned a weight factor
– Number of nearest neighbors, k = 1

Example: PEBLS

Distance between nominal attribute values:

d(Single, Married)   = | 2/4 – 0/4 | + | 2/4 – 4/4 | = 1
d(Single, Divorced)  = | 2/4 – 1/2 | + | 2/4 – 1/2 | = 0
d(Married, Divorced) = | 0/4 – 1/2 | + | 4/4 – 1/2 | = 1
d(Refund=Yes, Refund=No) = | 0/3 – 3/7 | + | 3/3 – 4/7 | = 6/7

MVDM: d(V1, V2) = Σ_i | n1i/n1 – n2i/n2 |
where n1i (n2i) is the number of records of class i that take value V1 (V2), and n1 (n2) is the total number of records with value V1 (V2)

Class counts by Marital Status:
Class   Single  Married  Divorced
Yes     2       0        1
No      2       4        1

Class counts by Refund:
Class   Yes  No
Yes     0    3
No      3    4

(Training records: the same ten-record Refund / Marital Status / Taxable Income / Cheat table shown earlier)
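The MVDM values above can be reproduced from the contingency tables; the data-structure layout in this sketch is my own.

```python
# Modified value difference metric: d(V1, V2) = sum_i | n1i/n1 - n2i/n2 |
def mvdm(counts, v1, v2):
    n1, n2 = sum(counts[v1].values()), sum(counts[v2].values())
    classes = set(counts[v1]) | set(counts[v2])
    return sum(abs(counts[v1].get(c, 0) / n1 - counts[v2].get(c, 0) / n2)
               for c in classes)

# Class counts by Marital Status from the table above.
marital = {"Single":   {"Yes": 2, "No": 2},
           "Married":  {"Yes": 0, "No": 4},
           "Divorced": {"Yes": 1, "No": 1}}
print(mvdm(marital, "Single", "Married"))    # 1.0
print(mvdm(marital, "Single", "Divorced"))   # 0.0
print(mvdm(marital, "Married", "Divorced"))  # 1.0
```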


Example: PEBLS

Tid  Refund  Marital Status  Taxable Income  Cheat
X    Yes     Single          125K            No
Y    No      Married         100K            No

Distance between record X and record Y:

    Δ(X, Y) = wX wY Σ_{i=1..d} d(Xi, Yi)²

where: wX = (Number of times X is used for prediction) / (Number of times X predicts correctly)

wX ≅ 1 if X makes accurate predictions most of the time
wX > 1 if X is not reliable for making predictions

Classification: Alternative Techniques

Bayesian Classifiers


Classification: Alternative Techniques

Ensemble Methods


Ensemble Methods

Construct a set of classifiers from the training data

Predict class label of test records by combining the predictions made by multiple classifiers


Why Ensemble Methods work?

Suppose there are 25 base classifiers
– Each classifier has error rate ε = 0.35
– Assume errors made by the classifiers are uncorrelated
– Probability that the ensemble classifier makes a wrong prediction (i.e., at least 13 of the 25 base classifiers are wrong):

    P(X ≥ 13) = Σ_{i=13..25} C(25, i) ε^i (1 − ε)^(25−i) = 0.06
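The 0.06 figure follows from the binomial sum above, since the majority-vote ensemble errs only when at least 13 of the 25 independent base classifiers err; a quick numerical check:

```python
from math import comb

eps, n = 0.35, 25
# P(ensemble wrong) = P(at least 13 of 25 base classifiers are wrong)
p_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(13, n + 1))
print(round(p_wrong, 3))   # ~0.06
```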

General Approach


Types of Ensemble Methods

Bayesian ensemble
– Example: Mixture of Gaussian

Manipulate data distribution
– Example: Resampling method

Manipulate input features
– Example: Feature subset selection

Manipulate class labels
– Example: Error-correcting output coding

Introduce randomness into learning algorithm
– Example: Random forests

Bagging

Sampling with replacement

Build classifier on each bootstrap sample

Original Data:      1  2  3  4  5  6  7  8  9  10
Bagging (Round 1):  7  8  10 8  2  5  10 10 5  9
Bagging (Round 2):  1  4  9  1  2  3  2  7  3  2
Bagging (Round 3):  1  8  5  10 5  5  9  6  3  7

Each record has probability 1 − (1 − 1/n)^n ≈ 0.632 (for large n) of being selected in a given bootstrap sample
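Each bagging round just draws n record ids with replacement; a small sketch using Python's random module (the seeds and the empirical 0.632 check are illustrative):

```python
import random

def bootstrap_sample(n, seed=None):
    # draw n record ids (1..n) with replacement, i.e., one bagging round
    rng = random.Random(seed)
    return [rng.randrange(1, n + 1) for _ in range(n)]

print(bootstrap_sample(10, seed=1))   # one round over records 1..10

# Fraction of distinct records that appear in a bootstrap sample (~0.632 for large n)
n = 1000
sample = bootstrap_sample(n, seed=2)
print(len(set(sample)) / n)
```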


Bagging Algorithm


Bagging Example

Consider 1-dimensional data set:

x: 0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1
y: 1    1    1    -1   -1   -1   -1   1    1    1

Classifier is a decision stump
– Decision rule: x ≤ k versus x > k
– Split point k is chosen based on entropy

(Figure: the stump tests x ≤ k; the True branch predicts yleft, the False branch predicts yright)

Bagging Example

Bagging Round 1:
x: 0.1  0.2  0.2  0.3  0.4  0.4  0.5  0.6  0.9  0.9
y: 1    1    1    1    -1   -1   -1   -1   1    1

Bagging Round 2:
x: 0.1  0.2  0.3  0.4  0.5  0.5  0.9  1    1    1
y: 1    1    1    -1   -1   -1   1    1    1    1

Bagging Round 3:
x: 0.1  0.2  0.3  0.4  0.4  0.5  0.7  0.7  0.8  0.9
y: 1    1    1    -1   -1   -1   -1   -1   1    1

Bagging Round 4:
x: 0.1  0.1  0.2  0.4  0.4  0.5  0.5  0.7  0.8  0.9
y: 1    1    1    -1   -1   -1   -1   -1   1    1

Bagging Round 5:
x: 0.1  0.1  0.2  0.5  0.6  0.6  0.6  1    1    1
y: 1    1    1    -1   -1   -1   -1   1    1    1

Bagging Example

Bagging Round 6:
x: 0.2  0.4  0.5  0.6  0.7  0.7  0.7  0.8  0.9  1
y: 1    -1   -1   -1   -1   -1   -1   1    1    1

Bagging Round 7:
x: 0.1  0.4  0.4  0.6  0.7  0.8  0.9  0.9  0.9  1
y: 1    -1   -1   -1   -1   1    1    1    1    1

Bagging Round 8:
x: 0.1  0.2  0.5  0.5  0.5  0.7  0.7  0.8  0.9  1
y: 1    1    -1   -1   -1   -1   -1   1    1    1

Bagging Round 9:
x: 0.1  0.3  0.4  0.4  0.6  0.7  0.7  0.8  1    1
y: 1    1    -1   -1   -1   -1   -1   1    1    1

Bagging Round 10:
x: 0.1  0.1  0.1  0.1  0.3  0.3  0.8  0.8  0.9  0.9
y: 1    1    1    1    1    1    1    1    1    1

Bagging Example

Summary of Training sets:

Round  Split Point  Left Class  Right Class
1      0.35         1           -1
2      0.7          1           1
3      0.35         1           -1
4      0.3          1           -1
5      0.35         1           -1
6      0.75         -1          1
7      0.75         -1          1
8      0.75         -1          1
9      0.75         -1          1
10     0.05         1           1

Bagging Example

Assume the test set is the same as the original data
Use majority vote to determine the class of the ensemble classifier

Round  x=0.1  x=0.2  x=0.3  x=0.4  x=0.5  x=0.6  x=0.7  x=0.8  x=0.9  x=1.0
1      1      1      1      -1     -1     -1     -1     -1     -1     -1
2      1      1      1      1      1      1      1      1      1      1
3      1      1      1      -1     -1     -1     -1     -1     -1     -1
4      1      1      1      -1     -1     -1     -1     -1     -1     -1
5      1      1      1      -1     -1     -1     -1     -1     -1     -1
6      -1     -1     -1     -1     -1     -1     -1     1      1      1
7      -1     -1     -1     -1     -1     -1     -1     1      1      1
8      -1     -1     -1     -1     -1     -1     -1     1      1      1
9      -1     -1     -1     -1     -1     -1     -1     1      1      1
10     1      1      1      1      1      1      1      1      1      1

Sum    2      2      2      -6     -6     -6     -6     2      2      2
Sign   1      1      1      -1     -1     -1     -1     1      1      1   (Predicted Class)

Boosting

An iterative procedure to adaptively change the distribution of the training data by focusing more on previously misclassified records
– Initially, all N records are assigned equal weights
– Unlike bagging, weights may change at the end of each boosting round

Boosting

Records that are wrongly classified will have their weights increased
Records that are classified correctly will have their weights decreased

Original Data:       1  2  3  4  5  6  7  8  9  10
Boosting (Round 1):  7  3  2  8  7  9  4  10 6  3
Boosting (Round 2):  5  4  9  4  2  5  1  7  4  2
Boosting (Round 3):  4  4  8  10 4  5  4  6  3  4

• Example 4 is hard to classify

• Its weight is increased, therefore it is more likely to be chosen again in subsequent rounds


AdaBoost

Base classifiers: C1, C2, …, CT

Error rate:

    ε_i = (1/N) Σ_{j=1..N} w_j δ( C_i(x_j) ≠ y_j )

Importance of a classifier:

    α_i = (1/2) ln( (1 − ε_i) / ε_i )

AdaBoost Algorithm

Weight update:

    w_i^(j+1) = ( w_i^(j) / Z_j ) × exp(−α_j)   if C_j(x_i) = y_i
    w_i^(j+1) = ( w_i^(j) / Z_j ) × exp(+α_j)   if C_j(x_i) ≠ y_i

    where Z_j is the normalization factor

If any intermediate rounds produce error rate higher than 50%, the weights are reverted back to 1/n and the resampling procedure is repeated

Classification:

    C*(x) = argmax_y Σ_{j=1..T} α_j δ( C_j(x) = y )
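A direct transcription of the three AdaBoost formulas — weighted error rate, importance α, and the normalized weight update. The toy labels and predictions below are made up for illustration and do not correspond to the boosting rounds in the example that follows.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    # weighted error rate of the base classifier on this round
    eps = sum(w for w, yt, yp in zip(weights, y_true, y_pred) if yt != yp)
    eps /= sum(weights)
    alpha = 0.5 * math.log((1 - eps) / eps)          # importance of the classifier
    # increase weights of misclassified records, decrease the rest, then normalize
    new_w = [w * math.exp(-alpha if yt == yp else alpha)
             for w, yt, yp in zip(weights, y_true, y_pred)]
    z = sum(new_w)                                   # normalization factor Z
    return eps, alpha, [w / z for w in new_w]

w0     = [0.1] * 10                                  # initial equal weights
y_true = [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]
y_pred = [1, 1, 1, -1, -1, -1, -1, -1, -1, -1]       # a hypothetical base classifier
print(adaboost_round(w0, y_true, y_pred))
```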


AdaBoost Algorithm


AdaBoost Example

Consider 1-dimensional data set:

x: 0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1
y: 1    1    1    -1   -1   -1   -1   1    1    1

Classifier is a decision stump
– Decision rule: x ≤ k versus x > k
– Split point k is chosen based on entropy

(Figure: the stump tests x ≤ k; the True branch predicts yleft, the False branch predicts yright)

AdaBoost Example

Training sets for the first 3 boosting rounds:

Boosting Round 1:
x: 0.1  0.4  0.5  0.6  0.6  0.7  0.7  0.7  0.8  1
y: 1    -1   -1   -1   -1   -1   -1   -1   1    1

Boosting Round 2:
x: 0.1  0.1  0.2  0.2  0.2  0.2  0.3  0.3  0.3  0.3
y: 1    1    1    1    1    1    1    1    1    1

Boosting Round 3:
x: 0.2  0.2  0.4  0.4  0.4  0.4  0.5  0.6  0.6  0.7
y: 1    1    -1   -1   -1   -1   -1   -1   -1   -1

Summary:

Round  Split Point  Left Class  Right Class  alpha
1      0.75         -1          1            1.738
2      0.05         1           1            2.7784
3      0.3          1           -1           4.1195

AdaBoost Example

Weights:

Round  x=0.1  x=0.2  x=0.3  x=0.4  x=0.5  x=0.6  x=0.7  x=0.8  x=0.9  x=1.0
1      0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1
2      0.311  0.311  0.311  0.01   0.01   0.01   0.01   0.01   0.01   0.01
3      0.029  0.029  0.029  0.228  0.228  0.228  0.228  0.009  0.009  0.009

Classification:

Round  x=0.1  x=0.2  x=0.3  x=0.4  x=0.5  x=0.6  x=0.7  x=0.8  x=0.9  x=1.0
1      -1     -1     -1     -1     -1     -1     -1     1      1      1
2      1      1      1      1      1      1      1      1      1      1
3      1      1      1      -1     -1     -1     -1     -1     -1     -1

Sum    5.16   5.16   5.16   -3.08  -3.08  -3.08  -3.08  0.397  0.397  0.397
Sign   1      1      1      -1     -1     -1     -1     1      1      1   (Predicted Class)

Classification: Alternative Techniques

Imbalanced Class Problem


Class Imbalance Problem

Lots of classification problems where the classes are skewed (more records from one class than another)
– Credit card fraud
– Intrusion detection
– Defective products in manufacturing assembly line

Challenges

Evaluation measures such as accuracy are not well-suited for imbalanced classes

Detecting the rare class is like finding a needle in a haystack

Confusion Matrix

Confusion Matrix:

                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   a           b
CLASS    Class=No    c           d

a: TP (true positive)

b: FN (false negative)

c: FP (false positive)

d: TN (true negative)


Accuracy

                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   a (TP)      b (FN)
CLASS    Class=No    c (FP)      d (TN)

Most widely-used metric:

    Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

Problem with Accuracy

Consider a 2-class problem
– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10

If a model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
– This is misleading because the model does not detect any class 1 example
– Detecting the rare class is usually more interesting (e.g., frauds, intrusions, defects, etc.)

Alternative Measures

                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   a (TP)      b (FN)
CLASS    Class=No    c (FP)      d (TN)

Precision (p) = a / (a + c)
Recall (r)    = a / (a + b)
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

(r) Recall

ROC (Receiver Operating Characteristic)

A graphical approach for displaying the trade-off between detection rate and false alarm rate
Developed in the 1950s for signal detection theory to analyze noisy signals
ROC curve plots TPR against FPR
– TPR = TP/(TP+FN), FPR = FP/(TN+FP)
– Performance of a model is represented as a point on an ROC curve
– Changing the threshold parameter of the classifier changes the location of the point

ROC Curve

(TPR, FPR):
(0,0): declare everything to be negative class
(1,1): declare everything to be positive class
(1,0): ideal

Diagonal line:
– Random guessing
– Below diagonal line: prediction is opposite of the true class

ROC (Receiver Operating Characteristic)

To draw an ROC curve, the classifier must produce continuous-valued output
– Outputs are used to rank test records, from the most likely positive class record to the least likely positive class record

Many classifiers produce only discrete outputs (i.e., predicted class)
– How to get continuous-valued outputs?
    Decision trees, rule-based classifiers, neural networks, Bayesian classifiers, k-nearest neighbors, SVM

Example: Decision Trees

Decision Tree

Continuous-valued outputs

ROC Curve Example


Using ROC for Model Comparison

No model consistently outperforms the other
– M1 is better for small FPR
– M2 is better for large FPR

Area Under the ROC curve
– Ideal: Area = 1
– Random guess: Area = 0.5

How to Construct an ROC curve

• Use a classifier that produces a continuous-valued output score(+|A) for each test instance
• Sort the instances according to score(+|A) in decreasing order
• Apply a threshold at each unique value of score(+|A)
• Count the number of TP, FP, TN, FN at each threshold
• TPR = TP/(TP+FN)
• FPR = FP/(FP+TN)

Instance  score(+|A)  True Class
1         0.95        +
2         0.93        +
3         0.87        -
4         0.85        -
5         0.85        -
6         0.85        +
7         0.76        -
8         0.53        +
9         0.43        -
10        0.25        +

How to construct an ROC curve

Class          +     -     +     -     -     -     +     -     +     +
Threshold >=   0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP             5     4     4     3     3     3     3     2     2     1     0
FP             5     5     4     4     3     2     1     1     0     0     0
TN             0     0     1     1     2     3     4     4     5     5     5
FN             0     1     1     2     2     2     2     3     3     4     5
TPR            1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR            1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0
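The table above can be generated mechanically by sweeping a threshold over the scores from the previous slide (predict + when score ≥ threshold); in this sketch the three tied scores at 0.85 collapse into a single threshold, and the variable names are my own.

```python
# Build ROC points: at each unique threshold, predict + when score >= threshold,
# then count TP and FP and compute TPR = TP/P and FPR = FP/N.
def roc_points(scores, labels):
    P = labels.count('+')
    N = labels.count('-')
    points = []
    for t in sorted(set(scores) | {1.00}):   # include 1.00 to get the (0, 0) point
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == '+')
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == '-')
        points.append((t, tp / P, fp / N))   # (threshold, TPR, FPR)
    return points

scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ['+', '+', '-', '-', '-', '+', '-', '+', '-', '+']
for t, tpr, fpr in roc_points(scores, labels):
    print(f"threshold >= {t:.2f}: TPR={tpr:.1f}, FPR={fpr:.1f}")
```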


ROC Curve:

Handling the Class Imbalance Problem

Class-based ordering (e.g., RIPPER)
– Rules for the rare class have higher priority

Cost-sensitive classification
– Misclassifying the rare class as the majority class is more expensive than misclassifying the majority as the rare class

Sampling-based approaches

Cost Matrix

Count matrix f(i, j):

                     PREDICTED CLASS
                     Class=Yes     Class=No
ACTUAL   Class=Yes   f(Yes, Yes)   f(Yes, No)
CLASS    Class=No    f(No, Yes)    f(No, No)

Cost Matrix C(i, j): Cost of misclassifying a class i example as class j

                     PREDICTED CLASS
                     Class=Yes     Class=No
ACTUAL   Class=Yes   C(Yes, Yes)   C(Yes, No)
CLASS    Class=No    C(No, Yes)    C(No, No)

Cost = Σ_{i,j} C(i, j) × f(i, j)

Computing Cost of Classification

Cost Matrix:
                 PREDICTED CLASS
C(i, j)          +      -
ACTUAL    +      -1     100
CLASS     -      1      0

Model M1:
                 PREDICTED CLASS
                 +      -
ACTUAL    +      150    40
CLASS     -      60     250

Accuracy = 80%, Cost = 3910

Model M2:
                 PREDICTED CLASS
                 +      -
ACTUAL    +      250    45
CLASS     -      5      200

Accuracy = 90%, Cost = 4255
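The two cost figures can be verified by summing C(i, j) × f(i, j) over the four cells; the function name below is my own.

```python
# Cost = sum over cells of C(i, j) * f(i, j); rows = actual class, cols = predicted,
# both ordered (+, -) as in the matrices above.
def total_cost(cost_matrix, counts):
    return sum(cost_matrix[i][j] * counts[i][j] for i in range(2) for j in range(2))

C  = [[-1, 100], [1, 0]]
M1 = [[150, 40], [60, 250]]
M2 = [[250, 45], [5, 200]]
print(total_cost(C, M1), total_cost(C, M2))   # 3910 4255
```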


Cost Sensitive Classification

Example: Bayesian classifier
– Given a test record x:
    Compute p(i|x) for each class i
    Decision rule: classify the record as class k if k = argmax_i p(i|x)
– For 2-class, classify x as + if p(+|x) > p(-|x)
    This decision rule implicitly assumes that C(+|+) = C(-|-) = 0 and C(+|-) = C(-|+)

Cost Sensitive Classification

General decision rule:
– Classify test record x as the class k that minimizes the expected cost:

    k = argmin_j Σ_i p(i|x) × C(i, j)

2-class:
– Cost(+) = p(+|x) C(+,+) + p(-|x) C(-,+)
– Cost(-) = p(+|x) C(+,-) + p(-|x) C(-,-)
– Decision rule: classify x as + if Cost(+) < Cost(-)

    if C(+,+) = C(-,-) = 0:  classify x as + if p(+|x) > C(-,+) / ( C(-,+) + C(+,-) )

Sampling-based Approaches

Modify the distribution of the training data so that the rare class is well-represented in the training set
– Undersample the majority class
– Oversample the rare class

Advantages and disadvantages