
Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree. Wei Fan, Kun Zhang, Hong Cheng, Jing Gao, Xifeng Yan, Jiawei Han, Philip S. Yu, Olivier Verscheure.

Transcript
Page 1:

Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree

Wei Fan, Kun Zhang, Hong Cheng,

Jing Gao, Xifeng Yan, Jiawei Han,

Philip S. Yu, Olivier Verscheure

How to find good features from semi-structured raw data for classification

Page 2:

Feature Construction

Most data mining and machine learning models assume structured data of the form (x1, x2, ..., xk) -> y, where the xi's are independent variables and y is the dependent variable.

If y is drawn from a discrete set, the task is classification; if y is drawn from a continuous range, it is regression.

When the feature vectors are good, the differences in accuracy among different learners are small.

Question: where do good features come from?

Page 3:

Frequent Pattern-Based Feature Extraction

Data that are not given as pre-defined feature vectors:

Transactions

Biological sequences

Graph databases

Frequent patterns are good candidates for discriminative features. So, how do we mine them (see the sketch below)?
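To make "frequent pattern" concrete for the transaction case, here is a minimal brute-force itemset-counting sketch in Python (illustrative only, not the mining algorithm used in the paper; frequent_itemsets, max_size, and the toy transactions T are hypothetical names introduced for this example):

from itertools import combinations
from collections import Counter

def frequent_itemsets(transactions, min_support, max_size=3):
    # Count every itemset up to max_size and keep those whose support
    # (the fraction of transactions containing them) reaches min_support.
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[frozenset(combo)] += 1
    return {p: c / n for p, c in counts.items() if c / n >= min_support}

# toy transaction database
T = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
print(frequent_itemsets(T, min_support=0.6))   # six itemsets reach 60% support

Real miners (Apriori, FP-growth, gSpan for graphs) prune this enumeration, but the notion of support is the same.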

Page 4:

FP: Sub-graph

[Figure: a discovered sub-graph pattern highlighted within the molecular structures of several compounds, including NSC 4960, NSC 191370, NSC 40773, NSC 164863, and NSC 699181.]

(Example borrowed from George Karypis's presentation.)

Page 5:

Frequent Pattern Feature Vector Representation

        P1  P2  P3  ...
Data1    1   1   0
Data2    1   0   1
Data3    1   1   0
Data4    0   0   1
...

[Figure: an example decision tree on the Iris data, splitting on Petal.Length < 2.45 and Petal.Width < 1.75 to separate setosa, versicolor, and virginica.]

Any classifier you can name: NN, DT, SVM, LR.

Mining these predictive features is an NP-hard problem.

100 examples can generate up to 10^10 patterns.

Most are useless.
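A sketch of the representation step itself (hedged: the helper, patterns, and data are hypothetical, and scikit-learn is used only as an example of "any classifier"):

from sklearn.tree import DecisionTreeClassifier

def to_feature_vectors(transactions, patterns):
    # Binary indicator features: 1 if the example contains the pattern, else 0.
    return [[1 if p <= t else 0 for p in patterns] for t in transactions]

# hypothetical mined patterns and labeled transactions
patterns = [frozenset({"a"}), frozenset({"a", "b"}), frozenset({"c"})]
T = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
y = [1, 1, 0, 0]

X = to_feature_vectors(T, patterns)            # rows like the P1/P2/P3 table above
clf = DecisionTreeClassifier().fit(X, y)       # or NN, SVM, LR, ...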

Page 6:

Example: 192 examples.

At 12% support (at least 12% of the examples contain the pattern), 8,600 itemset patterns are returned. 192 examples vs. 8,600 patterns?

At 4% support, 92,000 patterns. 192 vs. 92,000??

Most patterns have no predictive power and cannot be used to construct features.

Our algorithm finds only 20 highly predictive patterns, which suffice to construct a decision tree with about 90% accuracy.
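The explosion is easy to see with a generic combinatorial bound (not a figure from the paper): if $k$ single items are frequent, then up to

$\binom{k}{1} + \binom{k}{2} + \dots + \binom{k}{k} = 2^k - 1$

itemsets are candidates, so $k = 20$ already allows about $10^6$ patterns, and the count keeps growing as the support threshold is lowered.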

Page 7:

Data in a "bad" feature space

Discriminative patterns: a non-linear combination of single feature(s) that increases the expressive and discriminative power of the feature space.

An example:

X   Y   C
0   0   0
1   1   1
-1  1   1
1  -1   1
-1 -1   1

The data is non-linearly separable in (x, y).

[Figure: the five points plotted in the (x, y) plane; the class-0 point at the origin cannot be separated from the four class-1 points by any line.]

Page 8:

New Feature Space

Data is linearly separable in (x, y, F).

Mine & Transform: solve the problem by mapping the data to a different space.

Original space:

X   Y   C
0   0   0
1   1   1
-1  1   1
1  -1   1
-1 -1   1

Transformed space, with the new pattern feature F (x=0, y=0):

X   Y   F   C
0   0   1   0
1   1   0   1
-1  1   0   1
1  -1   0   1
-1 -1   0   1

ItemSet: F: x=0, y=0. Association rule: F: x=0 -> y=0.

[Figure: with the extra feature F, the single class-0 point is separated from the four class-1 points by a linear threshold on F.]
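To make the mapping concrete, a small sketch (variable names assumed for this example, not taken from the paper) that adds the pattern feature F = (x = 0 AND y = 0) and checks that a single linear rule now separates the classes:

# (x, y, class) rows from the table above
data = [(0, 0, 0), (1, 1, 1), (-1, 1, 1), (1, -1, 1), (-1, -1, 1)]

# In (x, y) no line separates class 0 from class 1: the class-0 point (0, 0)
# lies inside the convex hull of the four class-1 points.

# Add the mined pattern feature F = 1 iff x == 0 and y == 0.
rows = [(x, y, int(x == 0 and y == 0), c) for x, y, c in data]

# In (x, y, F) the linear rule "class 0 iff F > 0.5" classifies every row correctly.
predictions = [0 if f > 0.5 else 1 for _, _, f, _ in rows]
print(predictions == [c for *_, c in rows])    # True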

Page 9:

Computational Issues

A pattern is measured by its "frequency" or support, e.g. frequent subgraphs with sup ≥ 10%, meaning that at least 10% of the examples contain the pattern.

"Ordered" enumeration: one cannot enumerate the patterns with sup = 10% without first enumerating all patterns with support > 10%.

An NP-hard problem: easily up to 10^10 patterns for a realistic problem. Most patterns are non-discriminative, yet low-support patterns can have high "discriminative power". Bad!

Most patterns are useless. Randomly sampling patterns (or blindly enumerating them without considering frequency) does not work, since it is not exhaustive.

Mining on only a small number of examples: with a subset of the vocabulary the search is incomplete; with the complete vocabulary it does not help much but introduces a sample selection bias problem, in particular missing low-support but high-information-gain patterns.
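A small illustration of the last two points, using the C1/C0 and P1/P0 counts that reappear on Page 15 (hypothetical helper names; this is the standard information gain of a binary pattern feature, written for this example):

from math import log2

def entropy(pos, neg):
    total = pos + neg
    h = 0.0
    for n in (pos, neg):
        if n:
            p = n / total
            h -= p * log2(p)
    return h

def info_gain(c1, c0, p1, p0):
    # c1, c0: examples in class 1 / class 0; p1, p0: how many of them contain the pattern.
    total = c1 + c0
    covered, rest = p1 + p0, total - (p1 + p0)
    h_cond = (covered / total) * entropy(p1, p0) + (rest / total) * entropy(c1 - p1, c0 - p0)
    return entropy(c1, c0) - h_cond

# A 5%-support pattern occurring only in class 1 still carries information ...
print(info_gain(c1=100, c0=100, p1=10, p0=0))    # > 0
# ... while a 50%-support pattern spread evenly over both classes carries none.
print(info_gain(c1=100, c0=100, p1=50, p0=50))   # 0.0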

Page 10:

Conventional Procedure: Feature Construction and Selection (Two-Step Batch Method)

1. Mine frequent patterns (support > min_sup).

2. Select the most discriminative patterns.

3. Represent the data in the feature space using such patterns.

4. Build classification models.

[Figure: DataSet -> mine -> Frequent Patterns (1, 2, 3, 4, 5, 6, 7, ...) -> select -> Mined Discriminative Patterns (1, 2, 4) -> represent -> feature table below -> any classifier you can name (NN, DT, SVM, LR), e.g. a decision tree on the Iris data.]

        F1  F2  F4
Data1    1   1   0
Data2    1   0   1
Data3    1   1   0
Data4    0   0   1
...
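Stringing together the hypothetical helpers from the earlier sketches (frequent_itemsets, info_gain, to_feature_vectors, DecisionTreeClassifier), the two-step batch baseline might look like this; it is a sketch of the conventional procedure described above, not of the paper's method:

def two_step_batch(transactions, labels, min_sup, k):
    # Step 1: mine all frequent patterns above the global support threshold.
    patterns = list(frequent_itemsets(transactions, min_sup))
    c1 = sum(labels)                       # assumes 0/1 labels
    c0 = len(labels) - c1
    # Step 2: rank every pattern by InfoGain on the *complete* dataset, keep the top k.
    def gain(p):
        p1 = sum(1 for t, y in zip(transactions, labels) if y == 1 and p <= t)
        p0 = sum(1 for t, y in zip(transactions, labels) if y == 0 and p <= t)
        return info_gain(c1, c0, p1, p0)
    selected = sorted(patterns, key=gain, reverse=True)[:k]
    # Steps 3-4: represent the data with the selected patterns, then train any classifier.
    X = to_feature_vectors(transactions, selected)
    return DecisionTreeClassifier().fit(X, labels), selected

The next two slides point out why both steps are problematic.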

Page 11:

Two Problems

Mine step: combinatorial explosion.

[Figure: DataSet -> mine -> Frequent Patterns (1, 2, 3, 4, 5, 6, 7, ...)]

1. Exponential explosion.

2. Patterns are not considered if the minimum support is not small enough.

Page 12:

Two Problems

Select step: the issue of discriminative power.

[Figure: Frequent Patterns (1, 2, 3, 4, 5, 6, 7, ...) -> select -> Mined Discriminative Patterns (1, 2, 4)]

3. InfoGain is computed against the complete dataset, NOT on subsets of examples.

4. Correlation among patterns is not directly evaluated on their joint predictability.

Page 13:

Direct Mining & Selection via Model-based Search Tree: Basic Flow

Mined Discriminative Patterns: a compact set of highly discriminative patterns (1, 2, 3, 4, 5, 6, 7, ...).

Divide-and-Conquer Based Frequent Pattern Mining:

[Figure: search tree over the dataset. At the root node (1), "Mine & Select, P: 20%" is run on the whole dataset and the most discriminative feature F is chosen based on IG; the data are split on F into a Y branch and an N branch. Each child node (2, 3, 4, 5, 6, 7, ...) repeats "Mine & Select, P: 20%" on its own subset of examples, and a node with few data becomes a leaf (+ or -).]

The tree is both a Feature Miner and a Classifier.

Global Support: 10 * 20% / 10000 = 0.02%.
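A minimal sketch of this divide-and-conquer flow, reusing the hypothetical frequent_itemsets and info_gain helpers from the earlier sketches; it keeps only the core idea (local mining and selection at every node) and omits the paper's bounds, pruning, and graph/sequence miners:

def mbt(transactions, labels, local_sup=0.20, min_node_size=5):
    # Model-based search tree: "mine & select" at every node, split on the winner, recurse.
    c1 = sum(labels)                               # assumes 0/1 labels
    c0 = len(labels) - c1
    if len(labels) < min_node_size or c1 == 0 or c0 == 0:
        return {"leaf": int(c1 >= c0)}             # few data or pure node

    # Mine frequent patterns locally: support is measured on this node's examples only,
    # so a 20% local support can correspond to a tiny global support.
    candidates = frequent_itemsets(transactions, local_sup)
    if not candidates:
        return {"leaf": int(c1 >= c0)}

    def gain(p):
        p1 = sum(1 for t, y in zip(transactions, labels) if y == 1 and p <= t)
        p0 = sum(1 for t, y in zip(transactions, labels) if y == 0 and p <= t)
        return info_gain(c1, c0, p1, p0)

    best = max(candidates, key=gain)               # most discriminative F based on IG
    has = [best <= t for t in transactions]
    if all(has) or not any(has):                   # the split would not separate anything
        return {"leaf": int(c1 >= c0)}

    yes = [i for i, h in enumerate(has) if h]
    no = [i for i, h in enumerate(has) if not h]
    return {"pattern": best,
            "Y": mbt([transactions[i] for i in yes], [labels[i] for i in yes],
                     local_sup, min_node_size),
            "N": mbt([transactions[i] for i in no], [labels[i] for i in no],
                     local_sup, min_node_size)}

The returned tree is simultaneously the set of mined features (the "pattern" entries) and a classifier (follow the Y/N branches to a leaf).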

Page 14:

Analyses (I)

1. Scalability (Theorem 1): an upper bound, and a "scale-down" ratio to obtain extremely low-support patterns.

2. Bound on the number of returned features (Theorem 2).
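The scale-down effect can be read off the flow example on Page 13: applying a local support threshold $p$ at a node that holds $n$ of the $N$ training examples corresponds to a global support of roughly

$\dfrac{n \cdot p}{N}$, e.g. $\dfrac{10 \times 20\%}{10000} = 0.02\%$,

which is how extremely low-support patterns become reachable (this restates the slide's arithmetic, not the formal statement of Theorem 1).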

Page 15:

Analyses (II)

3. Subspace is important for discriminative patterns.

Original set: no information gain when the pattern covers the same fraction of both classes, where C1 and C0 are the numbers of examples belonging to class 1 and class 0, P1 is the number of examples in C1 that contain a pattern α, and P0 is the number of examples in C0 that contain the same pattern α.

Subsets could still have information gain.

4. Non-overfitting.

5. Optimality under exhaustive search.
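Written out with the quantities above (stated here as the standard zero-gain condition; the slide's own formula is not reproduced in this transcript): the information gain of pattern α on the full set is zero exactly when the pattern is distributed class-independently,

$\dfrac{P_1}{C_1} = \dfrac{P_0}{C_0}$,

yet on a subset of the examples (a node of the search tree) the same pattern can have strictly positive gain.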

Page 16:

Experimental Studies: Itemset Mining (I)

Scalability Comparison

[Figure: bar charts on Adult, Chess, Hypo, Sick, Sonar comparing Log(DT #Pat) vs. Log(MbT #Pat) and Log(DT Abs Support) vs. Log(MbT Abs Support).]

Datasets   #Pat using MbT sup   Ratio (MbT #Pat / #Pat using MbT sup)
Adult      252809               0.41%
Chess      +∞                   ~0%
Hypo       423439               0.0035%
Sick       4818391              0.00032%
Sonar      95507                0.00775%

[Figure: the model-based search tree flow diagram from Page 13 (Mine & Select, P: 20% at each node; Global Support: 10*20%/10000 = 0.02%) shown again as an inset.]

Page 17:

Experimental Studies: Itemset Mining (II)

Accuracy of Mined Itemsets

[Figure: bar chart of DT Accuracy vs. MbT Accuracy (70%-100%) on Adult, Chess, Hypo, Sick, Sonar: 4 wins, 1 loss.]

[Figure: bar chart of Log(DT #Pat) vs. Log(MbT #Pat) on the same datasets: MbT uses a much smaller number of patterns.]

Page 18:

Experimental Studies: Itemset Mining (III)

Convergence

Page 19:

Experimental Studies: Graph Mining (I)

9 NCI anti-cancer screen datasets (The PubChem Project, URL: pubchem.ncbi.nlm.nih.gov); the active (positive) class is around 1% - 8.3% of each dataset.

2 AIDS anti-viral screen datasets (URL: http://dtp.nci.nih.gov): H1: CM+CA, 3.5% active; H2: CA, 1% active.


Page 20:

Experimental Studies: Graph Mining (II)

Scalability

[Figure: bar charts on NCI1, NCI33, NCI41, NCI47, NCI81, NCI83, NCI109, NCI123, NCI145, H1, H2 comparing DT #Pat vs. MbT #Pat and Log(DT Abs Support) vs. Log(MbT Abs Support).]

[Figure: the model-based search tree flow diagram (Mine & Select, P: 20% at each node) shown again as an inset.]

Page 21:

Experimental Studies: Graph Mining (III)

AUC and Accuracy

[Figure: bar charts of AUC (0.5-0.8) and Accuracy (0.88-1.0) for DT vs. MbT on NCI1, NCI33, NCI41, NCI47, NCI81, NCI83, NCI109, NCI123, NCI145, H1, H2: 11 wins on one measure, 10 wins and 1 loss on the other.]

Page 22:

Experimental Studies: Graph Mining (IV)

AUC of MbT and DT: MbT vs. benchmarks

7 wins, 4 losses.

Page 23:

Summary

Model-based Search Tree:

Integrated feature mining and construction.

Dynamic support: can mine patterns with extremely small support.

Both a feature constructor and a classifier.

Not limited to one type of frequent pattern: plug-and-play.

Experiment results: itemset mining and graph mining.

Software and datasets available from: www.cs.columbia.edu/~wfan
