Page 1: Frequent Closed Pattern Search By Row and Feature Enumeration.

Frequent Closed Pattern Search By Row and Feature Enumeration

Page 2: Frequent Closed Pattern Search By Row and Feature Enumeration.

Outline

Problem Definition

Related Work: Feature Enumeration Algorithms

CARPENTER: Row Enumeration Algorithm

COBBLER: Combined Enumeration Algorithm

Page 3: Frequent Closed Pattern Search By Row and Feature Enumeration.

Problem Definition

Frequent Closed Pattern:

1) frequent pattern: its support value is at least as high as the threshold

2) closed pattern: no proper superset has the same support value

Problem Definition: Given a dataset D whose records consist of features, the problem is to discover all frequent closed patterns with respect to a user-defined support threshold.
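As a concrete illustration of the definition only (not of the algorithms discussed later), here is a minimal brute-force Python sketch that enumerates every itemset of a tiny dataset and keeps the frequent closed ones; the function name and data representation are chosen for this example.

from itertools import combinations

def closed_frequent_patterns(rows, minsup):
    """Brute-force illustration of the definition: enumerate every itemset,
    keep those with support >= minsup and no proper superset of equal support."""
    items = sorted({f for row in rows for f in row})

    # support of an itemset = number of rows containing all of its features
    def support(itemset):
        return sum(1 for row in rows if itemset <= row)

    frequent = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = support(frozenset(combo))
            if s >= minsup:
                frequent[frozenset(combo)] = s

    # a frequent itemset is closed if no proper superset has the same support
    return {p: s for p, s in frequent.items()
            if not any(p < q and s == t for q, t in frequent.items())}

# Example table used later in the CARPENTER slides: r1..r4 over features a..d
rows = [set("abc"), set("bcd"), set("bcd"), set("d")]
print(closed_frequent_patterns(rows, minsup=2))
# -> {b, c}: 3, {d}: 3, {b, c, d}: 2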

Page 4: Frequent Closed Pattern Search By Row and Feature Enumeration.

Related Work

Searching Strategy: breadth-first & depth-first search

Data Format: horizontal format & vertical format

Data Compression Method: diffset, FP-tree, etc.

Page 5: Frequent Closed Pattern Search By Row and Feature Enumeration.

Typical Algorithms

APRIORI: feature enumeration, horizontal format, breadth-first search

CHARM: feature enumeration, vertical format, depth-first search, diffset technique

CLOSET: feature enumeration, horizontal format, depth-first search, FP-tree technique

Page 6: Frequent Closed Pattern Search By Row and Feature Enumeration.

CARPENTER

CARPENTER stands for Closed Pattern Discovery by Transposing Tables that are Extremely Long.

Motivation

Algorithm

Prune Method

Experiment

Page 7: Frequent Closed Pattern Search By Row and Feature Enumeration.

Motivation

Bioinformatic datasets typically contain a large number of features and a small number of rows.

The running time of most previous algorithms increases exponentially with the average length of the transactions.

CARPENTER's search space is much smaller than that of previous algorithms on this kind of dataset, and it therefore performs better.

Page 8: Frequent Closed Pattern Search By Row and Feature Enumeration.

Algorithm

The main idea of CARPENTER is to mine the dataset row-wise.

Two steps: first, transpose the dataset; second, search the row enumeration tree.

Page 9: Frequent Closed Pattern Search By Row and Feature Enumeration.

Transpose Table

Features: a, b, c, d. Rows: r1, r2, r3, r4.

Original table:

r1: a b c
r2: b c d
r3: b c d
r4: d

Transposed table:

a: r1
b: r1 r2 r3
c: r1 r2 r3
d: r2 r3 r4

Projected table on (r2 r3):

b: (empty)
c: (empty)
d: r4
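A minimal Python sketch of the transposition and projection steps shown above; the helper names (transpose, project) and the dictionary representation are mine, not from the CARPENTER paper.

from collections import defaultdict

def transpose(table):
    """Turn a row -> features table into a feature -> supporting rows table."""
    tt = defaultdict(list)
    for rid, feats in table.items():
        for f in feats:
            tt[f].append(rid)
    return dict(tt)

def project(tt, on_rows, order):
    """Projection as used in the example: keep features containing every row
    in `on_rows`, and for each keep only the rows that come after them in the
    enumeration order."""
    last = max(order.index(r) for r in on_rows)
    later = set(order[last + 1:])
    return {f: [r for r in rows if r in later]
            for f, rows in tt.items() if set(on_rows) <= set(rows)}

table = {"r1": "abc", "r2": "bcd", "r3": "bcd", "r4": "d"}
tt = transpose(table)          # {'a': ['r1'], 'b': ['r1', 'r2', 'r3'], ...}
print(project(tt, ["r2", "r3"], ["r1", "r2", "r3", "r4"]))
# -> {'b': [], 'c': [], 'd': ['r4']}  (the projected table on (r2 r3))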

Page 10: Frequent Closed Pattern Search By Row and Feature Enumeration.

Row Enumeration Tree

According to the transposed table, we build the row enumeration tree, which enumerates row ids in a pre-defined order.

We do a depth-first search in the row enumeration tree without any pruning strategies.

Row enumeration tree for the example (each node shows the chosen rows and the intersection of their features):

{ }
  r1 {abc}
    r1 r2 {bc}
      r1 r2 r3 {bc}
        r1 r2 r3 r4 { }
      r1 r2 r4 { }
    r1 r3 {bc}
      r1 r3 r4 { }
    r1 r4 { }
  r2 {bcd}
    r2 r3 {bcd}
      r2 r3 r4 {d}
    r2 r4 {d}
  r3 {bcd}
    r3 r4 {d}
  r4 {d}

With minsup = 2, the frequent closed patterns found are:

bc: r1 r2 r3
bcd: r2 r3
d: r2 r3 r4
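Below is a small Python sketch of the unpruned depth-first row enumeration described above. For clarity it works directly on the original row-to-feature table rather than on the transposed and projected tables CARPENTER actually maintains; names are illustrative.

def row_enumeration(table, minsup):
    """Depth-first search over the row enumeration tree (no pruning).
    At a node for row set X the candidate pattern is the intersection of the
    features of the rows in X; every such pattern is closed, and the largest
    X producing a given pattern is its supporting row set."""
    order = sorted(table)                  # pre-defined row order r1 < r2 < ...
    found = {}                             # pattern -> supporting rows

    def dfs(row_set, pattern, start):
        if len(row_set) >= minsup and pattern:
            key = frozenset(pattern)
            if key not in found or len(row_set) > len(found[key]):
                found[key] = tuple(row_set)
        for i in range(start, len(order)):
            rid = order[i]
            feats = set(table[rid])
            new_pattern = feats if not row_set else pattern & feats
            if new_pattern:                # an empty intersection leads nowhere
                dfs(row_set + [rid], new_pattern, i + 1)

    dfs([], set(), 0)
    return found

table = {"r1": "abc", "r2": "bcd", "r3": "bcd", "r4": "d"}
print(row_enumeration(table, minsup=2))
# -> {b, c}: (r1, r2, r3), {b, c, d}: (r2, r3), {d}: (r2, r3, r4)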

Page 11: Frequent Closed Pattern Search By Row and Feature Enumeration.

Prune Method 1

In the enumeration tree, the depth of a node equals the support of the node's candidate pattern.

Prune a branch if it cannot reach enough depth, i.e., if the support of the patterns found in that branch cannot reach the minimum support.

Example with minsup = 4: node r2 {bcd} has depth 1 (sup = 1) and 2 sub-nodes, r2 r3 {bcd} and r2 r4 {d}. The maximum support value reachable in branch "r2" is therefore 3, so this branch is pruned.
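A hedged sketch of the check behind Prune Method 1, assuming the projected table is kept as a feature-to-rows dictionary as in the earlier sketches:

def prune_1(depth, projected_table, minsup):
    """Prune Method 1 (sketch): the deepest node below the current one can add
    at most the distinct rows still present in the projected table, so the
    reachable support is bounded by depth + that count."""
    remaining = {r for rows in projected_table.values() for r in rows}
    return depth + len(remaining) < minsup   # True -> prune this branch

# Node "r2" from the slide: depth 1, projected table b->{r3}, c->{r3}, d->{r3, r4}
print(prune_1(1, {"b": ["r3"], "c": ["r3"], "d": ["r3", "r4"]}, minsup=4))
# -> True: at most support 3 is reachable, so the branch is pruned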

Page 12: Frequent Closed Pattern Search By Row and Feature Enumeration.

Prune Method 2

If rj has 100% support in the projected table of ri, prune the branch of rj.

Projected table of "r2":

b: r3
c: r3
d: r3 r4

In the branch r2 {bcd}, with sub-nodes r2 r3 {bcd} and r2 r4 {d}, the row r3 has 100% support in the projected table of "r2". Therefore the branch "r2 r3" is pruned and the whole branch is reconstructed as r2 r3 {bcd} -> r2 r3 r4 {d}, with projected table:

b: (empty)
c: (empty)
d: r4
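A small sketch of the detection step behind Prune Method 2, under the same assumed feature-to-rows representation; the function name is mine:

def rows_with_full_support(projected_table):
    """Prune Method 2 (sketch): rows that occur in every feature of the
    projected table have 100% support there, so their separate branches can
    be pruned and the branch rebuilt from the merged row set."""
    lists = [set(rows) for rows in projected_table.values()]
    return set.intersection(*lists) if lists else set()

# Projected table of node "r2": r3 occurs in b, c and d, so branch "r2 r3"
# is pruned and the sub-tree is reconstructed from "r2 r3".
print(rows_with_full_support({"b": ["r3"], "c": ["r3"], "d": ["r3", "r4"]}))
# -> {'r3'}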

Page 13: Frequent Closed Pattern Search By Row and Feature Enumeration.

Prune Method 3

At any node in the enumeration tree, if the corresponding itemset of the node has been found before, we prune the branch rooted at this node.

Example: the branch r2 {bcd}, with sub-nodes r2 r3 {bcd} and r2 r4 {d}, has already produced the itemset {bcd}. When node r3 {bcd} (with sub-node r3 r4 {d}) is reached later, its itemset {bcd} has been found before, so the branch rooted at "r3" is pruned.
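A minimal sketch of Prune Method 3, assuming already-found itemsets are kept in a set of frozensets:

def should_prune_node(pattern, found_patterns):
    """Prune Method 3 (sketch): if the node's itemset was already produced
    elsewhere in the tree, its sub-tree can only repeat earlier results."""
    return frozenset(pattern) in found_patterns

found = {frozenset("bcd")}                    # {b, c, d} was discovered under "r2"
print(should_prune_node(set("bcd"), found))   # -> True: prune the branch rooted at "r3"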

Page 14: Frequent Closed Pattern Search By Row and Feature Enumeration.

Performance

We compare 3 algorithms: CARPENTER, CHARM and CLOSET.

The dataset (Lung Cancer) has 181 rows and 12533 features.

We vary 3 parameters: minsup, Length Ratio and Row Ratio.

Page 15: Frequent Closed Pattern Search By Row and Feature Enumeration.

minsup

Lung Cancer, 181 rows, length ratio 0.6, row ratio 1.

The running time of CARPENTER changes from 3 to 14 seconds.

[Figure: runtime (sec., log scale) vs. minsup (4 to 10) for CARPENTER, CHARM and CLOSET.]

Page 16: Frequent Closed Pattern Search By Row and Feature Enumeration.

Length Ratio

Lung Cancer, 181 rows, sup 7 (4%), row ratio 1.

The running time of CARPENTER changes from 3 to 33 seconds.

[Figure: runtime (sec., log scale) vs. length ratio (0.3 to 1) for CARPENTER, CHARM and CLOSET.]

Page 17: Frequent Closed Pattern Search By Row and Feature Enumeration.

Row Ratio

Lung Cancer, 181 rows, length ratio 0.6, sup 7 (4%).

The running time of CARPENTER changes from 9 to 178 seconds.

[Figure: runtime (sec., log scale) vs. row ratio for CARPENTER, CHARM and CLOSET.]

Page 18: Frequent Closed Pattern Search By Row and Feature Enumeration.

Conclusion

We propose an algorithm called CARPENTER for finding closed patterns in long biological datasets.

CARPENTER performs row enumeration instead of column enumeration, since the number of rows in such datasets is significantly smaller than the number of features.

Performance studies show that CARPENTER is much more efficient in finding closed patterns than existing feature enumeration algorithms.

Page 19: Frequent Closed Pattern Search By Row and Feature Enumeration.

COBBLER

Motivation

Algorithm

Performance

Page 20: Frequent Closed Pattern Search By Row and Feature Enumeration.

Motivation

With the development of CARPENTER, existing algorithms can be separated into two groups:

Feature enumeration: CHARM, CLOSET, etc.
Row enumeration: CARPENTER

We have two motivations to combine these two enumeration methods.

Page 21: Frequent Closed Pattern Search By Row and Feature Enumeration.

Motivation

1. The two enumeration methods have their own advantages on different types of datasets. Given a dataset, the characteristics of its sub-datasets (obtained by projection) may change, e.g., from more rows than features to more features than rows.

2. Given a dataset with both a large number of rows and a large number of features, a single row enumeration algorithm or a single feature enumeration algorithm cannot handle the dataset well.

Page 22: Frequent Closed Pattern Search By Row and Feature Enumeration.

Algorithm

There are two main points in the COBBLER algorithm:

How to build an enumeration tree for COBBLER.

How to decide when the algorithm should switch from one enumeration to the other.

Therefore, we will introduce the idea of the dynamic enumeration tree and the switching condition.

Page 23: Frequent Closed Pattern Search By Row and Feature Enumeration.

Dynamic Enumeration Tree

We call the new kind of enumeration tree used in COBBLER the dynamic enumeration tree. In a dynamic enumeration tree, different sub-trees may use different enumeration methods.

We use the following table as an example in later discussion.

Original table:

r1: a b c
r2: a c d
r3: b c
r4: d

Transposed table:

a: r1 r2
b: r1 r3
c: r1 r2 r3
d: r2 r4

Page 24: Frequent Closed Pattern Search By Row and Feature Enumeration.

Single Enumeration Tree

Feature enumeration tree and row enumeration tree for the example table (r1: a b c, r2: a c d, r3: b c, r4: d).

Feature enumeration tree (each node shows a feature set and its supporting rows):

{ }
  a {r1 r2}
    ab {r1}
      abc {r1}
        abcd { }
      abd { }
    ac {r1 r2}
      acd {r2}
    ad {r2}
  b {r1 r3}
    bc {r1 r3}
      bcd { }
    bd { }
  c {r1 r2 r3}
    cd {r2}
  d {r2 r4}

Row enumeration tree (each node shows a row set and the intersection of its features):

{ }
  r1 {abc}
    r1 r2 {ac}
      r1 r2 r3 {c}
        r1 r2 r3 r4 { }
      r1 r2 r4 { }
    r1 r3 {bc}
      r1 r3 r4 { }
    r1 r4 { }
  r2 {acd}
    r2 r3 {c}
      r2 r3 r4 { }
    r2 r4 {d}
  r3 {bc}
    r3 r4 { }
  r4 {d}
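For symmetry with the row enumeration sketch given earlier, here is a minimal Python sketch of plain depth-first feature enumeration over the COBBLER example table; the names and representation are illustrative, and closedness is not checked inside the search.

def feature_enumeration(table, minsup):
    """Depth-first feature enumeration (no pruning): extend feature sets in a
    fixed feature order; each node keeps the rows supporting the set."""
    features = sorted({f for feats in table.values() for f in feats})
    found = {}

    def dfs(feat_set, support_rows, start):
        if len(support_rows) >= minsup and feat_set:
            # record the node; closedness can be checked afterwards
            found[frozenset(feat_set)] = tuple(support_rows)
        for i in range(start, len(features)):
            f = features[i]
            rows = [r for r in support_rows if f in table[r]]
            if rows:
                dfs(feat_set + [f], rows, i + 1)

    dfs([], list(table), 0)
    return found

table = {"r1": "abc", "r2": "acd", "r3": "bc", "r4": "d"}
print(feature_enumeration(table, minsup=1))
# nodes match the feature enumeration tree above, e.g. {a, c} -> (r1, r2)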

Page 25: Frequent Closed Pattern Search By Row and Feature Enumeration.

Dynamic Enumeration Tree

Feature enumeration to Row enumeration.

Starting with feature enumeration from the root, the first level contains a {r1 r2}, b {r1 r3}, c {r1 r2 r3} and d {r2 r4}.

At node a {r1 r2}, instead of continuing with feature enumeration (which would explore ab {r1}, ac {r1 r2}, ad {r2}, abc {r1}, abd { }, acd {r2}, abcd { }), the algorithm switches to row enumeration on the projected table of a:

r1: b c
r2: c d

(transposed: b: r1, c: r1 r2, d: r2)

The row enumeration sub-tree under a contains r1 {bc}, r2 {cd} and r1 r2 {c}, and yields the closed patterns abc: {r1}, ac: {r1 r2} and acd: {r2}. The other first-level nodes switch in the same way (e.g., under b: r1 {c}, r3 {c}, r1 r3 {c}; under c: r2 {d}).

Page 26: Frequent Closed Pattern Search By Row and Feature Enumeration.

Dynamic Enumeration Tree

Row enumeration to Feature enumeration.

Starting with row enumeration from the root, the first level contains r1 {abc}, r2 {acd}, r3 {bc} and r4 {d}.

At node r1 {abc}, instead of continuing with row enumeration (which would explore r1 r2 {ac}, r1 r3 {bc}, r1 r4 { }, r1 r2 r3 {c}, r1 r2 r4 { }, r1 r3 r4 { }, r1 r2 r3 r4 { }), the algorithm switches to feature enumeration over r1's features: a {r2}, b {r3}, c {r2 r3}, ab { }, ac {r2}, bc {r3}. This sub-tree yields the closed patterns ac: {r1 r2}, bc: {r1 r3} and c: {r1 r2 r3}.

The other first-level nodes switch in the same way (e.g., under r2 {acd}: a {r1}, c {r1 r3}, d {r4}, ac {r1}, ad { }, acd { }, cd { }; under r3 {bc}: b {r1}, c {r1 r2}, bc {r1}).

Page 27: Frequent Closed Pattern Search By Row and Feature Enumeration.

Dynamic Enumeration Tree

When we use different conditions to decide the switching, the structure of the dynamic enumeration tree will change.

No matter how it switches, the resulting set of closed patterns is the same as the result of a single enumeration.

Page 28: Frequent Closed Pattern Search By Row and Feature Enumeration.

Switching Condition

The main idea of the switching condition is to estimate the processing time of an enumeration sub-tree, i.e., a row enumeration sub-tree or a feature enumeration sub-tree.

We first define some quantities used in the estimate.

Page 29: Frequent Closed Pattern Search By Row and Feature Enumeration.

Switching Condition

[Figure: feature enumeration trees over f1, f2, f3, ..., fn, marking the estimated deepest node under each first-level node: {f1, f2, ..., fp} under f1, {f2, f3, ..., fq} under f2, {f3, f4, ..., fk} under f3, and so on.]

Page 30: Frequent Closed Pattern Search By Row and Feature Enumeration.

Switching Condition

Suppose r = 10, S(f1) = 0.8, S(f2) = 0.5, S(f3) = 0.5, S(f4) = 0.3 and minsup = 2, where r is the number of rows and S(f) is the frequency of feature f.

Then the estimated deepest node under f1 is f1 f2 f3, since S(f1) * S(f2) * S(f3) * r = 2 >= minsup, while S(f1) * S(f2) * S(f3) * S(f4) * r = 0.6 < minsup.
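The following Python sketch reproduces this estimate, under the assumption (as in the slide's numbers) that feature frequencies are treated as independent, so the expected support of a path is r times the product of the S values; the function name is mine, not from the paper.

def estimated_deepest_node(start, features, S, r, minsup):
    """Sketch of the deepest-node estimate behind COBBLER's switching
    condition: extend the path from `start` feature by feature while the
    estimated support r * prod(S(f)) stays at or above minsup."""
    path, est = [start], S[start] * r
    for f in features[features.index(start) + 1:]:
        if est * S[f] < minsup:
            break
        est *= S[f]
        path.append(f)
    return path, est

S = {"f1": 0.8, "f2": 0.5, "f3": 0.5, "f4": 0.3}
print(estimated_deepest_node("f1", ["f1", "f2", "f3", "f4"], S, r=10, minsup=2))
# -> (['f1', 'f2', 'f3'], 2.0): the estimated deepest node under f1 is f1 f2 f3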

Page 31: Frequent Closed Pattern Search By Row and Feature Enumeration.

Experiments

We compare 3 algorithms: COBBLER, CHARM and CLOSET+.

We use one real-life dataset and one synthetic dataset.

We vary 3 parameters: minsup, Length Ratio and Row Ratio.

Page 32: Frequent Closed Pattern Search By Row and Feature Enumeration.

minsup

[Figure, left panel (synthetic data): runtime (sec.) vs. minimum support (14% to 20%) for COBBLER, CLOSET+ and CHARM. Right panel (real-life data, thrombin): runtime (sec.) vs. minimum support (6% to 16%) for COBBLER, CLOSET+ and CHARM.]

Page 33: Frequent Closed Pattern Search By Row and Feature Enumeration.

Length and Row ratio

[Figure, left panel: runtime (sec.) vs. Length Ratio (0.75 to 1.05) for COBBLER, CLOSET+ and CHARM. Right panel: runtime (sec.) vs. Row Ratio (0.5 to 2) for COBBLER, CLOSET+ and CHARM. Synthetic data.]

Page 34: Frequent Closed Pattern Search By Row and Feature Enumeration.

Discussion

The combination of row and feature enumeration also brings some disadvantages:

The cost of evaluating the switching condition, and the cost of bad switching decisions.

The increased cost of pruning, since two sets of pruning strategies must be maintained.

Page 35: Frequent Closed Pattern Search By Row and Feature Enumeration.

Discussion

We may use other, more sophisticated data structures in the algorithm to improve performance, e.g., the vertical data format and the diffset technique.

A more efficient switching condition may improve the algorithm further.

Page 36: Frequent Closed Pattern Search By Row and Feature Enumeration.

Conclusion

The COBBLER algorithm gives better performance on datasets where the advantage of switching shows up, e.g., complex datasets or datasets with both a large number of rows and a large number of features.

For data with simple characteristics, a single enumeration algorithm may be better.

Page 37: Frequent Closed Pattern Search By Row and Feature Enumeration.

Future Work

Use other data structures and techniques in the algorithm.

Extend COBBLER to handle datasets that cannot fit into memory.

Page 38: Frequent Closed Pattern Search By Row and Feature Enumeration.

Thanks