
Page 1: CloSpan: Mining Closed Sequential Patterns in Large Datasets

Xifeng Yan, Jiawei Han and Ramin Afshar

Proceedings of 2003 SIAM International Conference on Data Mining (SDM'03), pp. 166-177, San Francisco, CA, May 2003.

Advisor: Professor Hsin-Hsi Chen
Reporter: Clarence Min-Chi Hsieh

Natural Language Processing Laboratory, Dept. of Computer Science and Info. Engineering, NTU

2006/01/10

Page 2: Outline

Introduction
Search Space Pruning
CloSpan
Experimental Results
Conclusions

Page 3: Introduction

Apriori-like algorithms generate a huge set of candidate sequences.
  Ex. With 1000 frequent sequences of length 1, there are 1000×1000 + (1000×999)/2 = 1,499,500 length-2 candidate sequences.

Mining requires many scans of the database.
  Ex. For the sequential pattern {(abc)(abc)(abc)(abc)(abc)}, the Apriori-based method must scan the database at least 15 times.

Long sequential patterns are difficult to mine.
  Ex. There is only a single sequence of length 100 and min_sup = 1.
  Length-1 candidate sequences: 100, length-2: 14950, …; total = 2^100 - 1 ≈ 10^30
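
The candidate counts quoted above can be checked with a few lines of arithmetic. This is only a sketch; the function name `length2_candidates` is made up for illustration and does not come from the slides.

```python
def length2_candidates(n: int) -> int:
    """Number of length-2 candidates built from n frequent items.

    n * n ordered pairs <x y> (two elements) plus n * (n - 1) / 2
    unordered pairs <(x y)> (both items in one element).
    """
    return n * n + n * (n - 1) // 2


print(length2_candidates(1000))  # 1499500
print(length2_candidates(100))   # 14950
print(f"{2**100 - 1:.2e}")       # 1.27e+30
```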

Page 4: Introduction (Cont.)

Definition
  – Sequence, Elements, Subsequence and Sequential Pattern

A sequence: <(ef)(ab)(df)cb>
  Elements: items within an element are listed alphabetically.
  <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>.
  Given support threshold min_sup_count = 2, <(ab)c> is a sequential pattern.

A sequence database

SID   Sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
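
The subsequence and support definitions can be sketched as follows. The helper names (`parse`, `is_subsequence`, `support`) and the representation of a sequence as a list of item sets are assumptions made for this illustration, not something the slides prescribe.

```python
import re


def parse(text):
    """Turn 'a(abc)(ac)d(cf)' into a list of item sets, one per element."""
    tokens = re.findall(r"\([a-z]+\)|[a-z]", text)
    return [frozenset(tok.strip("()")) for tok in tokens]


def is_subsequence(sub, seq):
    """True if every element of sub fits inside a distinct element of seq, in order."""
    pos = 0
    for element in sub:
        # Greedily take the earliest element of seq that contains this one.
        while pos < len(seq) and not element <= seq[pos]:
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True


def support(pattern, database):
    """Number of database sequences that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database.values())


DB = {
    10: parse("a(abc)(ac)d(cf)"),
    20: parse("(ad)c(bc)(ae)"),
    30: parse("(ef)(ab)(df)cb"),
    40: parse("eg(af)cbc"),
}

print(is_subsequence(parse("a(bc)dc"), DB[10]))  # True
print(support(parse("(ab)c"), DB))               # 2
```

With min_sup_count = 2, a support of 2 is what makes <(ab)c> a sequential pattern here.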

Page 5: Introduction (Cont.)

Definition
  – Frequent Sequential Pattern (FS)
      Includes all the sequences whose support is no less than min_sup
  – Closed Frequent Sequential Pattern (CS)
      Includes no sequence which has a super-sequence with the same support

CS ⊆ FS

Page 6: Introduction (Cont.)

Example – FS & CS

ID   Sequence
0    (af)dea
1    eab
2    e(abf)(bde)

min_sup_count = 2

FS: a:3, b:2, d:2, e:3, f:2, ab:2, ad:2, ae:2, (af):2, ea:3, eb:2, fd:2, fe:2, (af)d:2, (af)e:2, eab:2

CS: ea:3, (af)d:2, (af)e:2, eab:2
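
As a small check of the example, the closed set can be derived from the frequent set by dropping every pattern that has a super-sequence with the same support. This is a plain post-filter written for illustration (helper names are assumptions); it is not how CloSpan itself avoids generating non-closed patterns.

```python
import re


def parse(text):
    """Turn '(af)d' into a list of item sets, one per element."""
    tokens = re.findall(r"\([a-z]+\)|[a-z]", text)
    return [frozenset(tok.strip("()")) for tok in tokens]


def is_subsequence(sub, seq):
    """True if every element of sub fits inside a distinct element of seq, in order."""
    pos = 0
    for element in sub:
        while pos < len(seq) and not element <= seq[pos]:
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True


# Frequent sequences and supports copied from the example above.
FS = {"a": 3, "b": 2, "d": 2, "e": 3, "f": 2, "ab": 2, "ad": 2, "ae": 2,
      "(af)": 2, "ea": 3, "eb": 2, "fd": 2, "fe": 2,
      "(af)d": 2, "(af)e": 2, "eab": 2}


def closed_patterns(fs):
    """Keep the patterns with no super-sequence of equal support."""
    closed = {}
    for p, sup in fs.items():
        absorbed = any(
            q != p and fs[q] == sup and is_subsequence(parse(p), parse(q))
            for q in fs
        )
        if not absorbed:
            closed[p] = sup
    return closed


print(closed_patterns(FS))  # {'ea': 3, '(af)d': 2, '(af)e': 2, 'eab': 2}
```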

Page 7: Introduction (Cont.)

Definition
  – Prefix and Postfix (Projection)
      <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>

Given sequence <a(abc)(ac)d(cf)>:

Prefix   Postfix / Projection
<a>      <(abc)(ac)d(cf)>
<aa>     <(_bc)(ac)d(cf)>
<ab>     <(_c)(ac)d(cf)>
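
The three projections in the table can be reproduced with a short sketch. It makes two simplifying assumptions that are not in the slides: each element is stored as a string of alphabetically sorted items, and the prefix consists of single-item elements only (so <a>, <aa>, <ab> are covered, but not <a(ab)>). The names `project` and `show` are made up.

```python
def project(seq, prefix):
    """Postfix of seq with respect to prefix, or None if prefix does not occur.

    seq:    list of elements, each a string of alphabetically sorted items
    prefix: non-empty string of items, each forming its own one-item element
    """
    pos = -1
    for item in prefix:
        # Leftmost element after the previous match that contains the item.
        pos = next((i for i in range(pos + 1, len(seq)) if item in seq[i]), None)
        if pos is None:
            return None
    last = seq[pos]
    rest = last[last.index(prefix[-1]) + 1:]  # items after the matched one
    return (["_" + rest] if rest else []) + seq[pos + 1:]


def show(postfix):
    """Render a postfix in the <...> notation used on the slide."""
    return "<" + "".join(e if len(e) == 1 else f"({e})" for e in postfix) + ">"


s = ["a", "abc", "ac", "d", "cf"]  # <a(abc)(ac)d(cf)>
for p in ("a", "aa", "ab"):
    print(p, show(project(s, p)))
# a <(abc)(ac)d(cf)>
# aa <(_bc)(ac)d(cf)>
# ab <(_c)(ac)d(cf)>
```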

Page 8: Introduction (Cont.)

Definition
  – sequence s = <t_1, t_2, …, t_m>
  – an item α
  – I-Step extension
      s ⋄_i α = <t_1, t_2, …, t_m ∪ {α}>
      Ex: <(ae)> is an I-Step extension of <(a)>
  – S-Step extension
      s ⋄_s α = <t_1, t_2, …, t_m, {α}>
      Ex: <(a)(e)> is an S-Step extension of <(a)>
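
The two extension steps are easy to state in code. In this sketch a sequence is a tuple of elements and each element is a tuple of sorted items; the ordering check in `i_step` (the new item must sort after every item of the last element) is an assumption added here so that each itemset is written in one canonical way.

```python
def i_step(s, item):
    """I-Step: add item to the last element of s."""
    *head, last = s
    if any(k >= item for k in last):
        raise ValueError("item must sort after every item of the last element")
    return (*head, last + (item,))


def s_step(s, item):
    """S-Step: append item to s as a new one-item element."""
    return (*s, (item,))


s = (("a",),)          # <(a)>
print(i_step(s, "e"))  # (('a', 'e'),)      i.e. <(ae)>
print(s_step(s, "e"))  # (('a',), ('e',))   i.e. <(a)(e)>
```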

Page 9: Introduction (Cont.)

Definition
  – Prefix Search Tree

[Figure: prefix search tree rooted at <>. Edge labels give the appended item and the extension type (subscript s = S-Step, i = I-Step): a_s, b_i, a_s, b_s, a_s, b_s, b_s, d_i, c_i. Nodes shown: <>; <(a)>, <(b)>; <(ab)>, <(a)(a)>, <(a)(b)>; <(ab)(a)>, <(ab)(b)>, <(a)(bc)>, <(a)(bd)>.]

Page 10: Search Space Pruning

Definition
  – Common Prefix
      Example
        – D_s = {de(af), de(fg)}
        – Every sequence in D_s begins with <de>, so s ⋄ <e> has the same support as s ⋄ <de> and is not closed; it is unnecessary to extend s ⋄ <e>
  – Partial Order
      Example
        – Before projecting D into D_a, D_b, D_d, D_e, D_f
        – a is always before f in all the sequences
        – Need not search any sequence beginning with f

Page 11: Search Space Pruning (Cont.)

Definition
  – I(D): total number of items in D
  – Equivalence of Projected Database
      Two sequences s and s', s ⊑ s':  D_s = D_s'  ⇔  I(D_s) = I(D_s')
      Example
        – D_(af) = D_f = {de, (de)}
          I(D_(af)) = I(D_f) = 4

ID   Sequence
0    (af)dea
1    eab
2    e(abf)(bde)
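
The practical point of the equivalence is that, for s ⊑ s', two projected databases can be compared by a single integer instead of sequence by sequence. A minimal sketch of the size measure, assuming projected sequences are written as strings in the slide's notation (the name `size` is made up):

```python
def size(projected_db):
    """I(D): total number of items in a projected database.

    Parentheses and the '_' placeholder are notation, not items.
    """
    return sum(ch.isalpha() for seq in projected_db for ch in seq)


D_af = ["de", "(de)"]  # D_(af) from the example
D_f = ["de", "(de)"]   # D_f from the example

print(size(D_af), size(D_f))   # 4 4
print(size(D_af) == size(D_f)) # True
```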

Page 12: Search Space Pruning (Cont.)

Definition
  – Early Termination by Equivalence
      Two sequences s and s', s ⊑ s', and also I(D_s) = I(D_s')
      Then ∀γ, support(s ⋄ γ) = support(s' ⋄ γ)
      Example
        I(D_(af)) = I(D_f)
        – (af)d & (af)e are frequent
        – support((af)d) = support(fd)
        – support((af)e) = support(fe)
        – no need to compute the supports of fd and fe separately
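
The equalities in the example can be confirmed by counting supports directly on the three-sequence database used throughout the example. The helpers below are assumptions made for this check; CloSpan's point is precisely that these counts need not be recomputed.

```python
import re


def parse(text):
    """Turn 'e(abf)(bde)' into a list of item sets, one per element."""
    tokens = re.findall(r"\([a-z]+\)|[a-z]", text)
    return [frozenset(tok.strip("()")) for tok in tokens]


def is_subsequence(sub, seq):
    """True if every element of sub fits inside a distinct element of seq, in order."""
    pos = 0
    for element in sub:
        while pos < len(seq) and not element <= seq[pos]:
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True


def support(pattern, database):
    return sum(is_subsequence(parse(pattern), seq) for seq in database)


DB = [parse("(af)dea"), parse("eab"), parse("e(abf)(bde)")]

for ext in ("d", "e"):
    print(ext, support("f" + ext, DB), support("(af)" + ext, DB))
# d 2 2
# e 2 2
```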

Page 13: Search Space Pruning (Cont.)

Definition
  – Backward Sub-Pattern
      If s < s', s ⊐ s' and I(D_s) = I(D_s'): stop searching any descendant of s' in the prefix search tree

[Figure: prefix search tree fragments with nodes a and f, marking s and s'.]

Page 14: Search Space Pruning (Cont.)

Definition
  – Backward Super-Pattern
      If s < s', s ⊏ s' and I(D_s) = I(D_s'): transplant the descendants of s to s' instead of searching any descendant of s' in the prefix search tree

[Figure: prefix search tree fragments with nodes b and e, marking s and s'.]

Page 15: Search Space Pruning (Cont.)

Definition
  – Partial Prefix Sequence Lattice

[Figure: search space drawn as a prefix sequence lattice rooted at <>, with edges labelled f_i, f_s, a_s, e_s, b_s, b_s, a_s, b_s, b_s, d_s, e_s. Annotations:]
  I(D_eb) = I(D_b)
  I(D_eab) = I(D_ab)
  I(D_(af)) = I(D_f)

Page 16: CloSpan

CloSpan(s, D_s, min_sup, L)
  – Input: A sequence s, a projected DB D_s, and min_sup
  – Output: The prefix search lattice L
  – Check whether a discovered sequence s' exists s.t. either s ⊑ s' or s' ⊑ s, and I(D_s) = I(D_s');
  – if such super-pattern or sub-pattern exists then
      Modify the link in L, return;
  – else insert s into L;
  – scan D_s once, find every frequent item α such that
      s can be extended to (s ⋄_i α), or
      s can be extended to (s ⋄_s α);
  – if no valid α available then
      return;
  – for each valid α do (I-Step)
      Call CloSpan(s ⋄_i α, D_(s ⋄_i α), min_sup, L);
  – for each valid α do (S-Step)
      Call CloSpan(s ⋄_s α, D_(s ⋄_s α), min_sup, L);
  – return;
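
The recursive I-Step / S-Step structure of the procedure can be illustrated with a deliberately simplified sketch. It is not CloSpan: support is counted by rescanning the whole database instead of projecting it, and the lattice check with early termination is replaced by a post-filter that removes non-closed patterns at the end. All names are made up for this illustration.

```python
def contains(seq, pattern):
    """True if pattern is a subsequence of seq (both are tuples of item tuples)."""
    pos = 0
    for element in pattern:
        while pos < len(seq) and not set(element) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True


def support(pattern, db):
    return sum(contains(seq, pattern) for seq in db)


def grow(s, db, items, min_sup, found):
    """Depth-first prefix growth: I-Step extensions first, then S-Step ones."""
    i_ext = [s[:-1] + (s[-1] + (a,),) for a in items if s and a > s[-1][-1]]
    s_ext = [s + ((a,),) for a in items]
    for t in i_ext + s_ext:
        sup = support(t, db)
        if sup >= min_sup:
            found[t] = sup
            grow(t, db, items, min_sup, found)


def closed_only(found):
    """Post-filter: drop patterns that have a super-sequence of equal support."""
    return {p: n for p, n in found.items()
            if not any(q != p and m == n and contains(q, p)
                       for q, m in found.items())}


def fmt(p):
    return "".join(e[0] if len(e) == 1 else "(" + "".join(e) + ")" for e in p)


DB = [
    (("a", "f"), ("d",), ("e",), ("a",)),       # (af)dea
    (("e",), ("a",), ("b",)),                   # eab
    (("e",), ("a", "b", "f"), ("b", "d", "e")), # e(abf)(bde)
]
ITEMS = sorted({i for seq in DB for el in seq for i in el})

frequent = {}
grow((), DB, ITEMS, 2, frequent)
closed = {fmt(p): n for p, n in closed_only(frequent).items()}
print(len(frequent))  # 16
print(closed)
```

On the example database this finds the 16 frequent sequences and leaves ea:3, (af)d:2, (af)e:2 and eab:2 after filtering, at the cost of exploring every frequent prefix, which is the work the lattice check is meant to avoid.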

Page 17: CloSpan (Cont.)

Hash for Fast Condition Checking

[Figure: prefix search lattice rooted at <> (edges f_i, a_s, e_s, b_s, a_s, d_s, e_s) next to a hash table of <key, s> entries. Each entry is <I(D_s), s>; empty buckets are nil.]
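
One way to picture the hash table is as an index from I(D_s) to the sequences already discovered with that size, so that only sequences in the same bucket need a containment test. The class below is a hypothetical sketch: its name and methods are invented here, and it keys a dictionary on I(D_s) directly rather than on a value reduced modulo a table size.

```python
from collections import defaultdict


def contains(seq, pattern):
    """True if pattern is a subsequence of seq (both are tuples of item tuples)."""
    pos = 0
    for element in pattern:
        while pos < len(seq) and not set(element) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True


class PatternIndex:
    """Hash table of <I(D_s), s> entries for discovered sequences."""

    def __init__(self):
        self.buckets = defaultdict(list)

    def find_equivalent(self, s, size):
        """Return a stored s' with the same key that contains s or is contained in s."""
        for other in self.buckets.get(size, ()):
            if contains(other, s) or contains(s, other):
                return other
        return None

    def insert(self, s, size):
        self.buckets[size].append(s)


index = PatternIndex()
index.insert((("a", "f"),), 4)                    # <(af)> with I(D_(af)) = 4
print(index.find_equivalent((("f",),), 4))        # (('a', 'f'),)
print(index.find_equivalent((("e",),), 6))        # None
```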

Page 18: CloSpan (Cont.)

Example

ID   Sequence
0    (af)dea
1    eab
2    e(abf)(bde)

min_sup_count = 2
Hash Function: Mod 4

Frequent items: a:3, b:2, d:2, e:3, f:2

Page 19: CloSpan (Cont.)

Example (Cont.)

Projected databases:

D_a   (_f)dea, b, (_bf)(bde)
D_b   (_f)(bde)
D_d   ea, (_e)
D_e   a, ab, (abf)(bde)
D_f   dea, (bde)

Keeping only the locally frequent items (X = empty):

DB    Sequences               Frequent items            I(D_s)
D_a   (_f)de, b, (_f)(bde)    (_f):2, b:2, d:2, e:2     8
D_e   a, ab, (ab)b            a:3, b:2                  6
D_f   de, (de)                d:2, e:2                  4
D_b   X                       X                         0
D_d   X                       X                         0

[Figure: root <> of the prefix search lattice and an empty hash table with buckets 0 to 3, all nil.]
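
The sizes in the table, and the bucket each one falls into under the Mod 4 hash function of this example, can be recomputed as follows. The dictionary simply restates the reduced projected databases from the table above; the variable names are made up.

```python
# Reduced projected databases, copied from the table above.
reduced = {
    "D_a": ["(_f)de", "b", "(_f)(bde)"],
    "D_e": ["a", "ab", "(ab)b"],
    "D_f": ["de", "(de)"],
    "D_b": [],
    "D_d": [],
}

keys = {}
for name, db in reduced.items():
    n_items = sum(ch.isalpha() for seq in db for ch in seq)  # I(D_s)
    keys[name] = (n_items, n_items % 4)                      # (size, bucket)
    print(name, n_items, n_items % 4)
# D_a 8 0
# D_e 6 2
# D_f 4 0
# D_b 0 0
# D_d 0 0
```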

Page 20: CloSpan (Cont.)

Example (Cont.)

D_a: (_f)de, b, (_f)(bde)
Frequent items: (_f):2, b:2, d:2, e:2
I(D_s) = 8; 8 Mod 4 = 0

[Figure: prefix search lattice with node a_s:3 under <>; hash table with buckets 0 to 3 holding key 8.]

Page 21: CloSpan (Cont.)

Example (Cont.)

Projected databases:

D_(af)   de, (bde)
D_ab     de
D_ad     e, e
D_ae     (empty)

Keeping only the locally frequent items (X = empty):

DB       Sequences    Frequent items    I(D_s)
D_(af)   de, (de)     d:2, e:2          4
D_ab     X            X                 0
D_ad     e, e         e:2               2
D_ae     X            X                 0

Page 22: CloSpan (Cont.)

Example (Cont.)

D_(af): de, (de)
Frequent items: d:2, e:2
I(D_s) = 4; 4 Mod 4 = 0

[Figure: prefix search lattice with nodes a_s:3 and f_i:2 under <>; hash table with buckets 0 to 3 holding keys 8 and 4.]

Page 23: CloSpan (Cont.)

Example (Cont.)

Projected databases:

D_(af)d   e, (_e)
D_(af)e   (empty)

Keeping only the locally frequent items (X = empty):

DB        Sequences    Frequent items    I(D_s)
D_(af)d   X            X                 0
D_(af)e   X            X                 0

Page 24: CloSpan (Cont.)

Example (Cont.)

D_(af)d: X (empty)
I(D_s) = 0; 0 Mod 4 = 0

[Figure: prefix search lattice with nodes a_s:3, f_i:2 and d_s:2; hash table with buckets 0 to 3 holding keys 8, 0 and 4.]

Page 25: CloSpan (Cont.)

Example (Cont.)

D_(af)e: X (empty)
I(D_s) = 0; 0 Mod 4 = 0

[Figure: prefix search lattice with nodes a_s:3, f_i:2, d_s:2 and e_s:2; hash table with buckets 0 to 3 holding keys 8, 0, 4 and 0.]

Page 26: CloSpan (Cont.)

Example (Cont.)

D_ab: X (empty)
I(D_s) = 0; 0 Mod 4 = 0

[Figure: prefix search lattice with nodes a_s:3, f_i:2, d_s:2, e_s:2 and b_s:2; hash table with buckets 0 to 3 holding keys 8, 0, 4, 0 and 0.]

Page 27: CloSpan (Cont.)

Example (Cont.)

D_b: X (empty)
I(D_s) = 0; 0 Mod 4 = 0

[Figure: prefix search lattice with nodes a_s:3, f_i:2, d_s:2, e_s:2 and b_s:2; hash table with buckets 0 to 3 holding keys 8, 0, 4, 0 and 0.]

Page 28: CloSpan (Cont.)

Example (Cont.)

D_d: X (empty)
I(D_s) = 0; 0 Mod 4 = 0

[Figure: prefix search lattice with nodes a_s:3, f_i:2, d_s:2, e_s:2 and b_s:2; hash table with buckets 0 to 3 holding keys 8, 0, 4, 0 and 0.]

Page 29: CloSpan (Cont.)

Example (Cont.)

D_e: a, ab, (ab)b
Frequent items: a:3, b:2
I(D_s) = 6; 6 Mod 4 = 2

[Figure: prefix search lattice with nodes a_s:3, f_i:2, d_s:2, e_s:2, b_s:2 and the new node e_s:3; hash table with buckets 0 to 3 holding keys 8, 0, 4, 0, 0 and 6.]

Page 30: CloSpan (Cont.)

Example (Cont.)

Projected databases:

D_ea   b, (_b)b
D_eb   (empty)

Keeping only the locally frequent items (X = empty):

DB     Sequences    Frequent items    I(D_s)
D_ea   b, b         b:2               2
D_eb   X            X                 0

Page 31: CloSpan (Cont.)

Example (Cont.)

D_ea: b, b
Frequent items: b:2
I(D_s) = 2; 2 Mod 4 = 2

[Figure: prefix search lattice with nodes a_s:3, f_i:2, d_s:2, e_s:2, b_s:2, e_s:3 and the new node a_s:3; hash table with buckets 0 to 3 holding keys 8, 0, 4, 0, 0, 2 and 6.]

Page 32: CloSpan (Cont.)

Example (Cont.)

D_eab: (empty)

Keeping only the locally frequent items: X (empty)
I(D_s) = 0; 0 Mod 4 = 0

Page 33: CloSpan (Cont.)

Example (Cont.)

D_eab: X (empty)
I(D_s) = 0; 0 Mod 4 = 0

[Figure: prefix search lattice with nodes a_s:3, f_i:2, d_s:2, e_s:2, b_s:2, e_s:3 and a_s:3; hash table with buckets 0 to 3 holding keys 8, 0, 4, 0, 0, 2 and 6.]

Page 34: CloSpan (Cont.)

Example (Cont.)

D_eab: X (empty)
I(D_s) = 0; 0 Mod 4 = 0

[Figure: prefix search lattice with nodes a_s:3, f_i:2, d_s:2, e_s:2, b_s:2, e_s:3, a_s:3 and the new node b_s:2; hash table with buckets 0 to 3 holding keys 8, 0, 4, 0, 0, 2 and 6.]

Page 35: CloSpan (Cont.)

Example (Cont.)

D_eb: X (empty)
I(D_s) = 0; 0 Mod 4 = 0

[Figure: prefix search lattice with nodes a_s:3, f_i:2, d_s:2, e_s:2, b_s:2, e_s:3, a_s:3 and b_s:2; hash table with buckets 0 to 3 holding keys 8, 0, 4, 0, 0, 2 and 6.]

Page 36: CloSpan (Cont.)

Example (Cont.)

D_eb: X (empty)
I(D_s) = 0; 0 Mod 4 = 0

[Figure: prefix search lattice with nodes a_s:3, f_i:2, d_s:2, e_s:2, b_s:2, e_s:3, a_s:3 and b_s:2; hash table with buckets 0 to 3 holding keys 8, 0, 4, 0, 0, 2 and 6.]

Page 37: CloSpan: Mining Closed Sequential Patterns in Large Datasets Xifeng Yan, Jiawei Han and Ramin Afshar Proceedings of 2003 SIAM International Conference.

SlideSlide - - 3737Copyright © Natural Language Processing Lab., NTU, 2006Copyright © Natural Language Processing Lab., NTU, 2006

Reporter: Clarence Min-Chi HsiehReporter: Clarence Min-Chi Hsieh CloSpan: Mining Closed Sequential CloSpan: Mining Closed Sequential Patterns in Large DatasetsPatterns in Large Datasets

CloSpan (Cont.)

Example (Cont.)

[Figure: prefix search lattice rooted at <> with nodes a_s:3, f_i:2, d_s:2, e_s:2, b_s:2, e_s:3, a_s:3, b_s:2 (leaves marked nil), shown next to a hash table (Mod 4, slots 0 to 3). A box labeled D_f lists "de, (de)" with counts d:2, e:2 and the value 4 under (D_s). Other labels in the figure: 0, 8, 26.]


CloSpan (Cont.)

Example (Cont.)

[Figure: prefix search lattice rooted at <> with nodes a_s:3, f_i:2, d_s:2, e_s:2, b_s:2, e_s:3, a_s:3, b_s:2 (leaves marked nil), shown next to a hash table (Mod 4, slots 0 to 3). A box labeled D_f lists "de, (de)" with counts d:2, e:2 and the value 4 under (D_s). Other labels in the figure: 0, 8, 26.]


CloSpan (Cont.)

Example (Cont.)

[Figure: prefix search lattice rooted at <> with nodes a_s:3, f_i:2, d_s:2, e_s:2, b_s:2, e_s:3, a_s:3, b_s:2, annotated with the sequences (af)d:2, (af)e:2, eab:2 and ea:3.]


Experimental Results

Synthetic Data
– Parameters
  D: Number of sequences (in thousands)
  C: Average itemsets per sequence
  T: Average items per itemset
  N: Number of different items (in thousands)
  S: Average itemsets in maximal sequences
  I: Average items in maximal sequences
– Two data sets
  D10 C10 T2.5 N10 S6 I2.5
  D5 C20 T20 N10 S20 I20

Real-world dataset
– KDDCup2000 – Gazelle click stream
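
The data set names above pack the six generator parameters into one string. As a minimal sketch of how such a name can be decoded, the Python below splits it into letter/value pairs; the function name `parse_dataset_name` and the dictionary `PARAMETER_MEANING` are hypothetical helpers for illustration, not part of any data generator.

```python
import re

# Meaning of each letter, copied from the parameter list on the slide.
PARAMETER_MEANING = {
    "D": "number of sequences (in thousands)",
    "C": "average itemsets per sequence",
    "T": "average items per itemset",
    "N": "number of different items (in thousands)",
    "S": "average itemsets in maximal sequences",
    "I": "average items in maximal sequences",
}


def parse_dataset_name(name):
    """Split a name such as 'D10 C10 T2.5 N10 S6 I2.5' into {letter: value}."""
    pairs = re.findall(r"([A-Z])(\d+(?:\.\d+)?)", name)
    return {letter: float(value) for letter, value in pairs}


if __name__ == "__main__":
    for name in ("D10 C10 T2.5 N10 S6 I2.5", "D5 C20 T20 N10 S20 I20"):
        params = parse_dataset_name(name)
        print(name)
        for letter, value in params.items():
            print(f"  {letter} = {value:g}: {PARAMETER_MEANING[letter]}")
```

Under this reading, the first data set has 10,000 sequences drawn from 10,000 distinct items, and the second has 5,000 sequences with much longer itemsets.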


Experimental Results (Cont.)

Synthetic Data

D10 C10 T2.5 N10 S6 I2.5


Experimental Results (Cont.)

Synthetic Data

D5 C20 T20 N10 S20 I20


Experimental Results (Cont.)

Real-world dataset
– KDDCup2000
  29,369 sequences
  35,722 sessions
  87,546 page views
  The average number of sessions in a sequence is around 1
  The average number of page views in a session is 2
  The largest session contains 342 views
  The longest sequence has 140 sessions
  The largest sequence contains 651 page views
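
The two averages can be checked directly from the three totals. The short Python sketch below only does that arithmetic; the variable names are illustrative, and the totals are the ones quoted on the slide.

```python
# Totals quoted for the KDDCup2000 click-stream data set.
sequences = 29_369
sessions = 35_722
page_views = 87_546

# Average sessions per sequence and page views per session.
sessions_per_sequence = sessions / sequences
views_per_session = page_views / sessions

print(f"sessions per sequence:  {sessions_per_sequence:.2f}")
print(f"page views per session: {views_per_session:.2f}")
```

The ratios come out at roughly 1.2 and 2.5, which the slide quotes in rounded form as "around 1" and "2".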


Experimental Results (Cont.)


Conclusions

CloSpan mines frequent closed sequences efficiently.
CloSpan outperforms PrefixSpan.


Lexicographic Order

Definition
– Lexicographic Order
  $t = \{i_1, i_2, \ldots, i_k\}$, $i_1 \le i_2 \le \ldots \le i_k$
  $t' = \{j_1, j_2, \ldots, j_l\}$, $j_1 \le j_2 \le \ldots \le j_l$
  $t < t'$ iff either of the following is true:
  – For some $h$, $0 \le h \le \min\{k, l\}$, we have $i_r = j_r$ for $r < h$, and $i_h < j_h$, or
  – $k < l$, and $i_1 = j_1, i_2 = j_2, \ldots, i_k = j_k$

Example
– (a,f) < (b,f)
– (a,b) < (a,b,c)
– (a,b,c) < (b,c)
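
The two conditions of the definition translate almost line by line into code. The sketch below is one way to write the comparison in Python, assuming itemsets are stored as sorted tuples of single-character items; `itemset_less` is a hypothetical name chosen for illustration.

```python
def itemset_less(t, t_prime):
    """Return True if t < t_prime in the itemset lexicographic order.

    Both arguments are assumed to be tuples of items in ascending order.
    """
    k, l = len(t), len(t_prime)
    # Condition 1: the items agree up to some position, then t has the smaller item.
    for h in range(min(k, l)):
        if t[h] != t_prime[h]:
            return t[h] < t_prime[h]
    # Condition 2: t is a proper prefix of t_prime.
    return k < l


if __name__ == "__main__":
    print(itemset_less(("a", "f"), ("b", "f")))
    print(itemset_less(("a", "b"), ("a", "b", "c")))
    print(itemset_less(("a", "b", "c"), ("b", "c")))
```

For sorted tuples this agrees with Python's built-in tuple comparison, so `t < t_prime` on the tuples gives the same answer; the explicit loop is kept only to mirror the definition.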


Sequence Lexicographic Order

Definition
– Sequence Lexicographic Order
  If $s' = s \diamond p$, then $s < s'$
  If $s = \alpha \diamond_i p$ and $s' = \alpha \diamond_s p'$, then no matter what the order relation between $p$ and $p'$ is, $s < s'$
  If $s = \alpha \diamond_i p$ and $s' = \alpha \diamond_i p'$, then $p < p'$ indicates $s < s'$
  If $s = \alpha \diamond_s p$ and $s' = \alpha \diamond_s p'$, then $p < p'$ indicates $s < s'$
  (Here $\diamond_i$ denotes an itemset-extension and $\diamond_s$ a sequence-extension.)

Example
– (ab) < (ab)(a)
– (ac) < (a)(d), (ad) < (a)(c)
– (ab) < (ac)
– (a)(b) < (a)(c)
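
One way to make these rules concrete is to flatten each sequence into a list of (extension type, item) tokens and compare the token lists. The Python sketch below does that under stated assumptions: sequences are tuples of itemsets, each itemset a tuple of items, and the encoding (0 for an item added to the current itemset, 1 for an item that opens a new itemset) is an illustrative choice, not notation from the slides. The names `tokens` and `sequence_less` are hypothetical.

```python
def tokens(sequence):
    """Flatten a sequence into (extension_type, item) pairs.

    extension_type is 1 when the item opens a new itemset (sequence-extension)
    and 0 when it joins the current itemset (itemset-extension), so that an
    itemset-extension sorts before a sequence-extension.
    """
    result = []
    for itemset in sequence:
        for position, item in enumerate(sorted(itemset)):
            result.append((1 if position == 0 else 0, item))
    return result


def sequence_less(s, s_prime):
    """Return True if s < s_prime in the sequence lexicographic order."""
    # Python list comparison is lexicographic and treats a proper prefix as
    # smaller, which covers the first rule (s' extends s).
    return tokens(s) < tokens(s_prime)


if __name__ == "__main__":
    ab = (("a", "b"),)
    ab_a = (("a", "b"), ("a",))
    ac = (("a", "c"),)
    a_d = (("a",), ("d",))
    print(sequence_less(ab, ab_a))
    print(sequence_less(ac, a_d))
```

Both calls return True, matching the first two examples on the slide.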


Lexicographic Sequence Tree

Definition
– Lexicographic Sequence Tree

[Figure: tree levels, top to bottom]
<>
<(a)>  <(b)>
<(ab)>  <(a)(a)>  <(a)(b)>
<(ab)(a)>  <(ab)(b)>  <(a)(bc)>  <(a)(bd)>
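
The levels of this tree can be generated by extending each node in two ways: adding a larger item to its last itemset, or appending a new one-item itemset. The Python sketch below enumerates the children of a node in that order, assuming sequences are tuples of sorted item tuples; `children` is a hypothetical helper, and the alphabet passed in is an assumption of the example.

```python
def children(sequence, alphabet):
    """List the children of a node, itemset-extensions before sequence-extensions."""
    items = sorted(alphabet)
    result = []
    if sequence:
        last = sequence[-1]
        # Itemset-extension: add an item larger than the last item of the last itemset.
        for item in items:
            if item > last[-1]:
                result.append(sequence[:-1] + (last + (item,),))
    # Sequence-extension: append a new itemset holding a single item.
    for item in items:
        result.append(sequence + ((item,),))
    return result


def show(sequence):
    """Format a sequence the way the slides write it, e.g. <(ab)(a)>."""
    return "<" + "".join("(" + "".join(itemset) + ")" for itemset in sequence) + ">"


if __name__ == "__main__":
    root = ()
    for node in children(root, "ab"):
        print(show(node), "->", [show(child) for child in children(node, "ab")])
```

With the alphabet {a, b}, the children of <(a)> come out as <(ab)>, <(a)(a)>, <(a)(b)>, the same left-to-right order as the third level of the figure.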


Search Space Pruning

Definition
– Common Prefix
  Given a subsequence $s$ and its projected database $D_s$:
  if $\exists \alpha$ such that $\alpha$ is a common prefix for all the sequences with the same extension type (either itemset-extension or sequence-extension) in $D_s$,
  then $\forall \beta$, if $s \diamond \beta$ ($s$ extended by $\beta$) is closed, $\alpha$ must be a prefix of $\beta$; we need not search $s \diamond \beta$ and its descendants except the branch of $s \diamond \alpha$

Example
– $D_s$ = {de(af), de(fg)}
– <de> is a common prefix, so $s \diamond$ <e> is not closed; it is unnecessary to extend $s \diamond$ <e>
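
To illustrate the common-prefix idea on the example above, the Python sketch below computes the longest prefix shared by every sequence of a small projected database. It is a simplified reading, not the pruning routine itself: sequences are tuples of item tuples, each item is tagged "s" when it opens a new itemset and "i" when it joins the current one, and the split by extension type mentioned in the definition is left out. The names `tokens` and `common_prefix` are hypothetical.

```python
def tokens(sequence):
    """Flatten a sequence into (extension_type, item) pairs ('s' or 'i')."""
    result = []
    for itemset in sequence:
        for position, item in enumerate(sorted(itemset)):
            result.append(("s" if position == 0 else "i", item))
    return result


def common_prefix(projected_db):
    """Return the longest token prefix shared by all sequences in projected_db."""
    prefix = []
    for column in zip(*(tokens(sequence) for sequence in projected_db)):
        if len(set(column)) != 1:
            break  # the sequences disagree at this position
        prefix.append(column[0])
    return prefix


if __name__ == "__main__":
    # The example database {de(af), de(fg)} written as tuples of itemsets.
    d_s = [
        (("d",), ("e",), ("a", "f")),
        (("d",), ("e",), ("f", "g")),
    ]
    print(common_prefix(d_s))
```

For this database the shared prefix is the two sequence-extensions d then e, that is <de>, which is the prefix every closed extension of s has to start with.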


Search Space Pruning (Cont.)

CommonPrefix
– An intermediate algorithm
– Adopts the PrefixSpan framework plus the common prefix pruning technique
– Outperforms PrefixSpan


Search Space Pruning (Cont.)

Definition
– Partial Order
  Given a sequence $s$ and its projected database $D_s$:
  if among all the sequences in $D_s$, an item $\alpha$ always occurs before an item $\beta$ (either in the same itemset for all sequences in $D_s$ or in a different itemset, but not both), then $D_{s \diamond \alpha \diamond \beta} = D_{s \diamond \beta}$
  Hence $s \diamond \beta$ ($s$ extended by $\beta$) is not closed. We need not search any sequence in the branch of $s \diamond \beta$

Example
– Before projecting $D$ into $D_a$, $D_b$, $D_d$, $D_e$, $D_f$
– a is always before f in all the sequences
– Need not search any sequence beginning with f
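
The "always occurs before" test can be sketched as a scan over the database. The Python below is one possible reading, with loudly stated assumptions: sequences are tuples of sorted item tuples, only the first occurrence of the later item is examined, sequences that do not contain it are skipped, and the toy database in the usage lines is hypothetical rather than taken from the slides. The function names are illustrative.

```python
def relation_kinds(sequence, x, y):
    """How x precedes the first occurrence of y: 'same' itemset, 'earlier' itemset, or both.

    Returns None when y does not occur in the sequence.
    """
    for index, itemset in enumerate(sequence):
        if y in itemset:
            kinds = set()
            if x in itemset and x < y:
                kinds.add("same")
            if any(x in earlier for earlier in sequence[:index]):
                kinds.add("earlier")
            return kinds
    return None


def always_before(database, x, y):
    """True if x precedes y in every sequence containing y, always in the same way."""
    seen = set()
    for sequence in database:
        kinds = relation_kinds(sequence, x, y)
        if kinds is None:
            continue  # y is absent, so this sequence imposes no constraint
        if len(kinds) != 1:
            return False  # x is missing before y, or precedes it in both ways
        seen |= kinds
    return len(seen) == 1


if __name__ == "__main__":
    # Hypothetical toy database: a shares an itemset with f wherever f occurs.
    toy_db = [
        (("a", "f"), ("d",)),
        (("e",), ("a", "b", "f")),
    ]
    print(always_before(toy_db, "a", "f"))
    print(always_before(toy_db, "d", "a"))
```

When the check succeeds for a pair such as (a, f), the slide's conclusion is that sequences beginning with f need not be searched.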