CloSpan: Mining Closed Sequential Patterns in Large Datasets Xifeng Yan, Jiawei Han and Ramin Afshar Proceedings of 2003 SIAM International Conference.

CloSpan: Mining Closed Sequential Patterns in Large Datasets

Xifeng Yan, Jiawei Han and Ramin Afshar

Proceedings of 2003 SIAM International Conference on Data Mining (SDM'03), pp. 166-177, San Francisco, CA, May 2003.

Advisor: Professor Hsin-Hsi Chen
Reporter: Clarence Min-Chi Hsieh

Natural Language Processing Laboratory, Dept. of Computer Science and Info. Engineering, NTU

2006/01/10


Outline
– Introduction
– Search Space Pruning
– CloSpan
– Experimental Results
– Conclusions



Introduction

– An Apriori-like algorithm generates a huge set of candidate sequences.
  Ex. With 1000 frequent sequences of length 1, there are 1000×1000 + (1000×999)/2 = 1,499,500 candidate sequences of length 2.
– Mining requires many scans of the database.
  Ex. For the sequential pattern {(abc)(abc)(abc)(abc)(abc)}, an Apriori-based method must scan the database at least 15 times.
– Long sequential patterns are difficult to mine.
  Ex. With only a single sequence of length 100 and min_sup = 1, the length-1 candidate sequences number 100, length-2: 14,950, …, for a total of 2^100 − 1 ≈ 10^30.
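The candidate counts quoted on this slide can be checked with a few lines of arithmetic. This is only a sketch of the counting argument; the function names `length2_candidates` and `item_combinations` are illustrative and not taken from the paper.

```python
from math import comb


def length2_candidates(n: int) -> int:
    """Length-2 candidates built from n frequent length-1 sequences."""
    # n*n ordered pairs <x y> (two elements) plus n*(n-1)/2 pairs <(xy)> (one element).
    return n * n + n * (n - 1) // 2


def item_combinations(n: int) -> int:
    """Sum of C(n, k) for k = 1..n, the total quoted for the length-100 example."""
    return sum(comb(n, k) for k in range(1, n + 1))


print(length2_candidates(1000))                 # 1499500
print(length2_candidates(100))                  # 14950
print(item_combinations(100) == 2**100 - 1)     # True
print(f"{2**100 - 1:.2e}")                      # 1.27e+30
```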



Introduction (Cont.)

Definition
– Sequence, Elements, Subsequence and Sequential Pattern
  A sequence: <(ef)(ab)(df)cb>
  Elements: items within an element are listed alphabetically
  <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
  Given support threshold min_sup_count = 2, <(ab)c> is a sequential pattern

A sequence database

SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
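A minimal sketch of the subsequence test and support count used in these definitions. It assumes a representation that is not part of the slides: a sequence is a list of strings, one string per element, with the items of an element in alphabetical order. The names `is_subsequence` and `support` are illustrative.

```python
def is_subsequence(pattern, sequence):
    """True if each element of `pattern` is contained, in order,
    in a distinct element of `sequence`."""
    pos = 0
    for elem in pattern:
        # Greedy leftmost match: earliest remaining element that contains `elem`.
        while pos < len(sequence) and not set(elem) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(pattern, database):
    """Number of sequences in `database` that contain `pattern`."""
    return sum(is_subsequence(pattern, seq) for seq in database.values())


DB = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}

print(is_subsequence(["a", "bc", "d", "c"], DB[10]))  # True
print(support(["ab", "c"], DB))                       # 2
```

With min_sup_count = 2, the second result is what makes <(ab)c> a sequential pattern in this database.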




Definition
– Frequent Sequential Pattern (FS)
  Includes all the sequences whose support is no less than min_sup
– Closed Frequent Sequential Pattern (CS)
  Includes no sequence which has a super-sequence with the same support
  $CS \subseteq FS$




Example: FS & CS

ID  Sequence
0   (af)dea
1   eab
2   e(abf)(bde)

min_sup_count = 2

FS: a:3, b:2, d:2, e:3, f:2, ab:2, ad:2, ae:2, (af):2, ea:3, eb:2, fd:2, fe:2, (af)d:2, (af)e:2, eab:2

CS: ea:3, (af)d:2, (af)e:2, eab:2
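The step from FS to CS in this example can be reproduced by a direct filter: keep a pattern only if no other frequent pattern contains it with the same support. This is a brute-force check of the definition, not the CloSpan procedure; the helper names `parse`, `is_subsequence` and `closed_patterns` are illustrative.

```python
def parse(text):
    """Parse a pattern such as '(af)d' into ['af', 'd'], one string per element."""
    elems, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elems.append(text[i + 1:j])
            i = j + 1
        else:
            elems.append(text[i])
            i += 1
    return elems


def is_subsequence(pattern, sequence):
    """Greedy leftmost containment test on lists of elements."""
    pos = 0
    for elem in pattern:
        while pos < len(sequence) and not set(elem) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def closed_patterns(fs):
    """Patterns with no proper super-sequence of equal support in `fs`."""
    return [
        p for p, sup in fs.items()
        if not any(
            q != p and fs[q] == sup and is_subsequence(parse(p), parse(q))
            for q in fs
        )
    ]


FS = {
    "a": 3, "b": 2, "d": 2, "e": 3, "f": 2, "ab": 2, "ad": 2, "ae": 2,
    "(af)": 2, "ea": 3, "eb": 2, "fd": 2, "fe": 2,
    "(af)d": 2, "(af)e": 2, "eab": 2,
}

print(closed_patterns(FS))  # ['ea', '(af)d', '(af)e', 'eab']
```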




Definition
– Prefix and Postfix (Projection)
  <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>

Given sequence <a(abc)(ac)d(cf)>

Prefix  Postfix / Projection
<a>     <(abc)(ac)d(cf)>
<aa>    <(_bc)(ac)d(cf)>
<ab>    <(_c)(ac)d(cf)>
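The projection table can be reproduced with a small function. This is a simplified sketch: it assumes sequences are lists of alphabetically sorted strings (one per element), matches each prefix element at its leftmost possible position, and marks the remainder of a partially consumed element with a leading `_`. The name `project` is illustrative.

```python
def project(sequence, prefix):
    """Postfix of `sequence` with respect to a non-empty `prefix`,
    or None if `sequence` does not contain `prefix`."""
    pos = -1
    for elem in prefix:
        # Leftmost element after the previous match that contains `elem`.
        pos = next(
            (i for i in range(pos + 1, len(sequence))
             if set(elem) <= set(sequence[i])),
            None,
        )
        if pos is None:
            return None
    last_item = prefix[-1][-1]
    # Items of the matched element that sort after the last prefix item.
    rest = "".join(x for x in sequence[pos] if x > last_item)
    return (["_" + rest] if rest else []) + sequence[pos + 1:]


seq = ["a", "abc", "ac", "d", "cf"]
print(project(seq, ["a"]))       # ['abc', 'ac', 'd', 'cf']
print(project(seq, ["a", "a"]))  # ['_bc', 'ac', 'd', 'cf']
print(project(seq, ["a", "b"]))  # ['_c', 'ac', 'd', 'cf']
```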




Definition
– A sequence $s = \langle t_1, t_2, \dots, t_m \rangle$ and an item $\alpha$
– I-Step extension
  $s \diamond_i \alpha = \langle t_1, t_2, \dots, t_m \cup \{\alpha\} \rangle$
  Ex: <(ae)> is an I-Step extension of <(a)>
– S-Step extension
  $s \diamond_s \alpha = \langle t_1, t_2, \dots, t_m, \{\alpha\} \rangle$
  Ex: <(a)(e)> is an S-Step extension of <(a)>
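The two extension types can be written as two small functions. This sketch assumes sequences are lists of alphabetically sorted strings, one per element; the requirement that an I-Step item sorts after the items already in the last element is an assumption taken from the alphabetical listing of items within an element.

```python
def i_step(s, item):
    """I-Step: add `item` to the last element of `s`."""
    if not s or item <= s[-1][-1]:
        raise ValueError("item must sort after the items of the last element")
    return s[:-1] + [s[-1] + item]


def s_step(s, item):
    """S-Step: append `item` to `s` as a new single-item element."""
    return s + [item]


print(i_step(["a"], "e"))  # ['ae']      i.e. <(ae)>
print(s_step(["a"], "e"))  # ['a', 'e']  i.e. <(a)(e)>
```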




Definition
– Prefix Search Tree

[Figure: prefix search tree. Each edge is labelled with the extending item and its extension type (subscript s for S-Step, i for I-Step). Nodes by level:]

<>
<(a)>  <(b)>
<(ab)>  <(a)(a)>  <(a)(b)>
<(ab)(a)>  <(ab)(b)>  <(a)(bc)>  <(a)(bd)>



Search Space Pruning

Definition
– Common Prefix
  Example
  – $D_s$ = {de(af), de(fg)}
  – Every sequence in $D_s$ has the prefix <de>, so $s \diamond \langle de \rangle$ has the same support as s: s is not closed, and it is unnecessary to extend $s \diamond \langle e \rangle$
– Partial Order
  Example
  – Before projecting D into $D_a$, $D_b$, $D_d$, $D_e$, $D_f$
  – a always occurs before f in all the sequences
  – Need not search any sequence beginning with f



Search Space Pruning (Cont.)

Definition
– $\mathcal{I}(D)$: total number of items in D
– Equivalence of Projected Database
  For two sequences s and s' with $s \sqsubseteq s'$: $D_s = D_{s'} \Leftrightarrow \mathcal{I}(D_s) = \mathcal{I}(D_{s'})$
  Example
  – $D_{(af)} = D_f$ = {de, (de)}
  – $\mathcal{I}(D_{(af)}) = \mathcal{I}(D_f) = 4$

ID  Sequence
0   (af)dea
1   eab
2   e(abf)(bde)
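The size of a projected database is a plain item count, which is what makes the equivalence test cheap. A minimal sketch, assuming projected sequences are lists of strings (one per element) in which a leading `_` marks a partially consumed element and is not itself an item; the name `db_size` is illustrative.

```python
def db_size(projected_db):
    """Total number of items in a projected database."""
    return sum(
        len(elem.replace("_", ""))
        for seq in projected_db
        for elem in seq
    )


D_f = [["d", "e"], ["de"]]   # {de, (de)}
D_af = [["d", "e"], ["de"]]  # identical to D_f in the example

print(db_size(D_f), db_size(D_af))  # 4 4
```

Because f is contained in (af), comparing these two integers is enough to conclude that the two projected databases are the same.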




Definition
– Early Termination by Equivalence
  For two sequences s and s' with $s \sqsubseteq s'$ and $\mathcal{I}(D_s) = \mathcal{I}(D_{s'})$:
  $\forall \gamma,\ support(s \diamond \gamma) = support(s' \diamond \gamma)$
  Example
  – $\mathcal{I}(D_{(af)}) = \mathcal{I}(D_f)$
  – (af)d and (af)e are frequent
  – support((af)d) = support(fd)
  – support((af)e) = support(fe)
  – so the supports of fd and fe need not be counted separately




Definition
– Backward Sub-Pattern
  Sequence s < s' (s is discovered before s'), $s \sqsupset s'$ and $\mathcal{I}(D_s) = \mathcal{I}(D_{s'})$
  Stop searching any descendant of s' in the prefix search tree

[Figure: prefix search tree in which s = af (node f under a) and s' = f]




Definition
– Backward Super-Pattern
  Sequence s < s' (s is discovered before s'), $s \sqsubset s'$ and $\mathcal{I}(D_s) = \mathcal{I}(D_{s'})$
  Transplant the descendants of s to s' instead of searching any descendant of s' in the prefix search tree

[Figure: prefix search tree in which s = b and s' = eb (node b under e)]




Definition
– Partial Prefix Sequence Lattice
  Search space

[Figure: partial prefix sequence lattice rooted at <>, with nodes f_i, f_s, a_s, e_s, b_s and d_s]

$\mathcal{I}(D_{eb}) = \mathcal{I}(D_b)$
$\mathcal{I}(D_{eab}) = \mathcal{I}(D_{ab})$
$\mathcal{I}(D_{af}) = \mathcal{I}(D_f)$



CloSpan

CloSpan(s, $D_s$, min_sup, L)
– Input: a sequence s, a projected DB $D_s$, and min_sup
– Output: the prefix search lattice L
– Check whether a discovered sequence s' exists s.t. either $s \sqsubseteq s'$ or $s' \sqsubseteq s$, and $\mathcal{I}(D_s) = \mathcal{I}(D_{s'})$;
– if such a super-pattern or sub-pattern exists then
    modify the link in L, return;
– else insert s into L;
– scan $D_s$ once, find every frequent item $\alpha$ such that
    s can be extended to $(s \diamond_i \alpha)$, or
    s can be extended to $(s \diamond_s \alpha)$;
– if no valid $\alpha$ is available then return;
– for each valid $\alpha$ do I-Step:
    call CloSpan($s \diamond_i \alpha$, $D_{s \diamond_i \alpha}$, min_sup, L);
– for each valid $\alpha$ do S-Step:
    call CloSpan($s \diamond_s \alpha$, $D_{s \diamond_s \alpha}$, min_sup, L);
– return;
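The recursive I-Step / S-Step structure of the procedure can be illustrated with a much simpler baseline. The sketch below is not CloSpan: it uses no projected databases, no lattice L and no equivalence check, counts support by scanning the whole database, and filters closed patterns afterwards. It only shows how the two extension loops enumerate every frequent sequence exactly once. The representation (lists of sorted strings) and all function names are assumptions.

```python
def is_subsequence(pattern, sequence):
    """Greedy leftmost containment test on sequences of elements."""
    pos = 0
    for elem in pattern:
        while pos < len(sequence) and not set(elem) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def mine(db, min_sup, s=None, out=None):
    """Enumerate frequent sequences by recursive I-Step and S-Step extension."""
    s = s or []
    out = {} if out is None else out
    items = sorted({x for seq in db for elem in seq for x in elem})
    for item in items:
        candidates = [s + [item]]                        # S-Step extension
        if s and item > s[-1][-1]:
            candidates.append(s[:-1] + [s[-1] + item])   # I-Step extension
        for cand in candidates:
            sup = sum(is_subsequence(cand, seq) for seq in db)
            if sup >= min_sup:                           # extend only frequent sequences
                out[tuple(cand)] = sup
                mine(db, min_sup, cand, out)
    return out


def closed(patterns):
    """Post-hoc filter: drop patterns with an equal-support super-sequence."""
    return {
        p for p, sup in patterns.items()
        if not any(
            q != p and patterns[q] == sup and is_subsequence(p, q)
            for q in patterns
        )
    }


DB = [["af", "d", "e", "a"], ["e", "a", "b"], ["e", "abf", "bde"]]
fs = mine(DB, min_sup=2)
print(len(fs))             # 16
print(sorted(closed(fs)))  # [('af', 'd'), ('af', 'e'), ('e', 'a'), ('e', 'a', 'b')]
```

On the three-sequence example database this reproduces the 16 frequent and 4 closed sequences listed earlier; CloSpan reaches the same closed set while avoiding most of the repeated support counting.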



CloSpan (Cont.)

Hash for Fast Condition Checking
– Hash table entries: <key, s>, with key = $\mathcal{I}(D_s)$, i.e. < $\mathcal{I}(D_s)$, s >

[Figure: prefix search lattice (nodes f_i, a_s, e_s, b_s, d_s) whose nodes are linked from the hash table buckets]
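A minimal sketch of such an index: discovered sequences are stored under the size of their projected database, so the sub-pattern / super-pattern check only has to look at sequences with an equal key. The class `SizeIndex`, its methods and the use of a Python dictionary keyed directly on the size (rather than an explicit hash function) are assumptions for illustration.

```python
from collections import defaultdict


def is_subsequence(pattern, sequence):
    """Greedy leftmost containment test on lists of elements."""
    pos = 0
    for elem in pattern:
        while pos < len(sequence) and not set(elem) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


class SizeIndex:
    """Maps the size of a projected database to the sequences that produced it."""

    def __init__(self):
        self.buckets = defaultdict(list)

    def insert(self, s, size):
        self.buckets[size].append(s)

    def find_equivalent(self, s, size):
        """Return (relation of s to the stored sequence, stored sequence) or None."""
        for other in self.buckets.get(size, []):
            if is_subsequence(s, other):
                return ("sub-pattern", other)
            if is_subsequence(other, s):
                return ("super-pattern", other)
        return None


index = SizeIndex()
index.insert(["a"], 8)   # projected database of a has 8 items
index.insert(["af"], 4)  # projected database of (af) has 4 items

print(index.find_equivalent(["e"], 6))  # None
print(index.find_equivalent(["f"], 4))  # ('sub-pattern', ['af'])
```

The second lookup corresponds to the backward sub-pattern case: f is contained in (af) and both projected databases have 4 items, so the branch of f need not be searched.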



CloSpan (Cont.)

Example

ID  Sequence
0   (af)dea
1   eab
2   e(abf)(bde)

min_sup_count = 2
Hash function: $\mathcal{I}(D_s)$ mod 4

Frequent items: a:3, b:2, d:2, e:3, f:2
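The first scan of the example can be sketched as a per-sequence item count. The representation (lists of strings, one per element) and the name `frequent_items` are assumptions for illustration.

```python
from collections import Counter


def frequent_items(db, min_sup_count):
    """Items occurring in at least `min_sup_count` sequences, with their support."""
    counts = Counter()
    for seq in db:
        # A set, so that each item is counted once per sequence.
        counts.update({x for elem in seq for x in elem})
    return {x: c for x, c in sorted(counts.items()) if c >= min_sup_count}


DB = [["af", "d", "e", "a"], ["e", "a", "b"], ["e", "abf", "bde"]]
print(frequent_items(DB, 2))  # {'a': 3, 'b': 2, 'd': 2, 'e': 3, 'f': 2}
```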



CloSpan (Cont.)

Example (Cont.)

Projected databases of the frequent items:

Prefix  Projected database
a       (_f)dea, b, (_bf)(bde)
b       (_f)(bde)
d       ea, (_e)
e       a, ab, (abf)(bde)
f       dea, (bde)

Projected databases after removing locally infrequent items, in the order visited (X = empty):

Prefix  Projected database     Local frequent items     $\mathcal{I}(D_s)$  mod 4
a       (_f)de, b, (_f)(bde)   (_f):2, b:2, d:2, e:2    8                   0
(af)    de, (de)               d:2, e:2                 4                   0
(af)d   X                      X                        0                   0
(af)e   X                      X                        0                   0
ab      X                      X                        0                   0
ad      e, e                   e:2                      2                   2
ae      X                      X                        0                   0
b       X                      X                        0                   0
d       X                      X                        0                   0
e       a, ab, (ab)b           a:3, b:2                 6                   2
ea      b, b                   b:2                      2                   2
eab     X                      X                        0                   0
eb      X                      X                        0                   0
f       de, (de)               d:2, e:2                 4                   0

[Figure: step-by-step construction of the prefix search lattice and its hash table with buckets 0 to 3]

Final prefix search lattice (item with extension type: support):

<>
– a_s:3
  – f_i:2
    – d_s:2
    – e_s:2
  – b_s:2
– e_s:3
  – a_s:3
    – b_s:2

Closed sequences: (af)d:2, (af)e:2, eab:2, ea:3



Experimental Results

Synthetic Data
– Parameters
  D: Number of sequences (in thousands)
  C: Average itemsets per sequence
  T: Average items per itemset
  N: Number of different items (in thousands)
  S: Average itemsets in maximal sequences
  I: Average items in maximal sequences
– Two data sets
  D10 C10 T2.5 N10 S6 I2.5
  D5 C20 T20 N10 S20 I20

Real-world datasets
– KDDCup2000: Gazelle click stream



Experimental Results (Cont.)

Synthetic Data
– D10 C10 T2.5 N10 S6 I2.5 [Figure]
– D5 C20 T20 N10 S20 I20 [Figure]



Experimental Results (Cont.)

Real-world datasets
– KDDCup2000
  29,369 sequences
  35,722 sessions
  87,546 page views
  The average number of sessions in a sequence is around 1
  The average number of page views in a session is 2
  The largest session contains 342 views
  The longest sequence has 140 sessions
  The largest sequence contains 651 page views






Conclusions
– CloSpan mines frequent closed sequences efficiently.
– CloSpan outperforms PrefixSpan.



Lexicographic Order

Definition
– Lexicographic Order
  $t = \{i_1, i_2, \dots, i_k\},\ i_1 \le i_2 \le \dots \le i_k$
  $t' = \{j_1, j_2, \dots, j_l\},\ j_1 \le j_2 \le \dots \le j_l$
  t < t' iff either of the following is true:
  – For some h, $0 \le h \le \min\{k, l\}$, we have $i_r = j_r$ for r < h, and $i_h < j_h$, or
  – k < l, and $i_1 = j_1, i_2 = j_2, \dots, i_k = j_k$

Example
– (a,f) < (b,f)
– (a,b) < (a,b,c)
– (a,b,c) < (b,c)
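Python's built-in tuple comparison follows the same two rules (first differing position decides; otherwise the shorter tuple is smaller), so the three examples can be checked directly. The wrapper name `itemset_less` is illustrative.

```python
def itemset_less(t, t2):
    """Lexicographic order on itemsets given as sorted tuples of items."""
    return tuple(t) < tuple(t2)


print(itemset_less(("a", "f"), ("b", "f")))       # True
print(itemset_less(("a", "b"), ("a", "b", "c")))  # True
print(itemset_less(("a", "b", "c"), ("b", "c")))  # True
```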



Sequence Lexicographic Order

Definition
– Sequence Lexicographic Order
  If $s' = s \diamond p$, then s < s'
  If $s = \alpha \diamond_i p$ and $s' = \alpha \diamond_s p'$, then s < s' no matter what the order relation between p and p' is
  If $s = \alpha \diamond_i p$ and $s' = \alpha \diamond_i p'$, then p < p' indicates s < s'
  If $s = \alpha \diamond_s p$ and $s' = \alpha \diamond_s p'$, then p < p' indicates s < s'

Example
– (ab) < (ab)(a)
– (ac) < (a)(d), (ad) < (a)(c)
– (ab) < (ac)
– (a)(b) < (a)(c)
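One way to realise this order in code is to flatten a sequence into (extension type, item) tokens, with I-Step tokens sorting before S-Step tokens, and compare the token lists. This encoding is an assumption made for illustration, checked here only against the examples on the slide; sequences are lists of sorted strings, one per element.

```python
def order_key(s):
    """Tokens (step, item): step 1 for the first item of an element (S-Step),
    step 0 for every further item of the same element (I-Step)."""
    return [(0 if k else 1, item) for elem in s for k, item in enumerate(elem)]


def seq_less(s, t):
    """Sequence lexicographic order via list comparison of the token keys."""
    return order_key(s) < order_key(t)


print(seq_less(["ab"], ["ab", "a"]))     # True: s' extends s
print(seq_less(["ac"], ["a", "d"]))      # True: I-Step before S-Step
print(seq_less(["ad"], ["a", "c"]))      # True: regardless of item order
print(seq_less(["ab"], ["ac"]))          # True
print(seq_less(["a", "b"], ["a", "c"]))  # True
```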



Lexicographic Sequence Tree

Definition
– Lexicographic Sequence Tree

[Figure: lexicographic sequence tree. Nodes by level:]

<>
<(a)>  <(b)>
<(ab)>  <(a)(a)>  <(a)(b)>
<(ab)(a)>  <(ab)(b)>  <(a)(bc)>  <(a)(bd)>



Search Space Pruning

Definition
– Common Prefix
  Given a sequence s and its projected database $D_s$:
  if $\exists \alpha$ such that $\alpha$ is a common prefix for all the sequences with the same extension type (either itemset-extension or sequence-extension) in $D_s$,
  then $\forall \beta$, if $s \diamond \beta$ is closed, $\alpha$ must be a prefix of $\beta$;
  for any other $\beta$ we need not search $s \diamond \beta$ and its descendants, only the branch of $s \diamond \alpha$
  Example
  – $D_s$ = {de(af), de(fg)}
  – Every sequence in $D_s$ has the prefix <de>, so $s \diamond \langle de \rangle$ has the same support as s: s is not closed, and it is unnecessary to extend $s \diamond \langle e \rangle$




CommonPrefix
– An intermediate algorithm
– Adopts the PrefixSpan framework plus the common prefix pruning technique
– Outperforms PrefixSpan




Definition
– Partial Order
  Given a sequence s and its projected database $D_s$:
  if among all the sequences in $D_s$ an item $\alpha$ always occurs before an item $\beta$ (either in the same itemset for all sequences in $D_s$ or in a different itemset, but not both), then $D_{s \diamond \alpha \diamond \beta} = D_{s \diamond \beta}$;
  $\forall \gamma$, $s \diamond \beta \diamond \gamma$ is not closed. Need not search any sequence in the branch of $s \diamond \beta$
  Example
  – Before projecting D into $D_a$, $D_b$, $D_d$, $D_e$, $D_f$
  – a always occurs before f in all the sequences
  – Need not search any sequence beginning with f
