CloSpan: Mining Closed Sequential Patterns in Large Datasets
Xifeng Yan, Jiawei Han and Ramin Afshar
Proceedings of 2003 SIAM International Conference on Data Mining (SDM'03), pp. 166-177, San Francisco, CA, May 2003.
Advisor: Professor Hsin-Hsi Chen
Reporter: Clarence Min-Chi Hsieh
Natural Language Processing Laboratory, Dept. of Computer Science and Info. Engineering, NTU
2006/01/10
Outline
– Introduction
– Search Space Pruning
– CloSpan
– Experimental Results
– Conclusions
Introduction
– An Apriori-like algorithm will generate a huge set of candidate sequences.
  Ex. With 1000 frequent sequences of length 1, there are 1000×1000 + (1000×999)/2 = 1,499,500 candidate sequences of length 2.
– Many scans of the database are needed during mining.
  Ex. For the sequential pattern {(abc)(abc)(abc)(abc)(abc)}, an Apriori-based method must scan the database at least 15 times.
– Long sequential patterns are difficult to mine.
  Ex. There is only a single sequence of length 100, and min_sup = 1.
  length-1 candidate sequences: 100, length-2: 14950, …, total = 2^100 − 1 ≈ 10^30
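The counts quoted on this slide can be checked with a few lines of arithmetic. The sketch below assumes the usual reading of the formula: n×n candidates of the form <x y> plus C(n, 2) candidates of the form <(x y)>. The function names are illustrative only.

```python
from math import comb


def length2_candidates(n: int) -> int:
    """Length-2 candidates built from n frequent items:
    n*n of the form <x y> plus C(n, 2) of the form <(x y)>."""
    return n * n + comb(n, 2)


def all_combinations(n: int) -> int:
    """Number of non-empty item combinations drawn from n distinct items."""
    return sum(comb(n, k) for k in range(1, n + 1))


print(length2_candidates(1000))            # the 1000-item example
print(length2_candidates(100))             # the length-100 example
print(all_combinations(100) == 2**100 - 1)
```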
… and Sequential Pattern
– A sequence: <(ef)(ab)(df)cb>
– Elements: items within an element are listed alphabetically
– <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
– Given support threshold min_sup_count = 2, <(ab)c> is a sequential pattern
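Subsequence containment and support counting can be sketched as follows. The four-sequence database is an assumption made for illustration (the slide does not list one), and `parse`, `is_subsequence` and `support` are hypothetical helper names.

```python
from typing import FrozenSet, List

Seq = List[FrozenSet[str]]  # a sequence is a list of elements (itemsets)


def parse(text: str) -> Seq:
    """Parse a compact string such as 'a(bc)dc' into a list of itemsets."""
    seq, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            seq.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            seq.append(frozenset(text[i]))
            i += 1
    return seq


def is_subsequence(sub: Seq, seq: Seq) -> bool:
    """Greedy matching: each element of `sub` must be a subset of some
    element of `seq`, with the matched elements in increasing order."""
    pos = 0
    for element in sub:
        while pos < len(seq) and not element <= seq[pos]:
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True


def support(pattern: Seq, database: List[Seq]) -> int:
    """Number of database sequences that contain `pattern`."""
    return sum(is_subsequence(pattern, s) for s in database)


# Assumed example database, not taken from the slide.
database = [parse(s) for s in
            ["a(abc)(ac)d(cf)", "(ad)c(bc)(ae)", "(ef)(ab)(df)cb", "eg(af)cbc"]]

print(is_subsequence(parse("a(bc)dc"), parse("a(abc)(ac)d(cf)")))
print(support(parse("(ab)c"), database))
```

With this assumed database, <(ab)c> is contained in two sequences, so it meets min_sup_count = 2.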
– $\alpha$: an item
– I-Step extension
  $s \diamond_i \alpha = \langle t_1, t_2, \ldots, t_m \cup \{\alpha\} \rangle$
  Ex: <(ae)> is an I-Step extension of <(a)>
– S-Step extension
  $s \diamond_s \alpha = \langle t_1, t_2, \ldots, t_m, \{\alpha\} \rangle$
  Ex: <(a)(e)> is an S-Step extension of <(a)>
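The two extension types can be illustrated with a small sketch. It represents a sequence as a list of tuples, each holding the items of one itemset in alphabetical order. The requirement that an I-Step item sorts after every item already in the last itemset is an assumption made here so that itemsets stay sorted. `i_step` and `s_step` are hypothetical names.

```python
from typing import List, Tuple

Seq = List[Tuple[str, ...]]


def i_step(seq: Seq, item: str) -> Seq:
    """I-Step: add `item` to the last itemset of `seq`."""
    *head, last = seq
    # Assumption: the new item must sort after every item in the last itemset.
    if not all(k < item for k in last):
        raise ValueError("item must sort after every item in the last itemset")
    return head + [last + (item,)]


def s_step(seq: Seq, item: str) -> Seq:
    """S-Step: append a new itemset that contains only `item`."""
    return seq + [(item,)]


print(i_step([("a",)], "e"))  # <(ae)>
print(s_step([("a",)], "e"))  # <(a)(e)>
```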
Search Space Pruning
Definition
– Common Prefix
  Example
  – $D_s$ = {de(af), de(fg)}
  – <de> is a common prefix of every sequence in $D_s$, so $s \diamond \langle e \rangle$ is not closed and it is unnecessary to extend $s \diamond \langle e \rangle$
– Partial Order
  Example
  – Before projecting $D$ into $D_a$, $D_b$, $D_d$, $D_e$, $D_f$
  – a is always before f in all the sequences
  – Need not search any sequence beginning with f
Search Space Pruning (Cont.)
Definition
– Early Termination by Equivalence
  Two sequences $s$ and $s'$ with $s \sqsubseteq s'$ and $\mathcal{I}(D_s) = \mathcal{I}(D_{s'})$, where $\mathcal{I}(D)$ is the total number of items in $D$
  Then $\forall \gamma$, $support(s \diamond \gamma) = support(s' \diamond \gamma)$
  Example
  – $\mathcal{I}(D_{(af)}) = \mathcal{I}(D_f)$
  – (af)d and (af)e are frequent
  – support((af)d) = support(fd)
  – support((af)e) = support(fe)
  – the supports of fd and fe are therefore known without searching them
– For sequences $s < s'$ with $s \sqsubseteq s'$ and $\mathcal{I}(D_s) = \mathcal{I}(D_{s'})$: transplant the descendants of $s$ to $s'$ instead of searching any descendant of $s'$ in the prefix search tree
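The equivalence test can be illustrated on a toy database. This is a sketch under stated assumptions: the three sequences are invented for illustration, the projected database keeps the postfix after the leftmost match of the prefix, and `project`, `projected_db`, `size` and `support` are hypothetical helper names.

```python
from typing import List, Optional, Tuple

Seq = List[Tuple[str, ...]]  # each itemset is a tuple in alphabetical order


def project(seq: Seq, prefix: Seq) -> Optional[Seq]:
    """Postfix of `seq` after the leftmost match of a non-empty `prefix`,
    or None when `seq` does not contain `prefix`."""
    pos = 0
    for element in prefix:
        while pos < len(seq) and not set(element) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return None
        pos += 1
    last = pos - 1
    # Items of the matched itemset that sort after the last prefix item
    # are still available for an I-Step extension.
    rest = tuple(x for x in seq[last] if x > prefix[-1][-1])
    return ([rest] if rest else []) + seq[last + 1:]


def projected_db(db: List[Seq], prefix: Seq) -> List[Seq]:
    out = []
    for seq in db:
        postfix = project(seq, prefix)
        if postfix is not None:  # an empty postfix still counts as a match
            out.append(postfix)
    return out


def size(db: List[Seq]) -> int:
    """Total number of items in a (projected) database."""
    return sum(len(element) for seq in db for element in seq)


def support(db: List[Seq], pattern: Seq) -> int:
    return len(projected_db(db, pattern))


# Assumed toy database in which f only ever occurs together with a.
db = [[("a", "f"), ("d",), ("e",)],
      [("a", "f"), ("e",)],
      [("b",), ("a", "f"), ("d",)]]

d_af = projected_db(db, [("a", "f")])
d_f = projected_db(db, [("f",)])
print(size(d_af), size(d_f))
print(support(db, [("a", "f"), ("d",)]), support(db, [("f",), ("d",)]))
```

Because the two projected databases have the same size here, every extension of <f> has the same support as the corresponding extension of <(af)>.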
CloSpan
CloSpan($s$, $D_s$, min_sup, $L$)
– Input: A sequence $s$, a projected DB $D_s$, and min_sup
– Output: The prefix search lattice $L$
– Check whether a discovered sequence $s'$ exists s.t. either $s \sqsubseteq s'$ or $s' \sqsubseteq s$, and $\mathcal{I}(D_s) = \mathcal{I}(D_{s'})$;
– if such super-pattern or sub-pattern exists then modify the link in $L$, return;
– else insert $s$ into $L$;
– scan $D_s$ once, find every frequent item $\alpha$ such that $s$ can be extended to $(s \diamond_i \alpha)$, or $s$ can be extended to $(s \diamond_s \alpha)$;
– if no valid $\alpha$ available then return;
– for each valid $\alpha$ do (I-Step) Call CloSpan($s \diamond_i \alpha$, $D_{s \diamond_i \alpha}$, min_sup, $L$);
– for each valid $\alpha$ do (S-Step) Call CloSpan($s \diamond_s \alpha$, $D_{s \diamond_s \alpha}$, min_sup, $L$);
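The recursion above can be approximated by a much simpler sketch. Note what it is and is not: it grows patterns by I-Step and S-Step extensions as in the procedure, but it counts support against the full database rather than projected databases, and it replaces the lattice check and link modification with a post-hoc closedness filter. It is not CloSpan's pruning, only a small reference for what the output should contain. The database, `mine_frequent` and `closed_only` are assumptions for illustration, and min_sup is assumed to be at least 1.

```python
from typing import Dict, List, Tuple

Itemset = Tuple[str, ...]
Pattern = Tuple[Itemset, ...]


def contains(seq, pattern) -> bool:
    """True if `pattern` is a subsequence of `seq` (greedy matching)."""
    pos = 0
    for element in pattern:
        while pos < len(seq) and not set(element) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True


def mine_frequent(db: List[List[Itemset]], min_sup: int) -> Dict[Pattern, int]:
    """Enumerate all frequent sequences by pattern growth (min_sup >= 1)."""
    items = sorted({x for seq in db for element in seq for x in element})
    found: Dict[Pattern, int] = {}

    def grow(pattern: Pattern) -> None:
        for item in items:
            candidates = []
            if pattern and item > pattern[-1][-1]:
                # I-Step: add the item to the last itemset
                candidates.append(pattern[:-1] + (pattern[-1] + (item,),))
            # S-Step: append the item as a new itemset
            candidates.append(pattern + ((item,),))
            for cand in candidates:
                sup = sum(contains(seq, cand) for seq in db)
                if sup >= min_sup:
                    found[cand] = sup
                    grow(cand)

    grow(())
    return found


def closed_only(found: Dict[Pattern, int]) -> Dict[Pattern, int]:
    """Keep patterns that have no proper super-pattern with equal support."""
    return {p: sup for p, sup in found.items()
            if not any(q != p and sup_q == sup and contains(q, p)
                       for q, sup_q in found.items())}


# Assumed toy database.
db = [[("a", "b"), ("c",)], [("a", "b"), ("c",)], [("a",), ("c",)]]
frequent = mine_frequent(db, min_sup=2)
closed = closed_only(frequent)
print(len(frequent), closed)
```

On this toy database seven sequences are frequent but only <a c> and <(ab) c> are closed, which is the kind of reduction closed-pattern mining aims for.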
Experimental Results
Synthetic Data
– Parameters
  D: Number of sequences (in thousands)
  C: Average itemsets per sequence
  T: Average items per itemset
  N: Number of different items (in thousands)
  S: Average itemsets in maximal sequences
  I: Average items in maximal sequences
– Two Data Sets
  D10 C10 T2.5 N10 S6 I2.5
  D5 C20 T20 N10 S20 I20
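The data set names encode the parameter values directly. A small parser, sketched here with a hypothetical function name, makes the encoding explicit.

```python
import re


def parse_dataset_name(name: str) -> dict:
    """Split a name such as 'D10 C10 T2.5 N10 S6 I2.5' into its parameters."""
    pairs = re.findall(r"([DCTNSI])\s*(\d+(?:\.\d+)?)", name)
    return {key: float(value) for key, value in pairs}


first = parse_dataset_name("D10 C10 T2.5 N10 S6 I2.5")
second = parse_dataset_name("D5 C20 T20 N10 S20 I20")

# D and N are given in thousands.
print(int(first["D"] * 1000), "sequences,", first["T"], "items per itemset")
print(int(second["D"] * 1000), "sequences,", second["C"], "itemsets per sequence")
```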
Real world datasets
– KDDCup2000 – Gazelle Click Stream
  29,369 sequences
  35,722 sessions
  87,546 page views
  The average number of sessions in a sequence is around 1
  The average number of page views in a session is 2
  The largest session contains 342 views
  The longest sequence has 140 sessions
  The largest sequence contains 651 page views
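The two averages follow from the three totals quoted for the data set. A quick check, using only the numbers on the slide:

```python
sequences = 29_369
sessions = 35_722
page_views = 87_546

sessions_per_sequence = sessions / sequences
views_per_session = page_views / sessions

# Both ratios round to the figures quoted on the slide (1 and 2).
print(round(sessions_per_sequence, 2), round(views_per_session, 2))
```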
For itemsets $t = \{i_1, i_2, \ldots, i_k\}$ and $t' = \{j_1, j_2, \ldots, j_l\}$, $t < t'$ iff either of the following is true:
– For some $h$, $0 \le h \le \min\{k, l\}$, we have $i_r = j_r$ for $r < h$, and $i_h < j_h$, or
– $k < l$, and $i_1 = j_1$, $i_2 = j_2$, …, $i_k = j_k$
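The two rules can be written out directly. The sketch below assumes each itemset is a tuple of items in alphabetical order. Under that assumption the order coincides with Python's built-in tuple comparison, which the last line checks. `itemset_less` is a hypothetical name.

```python
from typing import Tuple


def itemset_less(t: Tuple[str, ...], u: Tuple[str, ...]) -> bool:
    """Itemset lexicographic order on sorted tuples of items."""
    k, l = len(t), len(u)
    for h in range(min(k, l)):
        if t[h] != u[h]:
            # Rule 1: equal before position h, and smaller at position h
            return t[h] < u[h]
    # Rule 2: t is a proper prefix of u
    return k < l


pairs = [(("a", "b"), ("a", "c")),
         (("a", "b"), ("a", "b", "c")),
         (("a", "b"), ("a", "b")),
         (("b",), ("a", "c"))]

print([itemset_less(t, u) for t, u in pairs])
print(all(itemset_less(t, u) == (t < u) for t, u in pairs))
```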
Definition
– Sequence Lexicographic Order
  If $s' = s \diamond p$, then $s < s'$
  If $s = \alpha \diamond_i p$ and $s' = \alpha \diamond_s p'$, then no matter what the order relation between $p$ and $p'$ is, $s < s'$
  If $s = \alpha \diamond_i p$ and $s' = \alpha \diamond_i p'$, then $p < p'$ indicates $s < s'$
  If $s = \alpha \diamond_s p$ and $s' = \alpha \diamond_s p'$, then $p < p'$ indicates $s < s'$
Search Space Pruning
Definition
– Common Prefix
  Given a subsequence $s$ and its projected database $D_s$:
  if $\exists \alpha$ such that $\alpha$ is a common prefix for all the sequences with the same extension type (either itemset-extension or sequence-extension) in $D_s$, then $\forall \beta$, if $s \diamond \beta$ is closed, $\alpha$ must be a prefix of $\beta$; otherwise we need not search $s \diamond \beta$ and its descendants, except the branch of $s \diamond \alpha$
  Example
  – $D_s$ = {de(af), de(fg)}
  – <de> is a common prefix of every sequence in $D_s$, so $s \diamond \langle e \rangle$ is not closed and it is unnecessary to extend $s \diamond \langle e \rangle$
Search Space Pruning (Cont.)
Definition
– Partial Order
  Given a sequence $s$ and its projected database $D_s$:
  if among all the sequences in $D_s$ an item $\alpha$ always occurs before an item $\beta$ (either in the same itemset for all sequences in $D_s$ or in a different itemset, but not both), then $D_{s \diamond \beta} = D_{s \diamond \alpha \diamond \beta}$ and $s \diamond \beta$ is not closed. Need not search any sequence in the branch of $s \diamond \beta$
  Example
  – Before projecting $D$ into $D_a$, $D_b$, $D_d$, $D_e$, $D_f$
  – a is always before f in all the sequences
  – Need not search any sequence beginning with f
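The partial-order condition in the example can be tested before any projection. The sketch below covers only the different-itemset case: it checks that, in every sequence containing the later item, the earlier item appears in some itemset before the first occurrence of the later one. The toy database and the name `always_before` are assumptions for illustration.

```python
from typing import List, Tuple

Seq = List[Tuple[str, ...]]


def always_before(db: List[Seq], alpha: str, beta: str) -> bool:
    """True if every sequence containing `beta` has `alpha` in an itemset
    that comes before the first itemset containing `beta`."""
    for seq in db:
        first_beta = next((i for i, e in enumerate(seq) if beta in e), None)
        if first_beta is None:
            continue  # beta absent: nothing to check in this sequence
        if not any(alpha in e for e in seq[:first_beta]):
            return False
    return True


# Assumed toy database in which a precedes every f.
db = [[("a",), ("f",)],
      [("a", "b"), ("d",), ("f",)],
      [("e",), ("a",), ("f",)]]

print(always_before(db, "a", "f"))  # condition holds: the f branch can be skipped
print(always_before(db, "f", "a"))  # the reverse does not hold
```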