Efficient Itemset Generator Discovery over a Stream Sliding Window Chuancong Gao , Jianyong Wang Database Laboratory Department of Computer Science and Technology Tsinghua University, Beijing 100084, China C. Gao , J. Wang (Tsinghua Univ.) Efficient Itemset Generator Discovery over a Stream Sliding Window 1 / 28
CIKM 2009 - Efficient itemset generator discovery over a stream sliding window
Efficient Itemset Generator Discovery over a Stream Sliding Window
Chuancong Gao, Jianyong Wang
Database Laboratory, Department of Computer Science and Technology
Tsinghua University, Beijing 100084, China
Outline
- Introduction
  - What Is a Generator
  - Why We Need Generators
  - What We Have Done
- Related Work
- The StreamGen Algorithm
  - FP-Tree
  - Enumeration Tree
  - ADD and REMOVE Operations
- Extension for Mining Classification Rules
- Evaluation Results
- Conclusions
What Is a Generator
Example: Given the 4 transactions below, with a minimum support threshold (sup_min) of 2.

Transactions: ABC, AD, ABCD, ABD

Frequent itemsets and their supports, by size:
Ø : 4
A : 4, B : 3, C : 2, D : 3
AB : 3, AC : 2, AD : 3, BC : 2, BD : 2
ABC : 2, ABD : 2

[Figure: the itemset lattice above partitioned into equivalence classes, with the generator itemsets and the closed itemset of each class marked.]
What Is a Generator
Equivalence class: all the frequent itemsets contained in the same set of input transactions.
Closed itemset: the maximal itemset in an equivalence class.
Generator itemsets: the minimal itemsets in an equivalence class.
Characteristics:
- Same equivalence class ⇒ same input transactions ⇒ same data distribution ⇒ same support value and confidence value;
- No proper sub-itemset of a generator itemset lies in the same equivalence class;
- No proper super-itemset of a closed itemset lies in the same equivalence class;
- Each equivalence class has exactly one closed itemset, but one or more generator itemsets;
- An itemset can be both a generator itemset and a closed itemset.
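The definitions above can be checked by brute force on the four example transactions. The following is only an illustrative sketch, not the StreamGen algorithm: it enumerates every itemset, groups the frequent ones by the set of transactions containing them, and reports the minimal (generator) and maximal (closed) members of each group. The function names `tidset`, `equivalence_classes`, `summarize`, and `label` are hypothetical helpers introduced here.

```python
from itertools import combinations

# The four transactions from the example slide (IDs 1-4).
transactions = [set("ABC"), set("AD"), set("ABCD"), set("ABD")]
SUP_MIN = 2  # minimum support threshold from the example


def tidset(itemset):
    """Return the positions (0-based) of the transactions containing `itemset`."""
    return frozenset(i for i, t in enumerate(transactions) if itemset <= t)


def equivalence_classes():
    """Group every frequent itemset (including the empty set) by its tidset."""
    items = sorted(set().union(*transactions))
    classes = {}
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            tids = tidset(itemset)
            if len(tids) >= SUP_MIN:
                classes.setdefault(tids, []).append(itemset)
    return classes


def summarize(members):
    """Split one equivalence class into its generators and its closed itemset."""
    generators = [m for m in members if not any(o < m for o in members)]
    closed = [m for m in members if not any(o > m for o in members)]
    return generators, closed


def label(itemset):
    """Render an itemset as a string, using Ø for the empty set."""
    return "".join(sorted(itemset)) or "Ø"


for tids, members in equivalence_classes().items():
    generators, closed = summarize(members)
    print(len(tids), [label(g) for g in generators], "->", [label(c) for c in closed])
```

On this data the sketch finds five equivalence classes, with generators Ø, B, C, D, and BD and closed itemsets A, AB, ABC, AD, and ABD respectively. Exhaustive enumeration is exponential in the number of items, so it is suitable only for small illustrations like this one.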
Why We Need Generators
- Together with closed itemsets, they form a concise representation of equivalence classes;
- As classification rules / features:
  - Each equivalence class has at least one generator, sharing the same support and confidence as the other itemsets in the class;
  - Their number is much smaller than that of all frequent itemsets;
  - They are the shortest itemsets in an equivalence class;
  - Their average size tends to be the smallest;
  - They are preferred by the MDL (Minimum Description Length) principle.
What We Have Done
A novel algorithm to mine frequent generator itemsets over a stream sliding window.
Contributions:
- The first algorithm for mining frequent itemset generators over stream sliding windows;
- A novel enumeration tree structure and some effective optimization techniques;
- An extension to directly mine classification rules on a sliding window;
- An extensive performance study shows that StreamGen outperforms other algorithms performing similar tasks and achieves high classification accuracy.
Related Work
Itemset Mining Algorithms:
- Mining frequent patterns without candidate generation: A frequent-pattern tree approach. J. Han, J. Pei, Y. Yin, and R. Mao. Data Min. Knowl. Discov., 2004.
- Closet: An efficient algorithm for mining frequent closed itemsets. J. Pei, J. Han, and R. Mao. SIGMOD Workshop DMKD, 2000.
- Minimum description length principle: Generators are preferable to closed patterns. J. Li, H. Li, L. Wong, J. Pei, and G. Dong. AAAI, 2006.
- Mining statistically important equivalence classes and delta-discriminative emerging patterns. J. Li, G. Liu, and L. Wong. SIGKDD, 2007.
Related Work
Stream Itemset Mining Algorithms:
- Catch the moment: maintaining closed frequent itemsets over a data stream sliding window. Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz. Knowl. Inf. Syst., 2006.
Itemset-based Classification Algorithms:
- On mining instance-centric classification rules. J. Wang and G. Karypis. IEEE Trans. Knowl. Data Eng., 2006.
- Discriminative frequent pattern analysis for effective classification. H. Cheng, X. Yan, J. Han, and C.-W. Hsu. ICDE, 2007.
The StreamGen Algorithm
Details of our algorithm here.
Example: A running example of stream data containing 6 transaction itemsets, with a window size of 4.

ID  Itemset   Windows
1   A B C     #1
2   A D       #1, #2
3   A B C D   #1, #2, #3
4   A B D     #1, #2, #3
5   B C D     #2, #3
6   C D       #3

The timeline runs from ID 1 to ID 6: Window #1 covers IDs 1-4, Window #2 covers IDs 2-5, and Window #3 covers IDs 3-6.
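To make the running example concrete, the sketch below slides a window of size 4 over the six transactions and prints which IDs each window holds. It only illustrates how the window content changes when a transaction arrives and the oldest one leaves; it does not implement StreamGen's own ADD and REMOVE operations. The names `sliding_windows` and `item_supports` are hypothetical.

```python
from collections import Counter, deque

# The six transactions of the running example, in arrival order (IDs 1-6).
stream = ["ABC", "AD", "ABCD", "ABD", "BCD", "CD"]
WINDOW_SIZE = 4


def sliding_windows(transactions, size):
    """Yield the content of each full window as the stream advances."""
    window = deque()
    for tid, itemset in enumerate(transactions, start=1):
        window.append((tid, itemset))  # the newest transaction enters
        if len(window) > size:
            window.popleft()           # the oldest transaction leaves
        if len(window) == size:
            yield list(window)


def item_supports(window):
    """Count in how many transactions of the window each single item occurs."""
    counts = Counter()
    for _, itemset in window:
        counts.update(set(itemset))
    return dict(counts)


windows = list(sliding_windows(stream, WINDOW_SIZE))
for number, window in enumerate(windows, start=1):
    ids = [tid for tid, _ in window]
    print(f"Window #{number}: IDs {ids}, item supports {item_supports(window)}")
```

Recounting supports from scratch for every window, as done here, is the naive baseline; an incremental miner would instead update its structures once per arriving and once per expiring transaction.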
A Few Basic Theorems
Theorem: A frequent itemset S is a generator iff there exists no subset of size |S| − 1 having the same support as S.
Hint: This gives an easy check for whether an itemset is a generator.
Theorem: Any subset of a generator is also a generator.
Theorem: Any superset of an unpromising itemset must be either unpromising or infrequent.
Hint: These two theorems help define the border between generators and non-generators, and form the foundation for the enumeration tree.
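The first theorem reduces the generator test to comparing an itemset's support with that of its subsets obtained by dropping a single item. A minimal sketch of that check on the four transactions of the first window follows; `is_generator_by_definition` compares against every proper subset and is included only to cross-check the shortcut. All function names here are hypothetical.

```python
from itertools import combinations

# Transactions of the first sliding window (IDs 1-4).
transactions = [set("ABC"), set("AD"), set("ABCD"), set("ABD")]
SUP_MIN = 2


def support(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)


def is_generator(itemset, sup_min=SUP_MIN):
    """Frequent, and no subset with exactly one item removed has the same support."""
    itemset = frozenset(itemset)
    sup = support(itemset)
    if sup < sup_min:
        return False
    return all(support(itemset - {item}) != sup for item in itemset)


def is_generator_by_definition(itemset, sup_min=SUP_MIN):
    """Reference check: frequent, and no proper subset at all has the same support."""
    itemset = frozenset(itemset)
    sup = support(itemset)
    if sup < sup_min:
        return False
    proper_subsets = (
        frozenset(c) for n in range(len(itemset)) for c in combinations(itemset, n)
    )
    return all(support(s) != sup for s in proper_subsets)


for candidate in ["", "A", "B", "AB", "BD", "ABD", "CD"]:
    print(repr(candidate), support(candidate), is_generator(candidate))
```

The shortcut needs only |S| support lookups instead of 2^|S| − 1. On this window it marks Ø, B, C, D, and BD as generators, and every subset of BD is itself a generator, as the second theorem requires.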
FP-Tree
A modified FP-Tree is used to store and compress the transactions in each sliding window.
Example: FP-Tree of the first sliding window in the previous example (transactions 1: ABC, 2: AD, 3: ABCD, 4: ABD).
[Figure: the modified FP-Tree for the first window, with nodes labeled item : count, a head table over the items A, B, C, D, and an ID table over the transaction IDs 1-4.]
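As a rough illustration of how a prefix tree can compress a window and still support both insertion and deletion, here is a plain FP-tree-style sketch. It is not the modified FP-Tree of the slides: it assumes alphabetical item order and omits the ID table, and the class and method names (`Node`, `PrefixTree`, `add`, `remove`, `support`) are hypothetical.

```python
class Node:
    """One tree node: an item, its count, and links to parent and children."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


class PrefixTree:
    """Plain FP-tree-style prefix tree supporting insertion and deletion."""

    def __init__(self):
        self.root = Node(None)
        self.header = {}  # item -> list of nodes carrying that item

    def add(self, transaction):
        """Insert a transaction, sharing the prefix with earlier ones."""
        node = self.root
        for item in sorted(transaction):  # assumption: alphabetical item order
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                self.header.setdefault(item, []).append(child)
            child.count += 1
            node = child

    def remove(self, transaction):
        """Delete a transaction that was added earlier, pruning empty nodes."""
        node = self.root
        path = []
        for item in sorted(transaction):
            node = node.children[item]
            node.count -= 1
            path.append(node)
        for node in reversed(path):       # prune from the leaf upward
            if node.count == 0:
                del node.parent.children[node.item]
                self.header[node.item].remove(node)

    def support(self, item):
        """Support of a single item, summed over its nodes via the header table."""
        return sum(n.count for n in self.header.get(item, []))


tree = PrefixTree()
for t in ["ABC", "AD", "ABCD", "ABD"]:    # first sliding window
    tree.add(t)
print({item: tree.support(item) for item in "ABCD"})

tree.remove("ABC")                         # transaction 1 expires
tree.add("BCD")                            # transaction 5 arrives
print({item: tree.support(item) for item in "ABCD"})
```

Removing a transaction here requires knowing its items; a per-transaction ID table like the one in the slide's figure is one plausible way to locate an expiring transaction's path without storing the window separately.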
Enumeration Tree
The enumeration tree maintains the information of the mined generators and the border between generators and non-generators.
3 types of nodes:
- Infrequent node;
- Unpromising node;
- Generator node.
A hash table is kept for each level of the enumeration tree to accelerate the checking operation.
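The three node types and the per-level hash tables can be sketched as follows. This is a from-scratch construction for a single static window, not the incremental maintenance performed by StreamGen. It assumes that an "unpromising" node is a frequent itemset that is not a generator, that items are ordered alphabetically, and that only generator nodes are expanded; the names `classify` and `build_levels` are hypothetical.

```python
# Transactions of the first sliding window (IDs 1-4).
transactions = [set("ABC"), set("AD"), set("ABCD"), set("ABD")]
SUP_MIN = 2
ITEMS = sorted(set().union(*transactions))


def support(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def classify(itemset, sup, previous_level):
    """Label a node using the hash table of the level directly above it."""
    if sup < SUP_MIN:
        return "infrequent"
    for item in itemset:
        entry = previous_level.get(itemset - {item})
        # A missing or non-generator subset, or one with equal support,
        # means this itemset cannot be a generator.
        if entry is None or entry[0] != "generator" or entry[1] == sup:
            return "unpromising"
    return "generator"


def build_levels():
    """Return one dict per tree level: itemset -> (node type, support)."""
    root = frozenset()
    root_sup = support(root)
    levels = [{root: (classify(root, root_sup, {}), root_sup)}]
    while True:
        current, nxt = levels[-1], {}
        for itemset, (kind, _) in current.items():
            if kind != "generator":
                continue  # border node: its supersets are never generators
            last = max(itemset, default="")
            for item in ITEMS:
                if item > last:  # extend only with later items
                    child = itemset | {item}
                    sup = support(child)
                    nxt[child] = (classify(child, sup, current), sup)
        if not nxt:
            return levels
        levels.append(nxt)


for depth, table in enumerate(build_levels()):
    for itemset, (kind, sup) in table.items():
        print(depth, "".join(sorted(itemset)) or "Ø", sup, kind)
```

Under these assumptions the first window yields generator nodes Ø, B, C, D, and BD, unpromising nodes A and BC, and the infrequent node CD. Each subset lookup is a single dictionary access in the table of the previous level, which is the role the per-level hash tables play in the description above.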
Enumeration Tree
Example: Enumeration tree of the first sliding window with minimum support 2.