Ambiguous Frequent Itemset Mining and Polynomial Delay Enumeration
May 25, 2008, PAKDD 2008
Takeaki Uno (1), Hiroki Arimura (2)
(1) National Institute of Informatics, JAPAN (The Graduate University for Advanced Studies)
(2) Hokkaido University, JAPAN
(1-a) the ratio of included items is ≥ θ
→ loses monotonicity; in the worst case no subset may be frequent
→ several heuristic-search-based algorithms
(1-b) at most k items are not included
→ satisfies monotonicity, but then very many small itemsets are frequent
→ maximal enumeration, or complete enumeration with small k
(Example figure: transactions {1,2}, {2,3}, {1,3}; density threshold θ = 66%)
Related works 2
(2) find pairs of an itemset and a transaction set such that few of them violate the inclusion
→ equivalent to finding a dense submatrix, or a dense bicluster
→ so many equivalent patterns will be found
→ mainly, heuristic search for finding one such dense substructure
• ambiguity on the transaction set: an itemset can have many partner transaction sets
We introduce a new model for (2) to avoid redundancy, and propose an efficient depth-first-search-type algorithm
(Figure: transaction-item inclusion matrix, with items and transactions as the two axes)
Average Inclusion
• inclusion ratio of transaction t for itemset P ⇔ |t ∩ P| / |P|
• average inclusion ratio of transaction set T for P ⇔ the average of the inclusion ratio over all transactions in T
  = ∑_{t∈T} |t ∩ P| / ( |P| × |T| )
equivalent to dense submatrix/subgraph of transaction-item inclusion matrix/graph
• For a density threshold θ, the maximum co-occurrence size cov(P) of itemset P ⇔ the maximum size of a transaction set whose average inclusion ratio for P is ≥ θ
(Example figure: a database with transactions {1,3,4}, {2,4,5}, {1,2}, annotated with inclusion ratios 50%, 50%, 66%)
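The two definitions above can be evaluated directly. A minimal sketch (illustrative code with our own names, not the authors' implementation), using a small database like the one in the example:

```python
# Inclusion ratio and average inclusion ratio, computed directly from
# the definitions above (illustrative sketch, not the authors' code).

def inclusion_ratio(t, P):
    # |t ∩ P| / |P|: the fraction of itemset P that transaction t contains
    return len(t & P) / len(P)

def avg_inclusion_ratio(T, P):
    # sum over t in T of |t ∩ P|, divided by |P| * |T|
    return sum(len(t & P) for t in T) / (len(P) * len(T))

T = [{1, 3, 4}, {2, 4, 5}, {1, 2}]
P = {1, 2}
print([inclusion_ratio(t, P) for t in T])  # [0.5, 0.5, 1.0]
print(avg_inclusion_ratio(T, P))           # 4 / 6 ≈ 0.667
```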
Problem Definition
• For a density threshold θ, the maximum co-occurrence size cov(P) of itemset P ⇔ the maximum size of a transaction set whose average inclusion ratio for P is ≥ θ
• Ambiguous frequent itemset: an itemset P such that cov(P) ≥ σ (σ: minimum support)
• Ambiguous frequent itemsets are not monotone!
Ambiguous frequent itemset enumeration: the problem of outputting all ambiguous frequent itemsets for a given database D, density threshold θ, and minimum support σ
The goal is to develop an efficient algorithm for this problem
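Under this definition, cov(P) can be computed by a greedy scan: the best transaction set of size k is always the k transactions with the highest inclusion ratios, so cov(P) is the largest k whose top-k average is still ≥ θ. A small sketch (function names are ours, not the paper's):

```python
def cov(db, P, theta):
    # Sort inclusion ratios |t ∩ P| / |P| in decreasing order; cov(P) is
    # the largest prefix length whose average stays >= theta.
    ratios = sorted((len(t & P) / len(P) for t in db), reverse=True)
    best, total = 0, 0.0
    for k, r in enumerate(ratios, start=1):
        total += r
        if total / k >= theta:
            best = k
    return best

def is_ambiguous_frequent(db, P, theta, sigma):
    return cov(db, P, theta) >= sigma

db = [{1, 3, 4}, {2, 4, 5}, {1, 2}]
print(cov(db, {1, 2}, 0.66))  # 3: all three transactions average 2/3 >= 0.66
```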
Hardness for Branch-and-Bound
• A straightforward approach to this problem is branch-and-bound
• In each iteration, divide the problem into two non-empty subproblems by the inclusion of an item
Checking the existence of an ambiguous frequent itemset is NP-complete (Theorem 1)
Is This Really Hard?
• We proved NP-hardness for "very dense graphs"
→ unclear for moderately dense graphs
→ polynomial-time enumeration is not impossible
(Figure: hardness over the range of θ: easy at θ = 1 and θ = 0, hard in the very dense case, open in between)
polynomial time in (input size) + (output size)
Efficient Algorithm: Idea of Reverse Search
• We don't use branch-and-bound, but reverse search
• Define an acyclic parent-child relation on all objects to be found
• Recursively find children to search, thus an algorithm for finding all children is sufficient
→ depth-first search on the rooted tree induced by the relation
(Figure: the objects arranged in this rooted tree)
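The reverse-search scheme can be sketched generically: given a root object and a routine that lists all children under the parent-child relation, a DFS outputs every object exactly once (and with polynomial delay whenever the children routine runs in polynomial time). The names and the toy relation below are ours, for illustration only:

```python
def reverse_search(root, children):
    # DFS on the rooted tree induced by the parent-child relation;
    # each object is output exactly once, when first visited.
    stack = [root]
    while stack:
        obj = stack.pop()
        yield obj
        stack.extend(children(obj))

# Toy relation: the children of itemset S are S ∪ {e} for items e larger
# than every item of S -- this enumerates all subsets of {1, 2, 3},
# since each nonempty subset has a unique parent (drop its largest item).
items = [1, 2, 3]
children = lambda S: [S | {e} for e in items if not S or e > max(S)]
found = list(reverse_search(frozenset(), children))
print(len(found))  # 8 subsets of {1, 2, 3}
```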
Neighboring Relation
• AmbiOcc(P) of an ambiguous frequent itemset P ⇔ the lexicographically minimum set among the transaction sets whose average inclusion ratio is ≥ θ and whose size is cov(P)
• e*(P): the item e in P such that the number of transactions in AmbiOcc(P) including e is minimum (ties are broken by taking the minimum index)
Efficient Computation of cov's
• For efficient computation, we classify the transactions by inclusion ratio
• When we compute cov(P∪e), we compute the intersection of each group with Occ(e)
→ the inclusion ratio increases for the transactions included in Occ(e)
→ by moving such transactions, the classification for P∪e is obtained
• This task, for all items, is done efficiently by Delivery, which takes O(||G||) time, where ||G|| is the sum of the transaction sizes in group G
→ computation of cov(P∪e) can be done in linear time
(Figure: transactions grouped by the number of missed items: 0 misses, 1 miss, 2 misses, 3 misses, 4 misses, 5 misses)
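The classification step can be sketched as follows: transactions are bucketed by how many items of P they miss, and adding item e keeps every transaction in Occ(e) in its bucket while moving the rest down one bucket. The data layout and names here are assumptions for illustration, not the paper's code:

```python
from collections import defaultdict

def update_groups(groups, occ_e):
    # groups: {number of missed items: set of transaction ids}
    # occ_e:  ids of the transactions containing the new item e
    new = defaultdict(set)
    for miss, tids in groups.items():
        hit = tids & occ_e
        if hit:
            new[miss] |= hit          # e is included: miss count unchanged
        rest = tids - hit
        if rest:
            new[miss + 1] |= rest     # e is missed: one more miss
    return dict(new)

groups = {0: {0, 2}, 1: {1}}   # misses with respect to the current P
occ_e = {0, 1}                 # transactions containing e
result = update_groups(groups, occ_e)
print(result)                  # transaction 0 stays; 1 and 2 miss one item each
```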
Computing AmbiOcc and e*
• Computation of AmbiOcc(P∪e) needs a greedy choice of transactions, in decreasing order of (inclusion ratio, index)
• Computation of e*(P∪e) needs the intersection of AmbiOcc(P∪e) and Occ(i) for each i∈P → Delivery
→ needs O(||D||) time in the worst case
• However, when cov(P) is small, not so many transactions are scanned, thus we expect that the average computation time is not so long
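The greedy choice can be sketched directly: scan transactions in decreasing order of inclusion ratio, breaking ties by smaller index as stated above, and keep the longest prefix whose average stays ≥ θ. The function name is ours, as a sketch of the idea rather than the paper's implementation:

```python
def ambi_occ(db, P, theta):
    # Order transaction indices by decreasing inclusion ratio, ties broken
    # by smaller index; the longest prefix with average ratio >= theta is
    # a transaction set of size cov(P).
    order = sorted(range(len(db)),
                   key=lambda i: (-len(db[i] & P) / len(P), i))
    total, best = 0.0, 0
    for k, i in enumerate(order, start=1):
        total += len(db[i] & P) / len(P)
        if total / k >= theta:
            best = k
    return sorted(order[:best])

db = [{1, 3, 4}, {2, 4, 5}, {1, 2}]
print(ambi_occ(db, {1, 2}, 0.7))  # [0, 2]: transactions 2 and 0 are chosen
```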