Mining Frequent Patterns from Large Windows over Data Streams. Barzan Mozafari, Hetal Thakkar, and Carlo Zaniolo. ICDE 2008, Cancun, Mexico.

Verifying and Mining Frequent Patterns from Large Windows over Data Streams

Jan 30, 2016


Verifying and Mining Frequent Patterns from Large Windows over Data Streams. Barzan Mozafari, Hetal Thakkar, and Carlo Zaniolo. ICDE 2008 Cancun, Mexico. Finding Frequent Patterns for Association Rule Mining. - PowerPoint PPT Presentation
Transcript
Page 1: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

Verifying and Mining Frequent Patterns from Large Windows over Data Streams

Barzan Mozafari, Hetal Thakkar, and Carlo Zaniolo

ICDE 2008, Cancun, Mexico

Page 2: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

Finding Frequent Patterns for Association Rule Mining

Given a set of transactions T and a support threshold s, find all patterns with support >= s

Apriori [Agrawal’ 94], FP-growth [Han’ 00]

Fast & light algorithms for data streams: more than 30 proposals [Jiang’ 06]

For mining windows over streams: in particular, DSMSs divide windows into panes, a.k.a. slides, as in our Stream Mill Miner system
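The problem statement on this slide (find all patterns with support >= s) can be pinned down with a brute-force sketch. This is not Apriori or FP-growth; the helper names `support` and `frequent_patterns`, the toy transactions, and the reading of `s` as an absolute count are assumptions made for illustration only.

```python
from itertools import combinations

def support(pattern, transactions):
    """Number of transactions that contain every item of `pattern`."""
    p = frozenset(pattern)
    return sum(1 for t in transactions if p <= set(t))

def frequent_patterns(transactions, s):
    """All non-empty itemsets with support >= s, found by brute force.

    Exponential in the number of items; it only illustrates the definition.
    """
    items = sorted({i for t in transactions for i in t})
    result = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            c = support(cand, transactions)
            if c >= s:
                result[cand] = c
    return result

# Hypothetical toy data.
T = [("a", "b", "c"), ("a", "b"), ("a", "c"), ("b", "c"), ("a", "b", "c")]
patterns = frequent_patterns(T, s=3)
```

With this toy data every single item and every pair reaches the threshold, while the triple (a, b, c) occurs only twice and is dropped.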

Page 3: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

Moment (Maintaining Closed Frequent Itemsets over a Stream Sliding Window)

Yun Chi, Haixun Wang, Philip S. Yu, Richard R. Muntz

Collaboration of UCLA + IBM

Page 4: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

Closed Enumeration Tree (CET)

Very similar to an FP-tree, except that it keeps a dynamic set of itemsets: closed frequent itemsets and boundary itemsets

Page 5: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

Moment Algorithm (I)

Hope: In the absence of concept drifts, not many changes in status

Maintains two types of boundary nodes:

1. Freq / non-freq

2. Closed / non-closed

Taking specific actions to maintain a shifting boundary whenever a concept shift occurs

Page 6: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

Moment Algorithm (II)

Infrequent gateway nodes: infrequent + parent is frequent + result of a candidate join

Unpromising gateway nodes: frequent + prefix of a closed itemset w/ same support

Intermediate nodes: frequent + has a child w/ same support + not unpromising

Closed nodes: closed and frequent

Page 7: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

Moment Algorithm (III)

Increments: Add/Delete to/from CET upon arrival/expiration of each transaction.

Downside: Batch operations not applicable, suffers from big slide sizes

Advantage: Efficient for small slides

Page 8: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

CanTree [Leung’ 05]

Use a fixed canonical order according to decreasing single-item frequency.

Use a single-round version of FP-growth

Algorithm:

Upon each window move:

Add/Remove new/expired transactions to/from the FP-tree (using the same item order)

Run FP-growth! (Without any pruning)
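The add/remove step can be sketched as a prefix tree whose paths always follow one fixed item order, so that inserting a new transaction and deleting an expired one touch only a single path. The class and method names below are hypothetical, the item order is simply passed in rather than derived from item frequencies, and the FP-growth mining step is omitted.

```python
class Node:
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}

class CanonicalTree:
    """Prefix tree whose paths follow one fixed item order (assumed given)."""

    def __init__(self, order):
        self.rank = {item: r for r, item in enumerate(order)}
        self.root = Node()

    def _path(self, transaction):
        # Sort the transaction's items by the canonical order.
        return sorted(set(transaction), key=self.rank.__getitem__)

    def add(self, transaction):
        node = self.root
        for item in self._path(transaction):
            node = node.children.setdefault(item, Node(item))
            node.count += 1

    def remove(self, transaction):
        """Undo an earlier add(); assumes the transaction is in the tree."""
        node = self.root
        for item in self._path(transaction):
            child = node.children[item]
            child.count -= 1
            if child.count == 0:
                # Only this transaction used the subtree, so drop it whole.
                del node.children[item]
                return
            node = child

# Hypothetical window move: three transactions arrive, then "abd" expires.
tree = CanonicalTree(order="abcd")
for t in ["abc", "abd", "bc"]:
    tree.add(t)
tree.remove("abd")
```

Because the order never changes, an expired transaction maps to exactly the path it was inserted on, which is what makes removal this simple.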

Page 9: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

CanTree (cont.)

Pros: Very efficient for large slides

Cons: Inefficient for small slides

Not scalable for large windows: needs memory for the entire window

Page 10: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

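Frequent Patterns Mining over Data Streams

Challenges: computation, storage, real-time response, customization, integration with the DSMS

[Figure: slides S4 ... S7 of the stream and windows W4 and W5; the slide leaving the window is marked Expired and the slide entering it is marked New]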

Page 11: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

Frequent Patterns Mining over Data Streams

Difficult problem: [Chi’ 04, Leung’ 05, Cheung’ 03, Koh’ 04, …]

Mining each window from scratch is too expensive: subsequent windows have many frequent patterns in common

Updating frequent patterns on every new tuple is also too expensive

SWIM’s middle-road approach: incrementally maintain frequent patterns over sliding windows

Desiderata: scalability with slide size and window size

Page 12: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

SWIM (Sliding Window Incremental Miner)

If pattern p is frequent in a window, it must be frequent in at least one of its slides -- keep a union of the frequent patterns of all slides (PT)

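[Figure: SWIM when the window moves from W4 to W5 over slides S4 ... S7, with S4 marked Expired and S7 marked New. Before the move, PT = F4 U F5 U F6. Steps shown: count/update frequencies, mine the new slide with the mining algorithm, add F7 to PT, count/update frequencies, prune PT. After the move, PT = F5 U F6 U F7.]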

Page 13: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

SWIM

For each new slide Si:

Find all frequent patterns in Si (using FP-growth)

Verify the frequency of these new patterns in each window slide, either immediately or with delay (< N slides)

Trade-off: max delay vs. computation.

No false negatives or false positives!
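The per-slide loop above can be sketched as follows, under several simplifying assumptions: a brute-force `mine` stands in for FP-growth, naive counting stands in for the verifiers, new patterns are verified immediately (no delay), and the threshold `alpha` is taken as a relative support. The class name `Swim` and all helper names are hypothetical.

```python
from collections import deque
from itertools import combinations

def count(pattern, slide):
    p = set(pattern)
    return sum(1 for t in slide if p <= set(t))

def mine(slide, alpha):
    """Stand-in for FP-growth: itemsets with relative support >= alpha."""
    items = sorted({i for t in slide for i in t})
    need = alpha * len(slide)
    return {c for k in range(1, len(items) + 1)
            for c in combinations(items, k) if count(c, slide) >= need}

class Swim:
    def __init__(self, n_slides, alpha):
        self.n, self.alpha = n_slides, alpha
        self.slides = deque()   # (slide, F_i) pairs currently in the window
        self.pt = {}            # pattern -> deque of per-slide counts

    def push(self, slide):
        """Move the window by one slide; return its frequent patterns."""
        new_f = mine(slide, self.alpha)
        # Patterns new to PT are verified against slides already in the window.
        for p in new_f.difference(self.pt):
            self.pt[p] = deque(count(p, s) for s, _ in self.slides)
        self.slides.append((slide, new_f))
        for p, counts in self.pt.items():
            counts.append(count(p, slide))
        if len(self.slides) > self.n:          # expire the oldest slide
            self.slides.popleft()
            for counts in self.pt.values():
                counts.popleft()
        # Prune PT: keep patterns frequent in at least one remaining slide.
        alive = set().union(*(f for _, f in self.slides))
        self.pt = {p: c for p, c in self.pt.items() if p in alive}
        size = sum(len(s) for s, _ in self.slides)
        return {p: sum(c) for p, c in self.pt.items()
                if sum(c) >= self.alpha * size}

# Hypothetical stream of three slides, window of two slides.
swim = Swim(n_slides=2, alpha=0.5)
r1 = swim.push([("a", "b"), ("a",), ("b",), ("a", "b")])
r2 = swim.push([("c",), ("c",), ("a", "c"), ("b",)])
r3 = swim.push([("c",), ("a", "c"), ("c",), ("c",)])
```

Since every reported count is exact over the slides in the window, and any window-frequent pattern is frequent in some slide and hence present in PT, the sketch reports neither false positives nor false negatives on this data.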

Page 14: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

SWIM – Design Choices

Data structure for the Si’s: FP-tree [Han’ 00]

Data structure for PT: FP-tree

Mining algorithm: FP-growth

Count/Update frequencies: Naïve? Hash-tree?

Counting is the bottleneck

New and improved counting method named Conditional Counting

Page 15: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

Conditional Counting

Verification: given a set of transactions T, a set of patterns P, and a threshold s

Goal: find the exact freq of each p in P w.r.t. T, if and only if its freq is >= s

If s=0, verification = counting, but if s>0 extra computation can be avoided

Proposed fast verifiers: DTV, DFV, hybrid
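To make the contract of a verifier concrete, here is a naive baseline (not DTV, DFV, or the hybrid): it returns the exact count of a pattern only when that count reaches s, and stops scanning as soon as the threshold has become unreachable. The function name `verify`, the use of `None` for "below s", and the toy data are assumptions for illustration.

```python
def verify(patterns, transactions, s):
    """Map each pattern to its exact count if count >= s, else to None."""
    out = {}
    n = len(transactions)
    for p in patterns:
        need, c = set(p), 0
        for seen, t in enumerate(transactions, start=1):
            if need <= set(t):
                c += 1
            if c + (n - seen) < s:
                break            # s can no longer be reached: skip the rest
        out[tuple(p)] = c if c >= s else None
    return out

# Hypothetical toy data.
T = [("a", "b", "c"), ("a", "b"), ("a", "c"), ("b", "c"), ("a", "b", "c")]
checked = verify([("a", "b"), ("a", "b", "c")], T, s=3)
counted = verify([("a", "b", "c")], T, s=0)   # s = 0 degenerates to counting
```

For the pattern (a, b, c) with s=3 the scan stops after the fourth transaction, which is the kind of saving that s>0 permits; with s=0 every transaction must be examined.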

Page 16: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

Conditionalization on FP-trees

FP-tree FP-tree | g FP-tree | gd
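Conditionalizing on an item keeps, for every transaction containing that item, only the items that precede it in the header-table order. The sketch below works on raw transactions rather than on an FP-tree, which is a simplification; the function names and the six-transaction dataset are hypothetical, with the data chosen so that the result for g matches the conditional pattern base of g given in the DTV example in this deck.

```python
from collections import Counter

ORDER = "abcdefgh"   # header-table item order

def conditional_pattern_base(transactions, item, order=ORDER):
    """Prefixes that precede `item`, with the number of times each occurs."""
    rank = {x: r for r, x in enumerate(order)}
    base = Counter()
    for t in transactions:
        if item in t:
            prefix = tuple(sorted(
                (x for x in set(t) if rank[x] < rank[item]),
                key=rank.__getitem__))
            if prefix:
                base[prefix] += 1
    return base

def expand(base):
    """Turn a pattern base back into a list of (prefix) transactions."""
    return [p for p, c in base.items() for _ in range(c)]

# Hypothetical dataset.
T = ["abcde", "abcdf", "abcdg", "abcdg", "abcg", "begh"]
base_g = conditional_pattern_base(T, "g")            # corresponds to FP-tree | g
base_gd = conditional_pattern_base(expand(base_g), "d")   # FP-tree | gd
```

Each conditionalization step shrinks the data to the prefixes that can still contribute to patterns ending in the chosen items, which is where the pruning comes from.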

Page 17: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

Attempt I: DTV (Double-Tree Verifier)

Not only conditionalize the FP-tree, but also the pattern tree

Page 18: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

[Figure: DTV worked example, conditionalizing on item g. Each panel is drawn with its header table (items a to h, or a to f for the conditional trees). Panels:

Initial pattern tree: nodes root, b, d, e, f, g, g, all counts unknown (?)

Pattern tree | “g”: nodes root, b, d, all counts unknown (?)

Pattern tree | “g”, after verification against the FP-tree: root:4, b:3, d:2

Filling the original pattern tree using reverse pointers: the two g nodes become g:2 and g:4, the other counts are still unknown

FP-tree: nodes a:5, b:5, c:5, d:4, e:1, f:1, g:2, b:1, e:1, g:1, h:1, g:1

FP-tree | g: nodes a:3, b:3, c:3, d:2, b:1, e:1

Conditional pattern base of “g”: (a:2, b:2, c:2, d:2), (a:1, b:1, c:1), (b:1, e:1)]

Page 19: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

DTV (cont.)

Scales up well on large trees: much pruning from conditionalization

However, for smaller trees: less pruning, and the overhead of conditionalization is not always worth it

Page 20: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

Attempt II: DFV (Depth-First Verifier)

Each node n in PT corresponds to a unique pattern pn; therefore, for each node n in PT:

Traverse the FP-tree in depth-first order and count the occurrences of pn

Keep the nodes marked as FAIL/OK while visiting their children

Utilize these marks for optimized execution

More efficient when both trees are small
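A much simplified sketch of the depth-first idea: it counts one pattern at a time by walking an FP-tree depth-first, and abandons a branch as soon as the path order shows the next wanted item cannot appear below it. It does not share FAIL/OK marks across the nodes of a pattern tree as the slide describes; the function names, the plain prefix tree used as FP-tree, and the dataset are assumptions for illustration.

```python
class Node:
    def __init__(self, item=None):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, order):
    """Prefix tree with counts; paths follow the given item order."""
    rank = {x: r for r, x in enumerate(order)}
    root = Node()
    for t in transactions:
        node = root
        for x in sorted(set(t), key=rank.__getitem__):
            node = node.children.setdefault(x, Node(x))
            node.count += 1
    return root, rank

def count_pattern(root, rank, pattern):
    """Depth-first count of the transactions that contain `pattern`."""
    want = sorted(set(pattern), key=rank.__getitem__)
    if not want:
        raise ValueError("pattern must be non-empty")

    def dfs(node, i):
        total = 0
        for child in node.children.values():
            if child.item == want[i]:
                if i + 1 == len(want):
                    total += child.count      # OK: pattern complete on this path
                else:
                    total += dfs(child, i + 1)
            elif rank[child.item] < rank[want[i]]:
                total += dfs(child, i)        # wanted item may appear deeper
            # Otherwise FAIL: paths are ordered, so want[i] cannot occur below.
        return total

    return dfs(root, 0)

# Hypothetical dataset.
T = ["abcde", "abcdf", "abcdg", "abcdg", "abcg", "begh"]
root, rank = build_fp_tree(T, "abcdefgh")
n_bdg = count_pattern(root, rank, "bdg")
n_g = count_pattern(root, rank, "g")
```

The early FAIL on ordered paths is cheap and needs no extra trees, which is consistent with the slide's point that this style of verification suits small trees.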

Page 21: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

DFV (cont.)

Page 22: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

DFV (cont.)

Page 23: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

Comparing Verifiers

Page 24: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

Hybrid Verifier

Start by performing DTV recursively until the resulting trees are small enough, then perform DFV

Page 25: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

Comparing Verifiers

Page 26: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

Verifiers vs. Hash Trees (Counting)

Page 27: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

SWIM with Hybrid Verifier (I)

Page 28: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

SWIM with Hybrid Verifier (II)

Page 29: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

Applications of Verifiers (I)

Improving counting in static mining methods: candidate-generation (and pruning) phase

Example: Toivonen approach [Toivonen’ 96]

1. Maintain a boundary of smallest non-frequent and largest frequent patterns

2. Check the frequency of boundary patterns

Page 30: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

Applications of Verifiers (II)

In case resources are limited:

1. Mine once

2. Keep monitoring the current patterns (by verifying them)

Since verifying is computationally cheaper

3. Whenever a significant concept shift is detected, mine again!
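The three steps can be sketched as a monitoring loop. The shift criterion used here (re-mine when at least a fraction `shift_ratio` of the monitored patterns is no longer frequent) is an assumption made for this example, as are the function names, the brute-force `mine` standing in for a real miner, and the toy batches.

```python
from itertools import combinations

def count(pattern, batch):
    p = set(pattern)
    return sum(1 for t in batch if p <= set(t))

def mine(batch, alpha):
    """Stand-in miner: itemsets with relative support >= alpha."""
    items = sorted({i for t in batch for i in t})
    need = alpha * len(batch)
    return {c for k in range(1, len(items) + 1)
            for c in combinations(items, k) if count(c, batch) >= need}

def verify(patterns, batch, alpha):
    """Subset of `patterns` that is still frequent in `batch`."""
    need = alpha * len(batch)
    return {p for p in patterns if count(p, batch) >= need}

def monitor(batches, alpha, shift_ratio=0.5):
    """Mine once, then verify; re-mine when too many patterns are lost."""
    current, log = None, []
    for batch in batches:
        if current is None:
            current, remined = mine(batch, alpha), True      # step 1
        else:
            still = verify(current, batch, alpha)            # step 2
            lost = 1 - len(still) / len(current) if current else 1.0
            remined = lost >= shift_ratio
            if remined:
                current = mine(batch, alpha)                 # step 3
        log.append(remined)
    return current, log

# Hypothetical batches: the third one no longer contains a or b.
batches = [["ab", "ab", "a", "b"], ["ab", "a", "b", "ab"], ["c", "c", "cd", "d"]]
final, log = monitor(batches, alpha=0.5)
```

On the second batch only the cheap verification runs; the third batch loses every monitored pattern, so the loop mines again.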

Page 31: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

Monitoring/Concept Shift Detection

Verification is much faster than mining (when it suffices)

Page 32: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

Privacy Preserving Applications

Random noise methods: add many fake items into the transactions to increase the variance [Evfimievski’ 03]

Overhead: long transactions (on the order of the number of items)

Lemma: the max depth of the recursion in DTV is <= the max length of the patterns to be verified

Run-time independent of the transaction length

Page 33: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

Optimization when integrated into a DSMS

Stream Mill Miner (SMM) provides integrated support for online mining algorithms through User-Defined Aggregates (UDAs)

Definition of Mining Models

Constraints used for optimization: max allowed delay, interesting/uninteresting items, interesting/uninteresting patterns

These are turned from post-conditions into pre-conditions

Page 34: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

Conclusions

1. SWIM for incremental mining over large windows

More efficient than existing approaches on data streams

Trade-off between real-time response, efficiency, memory, etc.

2. Efficient algorithms for verification/conditional counting: DTV, DFV, and Hybrid

These can be used to speed up many applications: incremental mining, enhancing static algorithms, privacy-preserving techniques, …

Implementations of SWIM and the verifiers available at

http://wis.cs.ucla.edu/swim/index.htm

Page 35: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

References

[Agrawal’ 94] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules in large databases,” in VLDB, 1994, pp. 487–499.

[Cheung’ 03] W. Cheung and O. R. Zaiane, “Incremental mining of frequent patterns without candidate generation or support constraint,” in IDEAS, 2003.

[Chi’ 04] Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz, “Moment: Maintaining closed frequent itemsets over a stream sliding window,” in ICDM, November 2004.

[Evfimievski’ 03] A. Evfimievski, J. Gehrke, and R. Srikant, “Limiting privacy breaches in privacy preserving data mining,” in PODS, 2003.

[Han’ 00] J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without candidate generation,” in SIGMOD, 2000.

[Koh’ 04] J. Koh and S. Shieh, “An efficient approach for maintaining association rules based on adjusting FP-tree structures,” in DASFAA, 2004.

[Leung’ 05] C.-S. Leung, Q. Khan, and T. Hoque, “CanTree: A tree structure for efficient incremental mining of frequent patterns,” in ICDM, 2005.

[Toivonen’ 96] H. Toivonen, “Sampling large databases for association rules,” in VLDB, 1996, pp. 134–145.

Page 36: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

Thank you!

Questions?

Page 37: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

DFV (cont.)

Page 38: Verifying and Mining Frequent Patterns from Large Windows  over Data Streams

DFV (cont.)