Mining Frequent Closed Graphs on Evolving Data Streams
A. Bifet, G. Holmes, B. Pfahringer and R. Gavaldà
University of Waikato, Hamilton, New Zealand
Laboratory for Relational Algorithmics, Complexity and Learning (LARCA), UPC-Barcelona Tech, Catalonia
San Diego, 24 August 2011
17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2011
Mining Frequent Closed Graphs on Evolving Data Streams
Graph mining is a challenging task by itself, and even more so when processing data streams which evolve in real-time. Data stream mining faces hard constraints regarding time and space for processing, and also needs to provide for concept drift detection. In this talk we present a framework for studying graph pattern mining on time-varying streams and large datasets.
Transcript
Mining Frequent Closed Graphs on Evolving Data Streams
A. Bifet, G. Holmes, B. Pfahringer and R. Gavaldà
University of Waikato, Hamilton, New Zealand
Laboratory for Relational Algorithmics, Complexity and Learning (LARCA), UPC-Barcelona Tech, Catalonia
San Diego, 24 August 2011
17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2011
Mining Evolving Graph Data Streams

Problem: Given a data stream D of graphs, find the frequent closed graphs.
[Table: three example transactions, each a small molecular graph over the atoms C, S, N and O.]
We provide three algorithms of increasing power: Incremental, Sliding Window, and Adaptive.
Non-incremental Frequent Closed Graph Mining

- CloseGraph (Xifeng Yan, Jiawei Han): right-most extension based on depth-first search; based on gSpan (ICDM'02)
- MoSS (Christian Borgelt, Michael R. Berthold): breadth-first search; based on MoFa (ICDM'02)
[Figure omitted. Source: IDC's Digital Universe Study (EMC), June 2011]
Data Streams

- The sequence is potentially infinite
- High volume of data
- High speed of arrival
- Once an element from a data stream has been processed, it is discarded or archived
Graph Coresets

Relative support of a closed graph: the support of a graph minus the relative supports of its closed supergraphs. The sum of the closed supergraphs' relative supports of a graph and its relative support is equal to its own support.

(s, δ)-coreset for the problem of computing closed graphs: a weighted multiset of frequent δ-tolerance closed graphs with minimum support s, using their relative support as a weight.
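The relative-support definition above is recursive: a pattern's relative support depends on the relative supports of its closed supergraphs. As an illustration only (not the authors' code), the recursion can be sketched with plain string names standing in for graph patterns:

```python
# Sketch: computing relative supports of closed patterns from their absolute
# supports, given the closed-supergraph relation. Pattern names are
# hypothetical stand-ins for actual graphs.

def relative_supports(support, supergraphs):
    """support: pattern -> absolute support.
    supergraphs: pattern -> list of its closed supergraphs.
    Returns pattern -> relative support, where
    relative(g) = support(g) - sum of relative(h) over closed supergraphs h."""
    rel = {}

    def rel_of(g):
        if g not in rel:
            rel[g] = support[g] - sum(rel_of(h) for h in supergraphs.get(g, []))
        return rel[g]

    for g in support:
        rel_of(g)
    return rel

# Toy example: "C-C-S-N" has closed supergraphs "C-C-S-N-O" and "C-C-S-N-N".
support = {"C-C-S-N": 4, "C-C-S-N-O": 3, "C-C-S-N-N": 1}
supers = {"C-C-S-N": ["C-C-S-N-O", "C-C-S-N-N"]}
rel = relative_supports(support, supers)
```

Note how the property from the slide holds: the relative supports of the supergraphs (3 and 1) plus the pattern's own relative support (0) sum to its support (4).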
Graph Dataset

[Table: four example transactions, each a small molecular graph over C, S, N and O, each with weight 1.]
Graph Coresets
[Coreset contents: three graph patterns, each with relative support 3 and support 3.]

Table: Example of a coreset with minimum support 50% and δ = 1
Graph Coresets
Figure: Number of graphs in a (40%, δ)-coreset for NCI.
INCGRAPHMINER(D, min_sup)

Input: A graph dataset D, and min_sup.
Output: The frequent graph set G.

G ← ∅
for every batch bt of graphs in D
    do C ← CORESET(bt, min_sup)
       G ← CORESET(G ∪ C, min_sup)
return G
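The incremental loop above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the authors' implementation: patterns are hashable stand-ins for closed graphs, and coreset() is reduced to merging weighted pattern counts and keeping those that meet the minimum support.

```python
# Sketch of the incremental miner loop with a simplified CORESET helper.
from collections import Counter

def coreset(weighted_patterns, min_sup):
    """Merge (pattern, weight) pairs and keep those with weight >= min_sup."""
    merged = Counter()
    for pattern, weight in weighted_patterns:
        merged[pattern] += weight
    return [(p, w) for p, w in merged.items() if w >= min_sup]

def inc_graph_miner(batches, min_sup):
    """Process the stream batch by batch, maintaining a running summary G."""
    g = []
    for batch in batches:
        c = coreset(batch, min_sup)   # summarize the new batch
        g = coreset(g + c, min_sup)   # merge into the running summary
    return dict(g)
```

Filtering each batch before merging mirrors the δ-tolerance approximation: patterns infrequent within a batch are discarded early, keeping the summary small.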
WINGRAPHMINER

WINGRAPHMINER(D, W, min_sup)

Input: A graph dataset D, a window size W, and min_sup.
Output: The frequent graph set G.

G ← ∅
for every batch bt of graphs in D
    do C ← CORESET(bt, min_sup)
       Store C in the sliding window
       if the sliding window is full
          then R ← oldest C stored in the sliding window, with all support values negated
          else R ← ∅
       G ← CORESET(G ∪ C ∪ R, min_sup)
return G
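The key idea in the sliding-window variant is that merging the oldest batch summary back with negated supports cancels its contribution. A minimal sketch, again with hashable stand-in patterns and a merge-and-filter in place of the real graph coreset:

```python
# Sketch of the sliding-window miner: evicted batch summaries are re-merged
# with negative supports so their counts are subtracted from the summary G.
from collections import Counter, deque

def merge(weighted_patterns):
    merged = Counter()
    for pattern, weight in weighted_patterns:
        merged[pattern] += weight
    return merged

def win_graph_miner(batches, window_size, min_sup):
    g = Counter()
    window = deque()
    for batch in batches:
        c = merge(batch)
        window.append(c)
        r = Counter()
        if len(window) > window_size:        # window full: evict oldest batch
            oldest = window.popleft()
            r = Counter({p: -w for p, w in oldest.items()})  # negate supports
        g = merge(list(g.items()) + list(c.items()) + list(r.items()))
        g = Counter({p: w for p, w in g.items() if w >= min_sup})
    return dict(g)
```

After processing, G reflects only the batches currently inside the window, without ever re-mining them.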
Algorithm ADaptive Sliding WINdow (ADWIN)

Example: W = 101010110111111. ADWIN checks every split of W into W = W0 · W1, growing W0 one element at a time (W0 = 1, W0 = 10, ..., W0 = 101010110), and drops elements from the tail of W whenever some split yields two subwindows with averages that differ by at least εc.

ADWIN: Adaptive Windowing Algorithm

Initialize window W
for each t > 0
    do W ← W ∪ {xt} (i.e., add xt to the head of W)
       repeat drop elements from the tail of W
       until |µW0 − µW1| < εc holds for every split of W into W = W0 · W1
Output µW
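The split-and-drop loop can be sketched directly, trading away the paper's bucket-based O(log W) data structure for clarity. This naive version stores the whole window and rechecks every split, which is only practical for small windows; the threshold follows the Hoeffding-style form used by ADWIN.

```python
# Illustrative ADWIN-style cut check (not the efficient bucket-based version):
# keep a window of 0/1 values and drop old elements while some split into
# W0 (older) and W1 (recent) has averages differing by at least eps_cut.
import math
from collections import deque

def eps_cut(n0, n1, delta, n):
    """Hoeffding-style cut threshold for subwindows of sizes n0 and n1."""
    m = 1.0 / (1.0 / n0 + 1.0 / n1)
    delta_prime = delta / n
    return math.sqrt((1.0 / (2 * m)) * math.log(4.0 / delta_prime))

class SimpleAdwin:
    def __init__(self, delta=0.01):
        self.delta = delta
        self.window = deque()   # oldest elements on the left

    def add(self, x):
        """Add x; shrink the window while a significant split exists.
        Returns True if the window shrank (change detected)."""
        self.window.append(x)
        shrunk = False
        changed = True
        while changed and len(self.window) >= 2:
            changed = False
            n = len(self.window)
            for i in range(1, n):                  # try every split point
                w0 = list(self.window)[:i]
                w1 = list(self.window)[i:]
                mu0 = sum(w0) / len(w0)
                mu1 = sum(w1) / len(w1)
                if abs(mu0 - mu1) >= eps_cut(len(w0), len(w1), self.delta, n):
                    self.window.popleft()          # drop from the old end
                    shrunk = True
                    changed = True
                    break
        return shrunk
```

Feeding a run of 0's followed by a run of 1's makes the window shrink once the two regimes become statistically distinguishable, after which the window average tracks the recent data.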
Algorithm ADaptive Sliding WINdow

Theorem. At every time step we have:

1. (False positive rate bound.) If µt remains constant within W, the probability that ADWIN shrinks the window at this step is at most δ.
2. (False negative rate bound.) Suppose that for some partition of W into two parts W0W1 (where W1 contains the most recent items) we have |µW0 − µW1| > 2εc. Then with probability 1 − δ, ADWIN shrinks W to W1, or shorter.

ADWIN tunes itself to the data stream at hand, with no need for the user to hardwire or precompute parameters.
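The slides do not spell out the threshold εc. In the original ADWIN paper (Bifet and Gavaldà, SDM'07), the cut threshold for comparing subwindows of lengths n0 and n1 (out of n = n0 + n1 points in W) takes the form:

```latex
\varepsilon_c = \sqrt{\frac{1}{2m}\,\ln\frac{4}{\delta'}},
\qquad
m = \frac{1}{1/n_0 + 1/n_1},
\qquad
\delta' = \frac{\delta}{n}
```

Here m is half the harmonic mean of the two subwindow sizes, and the δ/n correction accounts for the multiple splits tested per step.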
Algorithm ADaptive Sliding WINdow

ADWIN, using a data stream sliding window model:

- can provide the exact counts of 1's in O(1) time per point
- tries O(log W) cut points
- uses O((1/ε) log W) memory words
- has a processing time per example of O(log W) (amortized and worst-case)
ADAGRAPHMINER

ADAGRAPHMINER(D, Mode, min_sup)

G ← ∅
Init ADWIN
for every batch bt of graphs in D
    do C ← CORESET(bt, min_sup)
       R ← ∅
       if Mode is Sliding Window
          then Store C in the sliding window
               if ADWIN detected change
                  then R ← batches to remove in the sliding window, with negative support
       G ← CORESET(G ∪ C ∪ R, min_sup)
       if Mode is Sliding Window
          then Insert # closed graphs into ADWIN
          else for every g in G update g's ADWIN
return G
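The adaptive miner's control flow (summarize each batch, merge into G, monitor a statistic for change, and cancel old batches with negated supports on drift) can be sketched as below. This is an illustration only: the change detector here is a naive mean-shift placeholder rather than ADWIN, and the drift_threshold parameter is hypothetical.

```python
# Sketch of the adaptive miner: on detected change, the oldest half of the
# stored batch summaries is cancelled by merging with negated supports.
from collections import Counter, deque

def coreset(counts, min_sup):
    """Simplified CORESET: keep patterns meeting the minimum support."""
    return Counter({p: w for p, w in counts.items() if w >= min_sup})

def ada_graph_miner(batches, min_sup, drift_threshold=0.5):
    g = Counter()
    window = deque()   # stored batch summaries
    history = []       # detector input: number of frequent patterns per step
    for batch in batches:
        c = Counter(dict(batch))
        window.append(c)
        r = Counter()
        if len(history) >= 2 and \
           abs(history[-1] - history[-2]) > drift_threshold * max(history[-2], 1):
            # change detected: cancel the oldest half of the stored batches
            for _ in range(len(window) // 2):
                old = window.popleft()
                r.update({p: -w for p, w in old.items()})
        g = coreset(g + c + r, min_sup)   # Counter addition drops <= 0 counts
        history.append(len(g))
    return dict(g)
```

In the real algorithm the monitored statistic is fed to ADWIN, which decides both whether and how far to cut back, with the guarantees stated in the theorem above.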
Clustering Experimental Setting

Internal measures               External measures
Gamma                           Rand statistic
C Index                         Jaccard coefficient
Point-Biserial                  Fowlkes-Mallows Index
Log Likelihood                  Hubert Γ statistic
Dunn's Index                    Minkowski score
Tau                             Purity
Tau A                           van Dongen criterion
Tau C                           V-measure
Somer's Gamma                   Completeness
Ratio of Repetition             Homogeneity
Modified Ratio of Repetition    Variation of information
Adjusted Ratio of Clustering    Mutual information
Fagan's Index                   Class-based entropy
Deviation Index                 Cluster-based entropy
Z-Score Index                   Precision
D Index                         Recall
Silhouette coefficient          F-measure

Table: Internal and external clustering evaluation measures.
Cluster Mapping Measure

Hardy Kremer, Philipp Kranen, Timm Jansen, Thomas Seidl, Albert Bifet, Geoff Holmes and Bernhard Pfahringer. An Effective Evaluation Measure for Clustering on Evolving Data Streams. KDD'11.

CMM: Cluster Mapping Measure. A novel evaluation measure for stream clustering on evolving streams.