Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams

Post on 08-May-2015


DESCRIPTION

Talk about tree mining on evolving data streams.

Transcript

Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams

Albert Bifet and Ricard Gavaldà

Universitat Politècnica de Catalunya

14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’08)

2008 Las Vegas, USA

Data Streams
  Sequence is potentially infinite
  High amount of data: sublinear space
  High speed of arrival: sublinear time per example

Tree Mining
  Mining frequent trees is becoming an important task
  Applications: chemical informatics, computer vision, text retrieval, bioinformatics, Web analysis.
  Many link-based structures may be studied formally by means of unordered trees

Introduction: Data Streams

Data Streams
  Sequence is potentially infinite
  High amount of data: sublinear space
  High speed of arrival: sublinear time per example
  Once an element from a data stream has been processed it is discarded or archived

Example
Puzzle: Finding Missing Numbers
  Let π be a permutation of {1, ..., n}.
  Let π_{-1} be π with one element missing.
  π_{-1}[i] arrives in increasing order

Task: Determine the missing number


Use an n-bit vector to memorize all the numbers (O(n) space)


Data Streams: O(log(n)) space. Store

$\frac{n(n+1)}{2} - \sum_{j \le i} \pi_{-1}[j]$.
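The running-sum idea on this slide can be sketched in a few lines. This is only an illustration: the function name `find_missing` and its arguments are made up for the example, and the stream is assumed to contain each number of {1, ..., n} exactly once except the missing one.

```python
def find_missing(stream, n):
    """Return the number from 1..n that never shows up in `stream`.

    Only one counter is kept: n(n+1)/2 minus the sum of the elements
    seen so far. The counter needs O(log n) bits, in contrast with the
    n-bit vector of the naive solution.
    """
    remaining = n * (n + 1) // 2
    for x in stream:          # each element is read once and then dropped
        remaining -= x
    return remaining


if __name__ == "__main__":
    print(find_missing(iter([1, 2, 4, 5]), n=5))
```

Because only the sum is used, the sketch does not depend on the order in which the elements arrive.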

Introduction: Trees

Our trees are:
  Unlabeled
  Ordered and Unordered

Our subtrees are:
  Induced

[Figure: two different ordered trees but the same unordered tree]
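To make the ordered/unordered distinction concrete, here is a minimal sketch. It assumes an encoding that is not from the talk: an unlabeled rooted tree is a tuple of its child subtrees, and a leaf is the empty tuple. The helper `canonical` is a hypothetical name; it sorts children recursively so that all orderings of the same unordered tree map to one representative.

```python
def canonical(tree):
    """Return a canonical form of an unlabeled rooted tree.

    A tree is a tuple of child subtrees (a leaf is ()). Sorting the
    canonical forms of the children at every node erases sibling order.
    """
    return tuple(sorted(canonical(child) for child in tree))


def same_unordered_tree(a, b):
    """True when a and b differ at most in the order of siblings."""
    return canonical(a) == canonical(b)


# Root with two children: a leaf and a node with one leaf child,
# written in the two possible sibling orders.
left_heavy = (((),), ())
right_heavy = ((), ((),))

if __name__ == "__main__":
    print(left_heavy == right_heavy)                    # compared as ordered trees
    print(same_unordered_tree(left_heavy, right_heavy)) # compared as unordered trees
```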

Introduction

Induced subtrees: obtained by repeatedly removing leaf nodes

Embedded subtrees: obtained by contracting some of the edges

Introduction

What Is Tree Pattern Mining?

Given a dataset of trees, find the complete set of frequent subtrees

Frequent Tree Pattern (FS):

Include all the trees whose support is no less than min_sup

Closed Frequent Tree Pattern (CS):

Include no tree which has a super-tree with the same support

CS ⊆ FS
Closed Frequent Tree Mining provides a compact representation of frequent trees without loss of information
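The FS and CS definitions can be illustrated with a small sketch. To keep it short, patterns here are itemsets (`frozenset`) rather than trees, so that "super-pattern" is simply proper superset; this substitution, the function names, and the toy supports are assumptions of the example, not part of the talk.

```python
def frequent_patterns(support, min_sup):
    """FS: every pattern whose support is no less than min_sup."""
    return {p for p, s in support.items() if s >= min_sup}


def closed_patterns(support, min_sup):
    """CS: frequent patterns with no proper super-pattern of equal support.

    `p < q` on frozensets tests for a proper subset, which stands in for
    the "q is a super-tree of p" test used with trees.
    """
    frequent = frequent_patterns(support, min_sup)
    return {
        p for p in frequent
        if not any(p < q and support[q] == support[p] for q in support)
    }


# Toy supports: {b} always occurs together with {a, b}, so {b} is not closed.
support = {
    frozenset("a"): 3,
    frozenset("b"): 2,
    frozenset("ab"): 2,
    frozenset("c"): 1,
}

if __name__ == "__main__":
    print(frequent_patterns(support, min_sup=2))
    print(closed_patterns(support, min_sup=2))
```

The support of a non-closed frequent pattern can be read off its smallest closed super-pattern, which is the sense in which the closed set loses no information.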

Introduction

Unordered Subtree Mining

[Figure: the trees A and B of the dataset and the subtrees X and Y]

D = {A, B}, min_sup = 2

# Closed Subtrees: 2
# Frequent Subtrees: 9

Closed Subtrees: X, Y

Frequent Subtrees: [figure showing the frequent subtrees]

Introduction

Problem
Given a data stream D of rooted, unlabelled and unordered trees, find frequent closed trees.

We provide three algorithms, of increasing power:
  Incremental
  Sliding Window
  Adaptive

Outline

1 Introduction

2 Data Streams

3 ADWIN : Concept Drift Mining

4 Adaptive Closed Frequent Tree Mining

5 Summary

Data Streams

Data Streams
At any time t in the data stream, we would like the per-item processing time and storage to be simultaneously $O(\log^k(N, t))$.

Approximation algorithms
  Small error rate with high probability
  An algorithm $(\epsilon, \delta)$-approximates $F$ if it outputs $\tilde{F}$ for which $\Pr[|\tilde{F} - F| > \epsilon F] < \delta$.

Data Streams Approximation Algorithms

1011000111 1010101

Sliding Window
We can maintain simple statistics over sliding windows, using $O(\frac{1}{\epsilon} \log^2 N)$ space, where
  N is the length of the sliding window
  ε is the accuracy parameter

M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. 2002

The following slides repeat the same text while new bits arrive and the window (the group after the space) slides forward:

10110001111 0101011
101100011110 1010111
1011000111101 0101110
10110001111010 1011101
101100011110101 0111010
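As a rough illustration of how a count of 1's over a sliding window can be kept in much less than N space, here is a simplified bucket-based counter in the spirit of the cited paper. It is a sketch under assumptions: the class name `SlidingBitCounter` is hypothetical, and it uses the basic variant with at most two buckets per size, whose estimate can be off by up to 50% of the true count. Reaching a chosen ε requires allowing more buckets per size, which is not tuned here.

```python
class SlidingBitCounter:
    """Approximate count of 1's among the last `window` bits."""

    def __init__(self, window, max_per_size=2):
        self.window = window
        self.max_per_size = max_per_size
        self.buckets = []  # (time of newest 1 in bucket, size), newest first
        self.time = 0

    def add(self, bit):
        self.time += 1
        # Drop the oldest bucket once its newest 1 has left the window.
        while self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.buckets.pop()
        if not bit:
            return
        self.buckets.insert(0, (self.time, 1))
        # Cascade: whenever a size has too many buckets, merge its two oldest.
        start = 0
        while start < len(self.buckets):
            size = self.buckets[start][1]
            end = start
            while end < len(self.buckets) and self.buckets[end][1] == size:
                end += 1
            if end - start <= self.max_per_size:
                break
            newest_ts = self.buckets[end - 2][0]
            self.buckets[end - 2:end] = [(newest_ts, 2 * size)]
            start = end - 2

    def estimate(self):
        """All buckets count fully except the oldest, which counts half."""
        if not self.buckets:
            return 0.0
        total = sum(size for _, size in self.buckets)
        return total - self.buckets[-1][1] / 2


if __name__ == "__main__":
    counter = SlidingBitCounter(window=7)
    for ch in "10110001111010101":   # the bit stream shown on the slide
        counter.add(int(ch))
    print(counter.estimate())
```

Bucket sizes are powers of two, so the number of buckets grows only logarithmically with the window length.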


ADWIN: Adaptive sliding window

ADWIN

An adaptive sliding window whose size is recomputed online according to the rate of change observed.

ADWIN has rigorous guarantees (theorems)
  On ratio of false positives and negatives
  On the relation of the size of the current window and change rates

ADWIN, using a Data Stream Sliding Window Model,
  can provide the exact counts of 1’s in O(1) time per point.
  tries $O(\log W)$ cutpoints
  uses $O(\frac{1}{\epsilon} \log W)$ memory words
  the processing time per example is $O(\log W)$ (amortized and worst-case).

Time Change Detectors and Predictors: A General Framework

[Figure: block diagram built up over three slides. The input x_t enters an Estimator, which outputs an Estimation; a Change Detector module, which outputs an Alarm, is added next; a Memory module is added last.]

Window Management Models

W = 101010110111111

Equal & fixed size subwindows: 1010 1011011 1111 [Kifer+ 04]
Equal size adjacent subwindows: 1010101 1011 1111 [Dasu+ 06]
Total window against subwindow: 10101011011 1111 [Gama+ 04]
ADWIN: All adjacent subwindows: 1 01010110111111

The following slides repeat the same comparison while ADWIN moves through every remaining split of W into two adjacent subwindows:

10 1010110111111
101 010110111111
1010 10110111111
10101 0110111111
101010 110111111
1010101 10111111
10101011 0111111
101010110 111111
1010101101 11111
10101011011 1111
101010110111 111
1010101101111 11
10101011011111 1
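The "all adjacent subwindows" strategy can be sketched naively as follows. This is an illustration only: it scans every split in linear time instead of using the compressed structure that gives the logarithmic bounds quoted in the talk, and the threshold is assumed to be the Hoeffding-style cut value from the ADWIN paper (harmonic mean m of the two subwindow lengths, confidence δ divided by the window length). The name `adwin0_update` and the default `delta` are assumptions of the example.

```python
import math


def adwin0_update(window, x, delta=0.01):
    """Append x to `window` (a list), then shrink it while change is detected.

    For every split W = W0 . W1 the means of both parts are compared.
    If some split differs by more than eps_cut, the oldest element is
    dropped and the check is repeated.
    """
    window.append(x)
    shrunk = True
    while shrunk and len(window) > 1:
        shrunk = False
        n = len(window)
        total = sum(window)
        head_sum = 0.0
        for split in range(1, n):
            head_sum += window[split - 1]
            n0, n1 = split, n - split
            mean0 = head_sum / n0
            mean1 = (total - head_sum) / n1
            m = 1.0 / (1.0 / n0 + 1.0 / n1)       # harmonic-mean term
            eps_cut = math.sqrt(math.log(4.0 * n / delta) / (2.0 * m))
            if abs(mean0 - mean1) > eps_cut:
                del window[0]                      # forget the oldest item
                shrunk = True
                break
    return window


if __name__ == "__main__":
    w = []
    for bit in [0] * 200 + [1] * 200:   # abrupt change halfway through
        adwin0_update(w, bit)
    print(len(w), sum(w) / len(w))
```

On a stationary stream the window keeps growing; after the abrupt change in the usage example it shrinks so that it mostly contains data from the new distribution.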


Pattern Relaxed Support

Guojie Song, Dongqing Yang, Bin Cui, Baihua Zheng, Yunfeng Liu and Kunqing Xie. CLAIM: An Efficient Method for Relaxed Frequent Closed Itemsets Mining over Stream Data

Linear Relaxed Interval: The support space of all subpatterns can be divided into $n = \lceil 1/\epsilon_r \rceil$ intervals, where $\epsilon_r$ is a user-specified relaxed factor, and each interval can be denoted by $I_i = [l_i, u_i)$, where $l_i = (n - i) \cdot \epsilon_r \ge 0$, $u_i = (n - i + 1) \cdot \epsilon_r \le 1$ and $i \le n$.

Linear Relaxed closed subpattern $t$: if and only if there exists no proper superpattern $t'$ of $t$ such that their supports belong to the same interval $I_i$.
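A small sketch of the linear relaxed intervals, assuming supports are relative frequencies in [0, 1) and using exact fractions to avoid floating-point boundary effects. The function names are hypothetical. From the definition, a support s lies in I_i exactly when floor(s / ε_r) = n − i.

```python
import math
from fractions import Fraction


def linear_interval_index(support, eps_r):
    """Index i of the interval I_i = [(n-i)*eps_r, (n-i+1)*eps_r) holding `support`."""
    support, eps_r = Fraction(support), Fraction(eps_r)
    n = math.ceil(1 / eps_r)
    return n - math.floor(support / eps_r)


def same_relaxed_interval(support_a, support_b, eps_r):
    """True when both supports fall in the same linear relaxed interval."""
    return (linear_interval_index(support_a, eps_r)
            == linear_interval_index(support_b, eps_r))


if __name__ == "__main__":
    # With eps_r = 0.1 there are n = 10 intervals of width 0.1.
    print(linear_interval_index("0.35", "0.1"))
    print(same_relaxed_interval("0.31", "0.39", "0.1"))
    print(same_relaxed_interval("0.39", "0.40", "0.1"))
```

Under this relaxation a pattern is dropped from the closed set as soon as some proper superpattern has a support in the same interval, even if the two supports are not exactly equal.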

Pattern Relaxed Support

As the number of closed frequent patterns is not linear with respect to support, we introduce a new relaxed support:

Logarithmic Relaxed Interval: The support space of all subpatterns can be divided into $n = \lceil 1/\epsilon_r \rceil$ intervals, where $\epsilon_r$ is a user-specified relaxed factor, and each interval can be denoted by $I_i = [l_i, u_i)$, where $l_i = \lceil c^i \rceil$, $u_i = \lceil c^{i+1} - 1 \rceil$ and $i \le n$.

Logarithmic Relaxed closed subpattern $t$: if and only if there exists no proper superpattern $t'$ of $t$ such that their supports belong to the same interval $I_i$.

Galois Lattice of closed set of trees

We need
  a Galois connection pair
  a closure operator

[Figure: dataset D and a lattice with levels 1 2 3; 12 13 23; 123]

Incremental mining on closed frequent trees

1 Adding a tree transaction does not decrease the number of closed trees for D.

2 Adding a transaction with a closed tree does not modify the number of closed trees for D.

[Figure: lattice with levels 1 2 3; 12 13 23; 123]

Sliding Window mining on closed frequent trees

1 Deleting a tree transaction does not increase the number of closed trees for D.

2 Deleting a tree transaction that is repeated does not modify the number of closed trees for D.

[Figure: lattice with levels 1 2 3; 12 13 23; 123]

Algorithms

Algorithms
  Incremental: INCTREENAT
  Sliding Window: WINTREENAT
  Adaptive: ADATREENAT, uses ADWIN to monitor change

ADWIN

An adaptive sliding window whose size is recomputed online according to the rate of change observed.

ADWIN has rigorous guarantees (theorems)
  On ratio of false positives and negatives
  On the relation of the size of the current window and change rates

Experimental Validation: TN1

[Plot: time (sec.) against size (millions) for INCTREENAT and CMTreeMiner]

Figure: Time on experiments on ordered trees on TN1 dataset

Experimental Validation

[Plot: number of closed trees against number of samples for AdaTreeInc 1 and AdaTreeInc 2]

Figure: Number of closed trees maintaining the same number of closed datasets on input data


Summary

Conclusions
  New logarithmic relaxed closed support
  Using Galois Lattice Theory, we present methods for mining closed trees
    Incremental: INCTREENAT
    Sliding Window: WINTREENAT
    Adaptive: ADATREENAT, using ADWIN to monitor change

Future Work
  Labeled Trees and XML data.
