Adaptive XML Tree Mining on Evolving Data Streams
Albert Bifet
Laboratory for Relational Algorithmics, Complexity and Learning (LARCA)
Departament de Llenguatges i Sistemes Informàtics
Universitat Politècnica de Catalunya
Porto, 21 May 2009
May 08, 2015
Mining Evolving Massive Structured Data
The Disintegration of the Persistence of Memory, 1952-54
Salvador Dalí
The basic problem: finding interesting structure in data
Mining massive data
Mining time varying data
Mining on real time
Mining XML data
XML Tree Classification on evolving data streams
Figure: A dataset example (four labeled trees with node labels A to D, classified as CLASS1, CLASS2, CLASS1, CLASS2)
Tree Pattern Mining
Trees are sanctuaries. Whoever knows how to listen to them, can learn the truth.
Hermann Hesse
Given a dataset of trees, find the complete set of frequent subtrees
Frequent Tree Pattern (FT):
Include all the trees whose support is no less than min_sup
Closed Frequent Tree Pattern (CT):
Include no tree which has a super-tree with the same support
CT ⊆ FT
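The two definitions can be checked mechanically on a small example. The sketch below is only an illustration, not the miner presented in this talk: it assumes unlabeled, unordered trees encoded as nested tuples of children, top-down subtrees (root mapped to root), and a candidate list supplied by the caller. The names `is_subtree`, `support`, `frequent` and `closed` are hypothetical.

```python
def is_subtree(t, u):
    """True if unordered tree t embeds in u with the root mapped to the root."""
    if len(t) > len(u):
        return False
    match = {}  # index of a child of u -> index of the child of t assigned to it

    def assign(i, seen):
        # Augmenting-path step of bipartite matching (children of t vs. children of u).
        for j in range(len(u)):
            if j not in seen and is_subtree(t[i], u[j]):
                seen.add(j)
                if j not in match or assign(match[j], seen):
                    match[j] = i
                    return True
        return False

    return all(assign(i, set()) for i in range(len(t)))


def support(t, dataset):
    """Number of dataset trees that contain t."""
    return sum(is_subtree(t, u) for u in dataset)


def frequent(candidates, dataset, min_sup):
    """FT restricted to the candidates: support no less than min_sup."""
    return [t for t in candidates if support(t, dataset) >= min_sup]


def closed(candidates, dataset, min_sup):
    """CT restricted to the candidates: no proper super-tree with the same support."""
    ft = frequent(candidates, dataset, min_sup)
    return [
        t for t in ft
        if not any(
            is_subtree(t, s) and not is_subtree(s, t)
            and support(s, dataset) == support(t, dataset)
            for s in ft
        )
    ]
```

Closedness is judged only against the supplied candidates, so the result is meaningful when the candidate list contains every subtree of interest. By construction the closed list is always a sublist of the frequent list, mirroring CT ⊆ FT.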
Mining Closed Frequent Trees
Our trees are:
Labeled and Unlabeled
Ordered and Unordered
Our subtrees are:
Induced
Top-down
Two different ordered trees but the same unordered tree
A tale of two trees
Consider D = {A,B}, where
A:
B:
and let min_sup = 2.
Frequent subtrees (figure omitted)
Closed subtrees (figure omitted)
XML Tree Classification on evolving data streams
Closed   Freq. not Closed Trees     Tree Trans. 1 2 3 4
c1       [tree drawings omitted]                1 0 1 0
c2       [tree drawings omitted]                1 0 0 1
XML Tree Classification on evolving data streams

Frequent Trees (columns grouped under the closed trees c1, c2, c3, c4)
Id  c1  f^1_1  c2  f^1_2  f^2_2  f^3_2  c3  f^1_3  c4  f^1_4  f^2_4  f^3_4  f^4_4  f^5_4
1   1   1      1   1      1      1      0   0      1   1      1      1      1      1
2   0   0      0   0      0      0      1   1      1   1      1      1      1      1
3   1   1      0   0      0      0      1   1      1   1      1      1      1      1
4   0   0      1   1      1      1      1   1      1   1      1      1      1      1

Closed Trees and Maximal Trees
          Closed Trees     Maximal Trees
Id Tree   c1  c2  c3  c4   c1  c2  c3      Class
1         1   1   0   1    1   1   0       CLASS1
2         0   0   1   1    0   0   1       CLASS2
3         1   0   1   1    1   0   1       CLASS1
4         0   1   1   1    0   1   1       CLASS2
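The step from mined patterns to classifier input can be sketched as follows. This is an illustrative reading of the table above, not code from the talk: the occurrence sets are transcribed from the closed-tree columns, and the names `OCCURRENCES`, `CLASSES` and `to_tuples` are hypothetical.

```python
# Closed tree -> ids of the dataset trees that contain it (read off the table).
OCCURRENCES = {"c1": {1, 3}, "c2": {1, 4}, "c3": {2, 3, 4}, "c4": {1, 2, 3, 4}}
# Dataset tree id -> class label (read off the table).
CLASSES = {1: "CLASS1", 2: "CLASS2", 3: "CLASS1", 4: "CLASS2"}


def to_tuples(occurrences, classes, patterns):
    """Return {tree id: (binary attribute vector, class label)} for the given patterns."""
    return {
        tid: ([int(tid in occurrences[p]) for p in patterns], classes[tid])
        for tid in sorted(classes)
    }


# One attribute per closed tree, or one per maximal tree (fewer attributes).
closed_view = to_tuples(OCCURRENCES, CLASSES, ["c1", "c2", "c3", "c4"])
maximal_view = to_tuples(OCCURRENCES, CLASSES, ["c1", "c2", "c3"])
```

Each tuple has one binary attribute per pattern, so using maximal trees instead of closed trees shortens the vectors handed to the classifier.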
XML Tree Framework on evolving data streams
XML Tree Classification Framework Components
An XML closed frequent tree miner
A data stream classifier algorithm, which we will feed with tuples to be classified online.
Mining Evolving Tree Data Streams
Problem: Given a data stream D of rooted and unordered trees, find frequent closed trees.
We provide three algorithms, of increasing power
Incremental
Sliding Window
Adaptive
Mining Closed Unordered Subtrees
CLOSED_SUBTREES(t, D, min_sup, T)
  if not CANONICAL_REPRESENTATIVE(t)
    then return T
  for every t′ that can be extended from t in one step
    do if Support(t′) ≥ min_sup
         then T ← CLOSED_SUBTREES(t′, D, min_sup, T)
       if Support(t′) = Support(t)
         then t is not closed
  if t is closed
    then insert t into T
  return T
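The recursion above can be written as a short Python skeleton. This is a sketch under stated assumptions rather than the TreeNat implementation: the dataset D is hidden inside a caller-supplied `support` function, and `support`, `extensions` and `is_canonical` are hypothetical callables standing in for Support, the one-step extension and CANONICAL_REPRESENTATIVE.

```python
def closed_subtrees(t, support, extensions, is_canonical, min_sup, closed=None):
    """Recursive skeleton of CLOSED_SUBTREES; returns the list of closed patterns found."""
    if closed is None:
        closed = []
    # Non-canonical representatives are pruned, so each pattern is expanded once.
    if not is_canonical(t):
        return closed
    sup_t = support(t)
    t_is_closed = True
    for t2 in extensions(t):
        sup_t2 = support(t2)
        # Only frequent extensions are explored further.
        if sup_t2 >= min_sup:
            closed_subtrees(t2, support, extensions, is_canonical, min_sup, closed)
        # A one-step extension with the same support shows that t is not closed.
        if sup_t2 == sup_t:
            t_is_closed = False
    if t_is_closed:
        closed.append(t)
    return closed
```

As in the pseudocode, the closedness test looks at every one-step extension, including those whose recursive call is cut immediately by the canonical-form check.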
Example
D = {A, B}
min_sup = 2.
〈A〉 = (0,1,2,3,2,1)    〈B〉 = (0,1,2,3,1,2,2)
(0)
  (0,1)
    (0,1,1)
    (0,1,2)
      (0,1,2,1)
      (0,1,2,2)
        (0,1,2,2,1)
      (0,1,2,3)
        (0,1,2,3,1)
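The sequences in this example list the depth of each node in preorder. A minimal sketch of that encoding follows. It assumes, which the slide does not state, that the canonical representative of an unordered tree is the ordering with the lexicographically largest depth sequence. The function names are hypothetical.

```python
def parse(depths):
    """Turn a preorder depth sequence into a nested tuple of children."""
    def build(pos, depth):
        children = []
        pos += 1
        while pos < len(depths) and depths[pos] == depth + 1:
            child, pos = build(pos, depth + 1)
            children.append(child)
        return tuple(children), pos

    tree, end = build(0, depths[0])
    if end != len(depths):
        raise ValueError("not a valid preorder depth sequence")
    return tree


def depth_sequence(tree, depth=0):
    """Preorder list of node depths, as a tuple."""
    seq = [depth]
    for child in tree:
        seq.extend(depth_sequence(child, depth + 1))
    return tuple(seq)


def canonical(tree):
    """Reorder children so the depth sequence is lexicographically largest (assumed convention)."""
    kids = [canonical(child) for child in tree]
    kids.sort(key=depth_sequence, reverse=True)
    return tuple(kids)


def is_canonical(depths):
    """True if the sequence is already the canonical ordering of its unordered tree."""
    return depth_sequence(canonical(parse(depths))) == tuple(depths)
```

Under this convention both 〈A〉 and 〈B〉 above are canonical, while a sequence such as (0,1,2,1,2,3) is not, because swapping the two subtrees of the root gives the larger sequence (0,1,2,3,1,2).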
Experimental results
TreeNat
Unlabeled Trees
Top-Down Subtrees
No Occurrences
CMTreeMiner
Labeled Trees
Induced Subtrees
Occurrences
Closure Operator on Trees
D : the finite input dataset of trees
T : the (infinite) set of all trees
Definition: We define the following Galois connection pair:
For finite A ⊆ D
σ(A) is the set of trees in T that are subtrees of every tree in A
$\sigma(A) = \{\, t \in T \mid \forall t' \in A \ (t \preceq t') \,\}$
For finite B ⊂ T
τD(B) is the set of trees in D that are supertrees of every tree in B
$\tau_D(B) = \{\, t' \in D \mid \forall t \in B \ (t \preceq t') \,\}$
Closure Operator: The composition ΓD = σ ∘ τD is a closure operator.
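For reference, "closure operator" is meant in the usual order-theoretic sense. The three defining properties, written for sets of trees B, B′ on which ΓD is defined, are the textbook ones below. They are not taken from the slides.

```latex
\begin{align*}
B &\subseteq \Gamma_D(B) && \text{(extensive)} \\
B \subseteq B' &\implies \Gamma_D(B) \subseteq \Gamma_D(B') && \text{(monotone)} \\
\Gamma_D(\Gamma_D(B)) &= \Gamma_D(B) && \text{(idempotent)}
\end{align*}
```

A set of trees B is then called closed when ΓD(B) = B.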
Galois Lattice of closed set of trees
B = { [tree figure] }
τD(B) = { [tree figure], [tree figure] }
ΓD(B) = σ ∘ τD(B) = { [tree figure] and its subtrees }
1 2 3
12 13 23
123
Algorithms
Incremental: INCTREENAT
Sliding Window: WINTREENAT
Adaptive: ADATREENAT, which uses ADWIN to monitor change
ADWIN
An adaptive sliding window whose size is recomputed online according to the rate of change observed.
ADWIN has rigorous guarantees (theorems)
On ratio of false positives and false negatives
On the relation of the size of the current window and change rates
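The idea of a window that shrinks when the data change can be illustrated with a deliberately naive sketch. It is not the ADWIN implementation: it stores every value, checks every split of the window in linear time, and assumes inputs in [0, 1]. The Hoeffding-style threshold and its constants are an assumption of this sketch, and the class name `NaiveAdaptiveWindow` is hypothetical.

```python
import math
from collections import deque


class NaiveAdaptiveWindow:
    """Keeps recent values; drops the oldest ones while two sub-windows disagree."""

    def __init__(self, delta=0.01):
        self.delta = delta      # confidence parameter (assumed meaning)
        self.window = deque()   # values assumed to lie in [0, 1]

    def _has_change(self):
        n = len(self.window)
        total = sum(self.window)
        delta_prime = self.delta / n
        head_sum = 0.0
        # Try every split into an older part (n0 items) and a newer part (n1 items).
        for n0, x in enumerate(list(self.window)[:-1], start=1):
            head_sum += x
            n1 = n - n0
            m = 1.0 / (1.0 / n0 + 1.0 / n1)
            eps = math.sqrt(math.log(4.0 / delta_prime) / (2.0 * m))
            if abs(head_sum / n0 - (total - head_sum) / n1) > eps:
                return True
        return False

    def add(self, x):
        """Insert x; return True if the window was shrunk because of a detected change."""
        self.window.append(x)
        changed = False
        while len(self.window) > 1 and self._has_change():
            self.window.popleft()
            changed = True
        return changed
```

On a stationary stream the window simply grows. After an abrupt shift in the mean, the oldest values are discarded until the remaining sub-windows are statistically compatible again.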
Experimental Validation: TN1
Figure: Experiments on ordered trees with TN1 dataset (time in seconds, axis ticks 100 to 300, against size in millions, axis ticks 2 to 8, for INCTREENAT and CMTreeMiner)
What is MOA?
{M}assive {O}nline {A}nalysis is a framework for online learning from data streams.
It is closely related to WEKA
It includes a collection of offline and online algorithms as well as tools for evaluation:
boosting and bagging
Hoeffding Trees
with and without Naïve Bayes classifiers at the leaves.
WEKA: the bird
MOA: the bird
The Moa (another native NZ bird) is not only flightless, like the Weka, but also extinct.
Data stream classification cycle
1 Process an example at a time, and inspect it only once (at most)
2 Use a limited amount of memory
3 Work in a limited amount of time
4 Be ready to predict at any point
Environments and Data Sources
Environments
Sensor Network: 100 KB
Handheld Computer: 32 MB
Server: 400 MB
Data Sources
Random Tree Generator
Random RBF Generator
LED Generator
Waveform Generator
Function Generator
Algorithms
Naive Bayes
Decision stumps
Hoeffding Tree
Hoeffding Option Tree
Bagging and Boosting
Prediction strategies
Majority class
Naive Bayes Leaves
Adaptive Hybrid
Hoeffding Option Tree
Hoeffding Option Trees
Regular Hoeffding tree containing additional option nodes that allow several tests to be applied, leading to multiple Hoeffding trees as separate paths.
GUI
java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.gui.TaskLauncher
Ensemble Methods
http://www.cs.waikato.ac.nz/~abifet/MOA/
New ensemble methods:
ADWIN bagging: When a change is detected, the worst classifier is removed and a new classifier is added.
Adaptive-Size Hoeffding Tree bagging
XML Tree Framework on evolving data streams
                      Maximal                 Closed
Dataset    # Trees    Att.   Acc.    Mem.     Att.   Acc.    Mem.
CSLOG12    15483      84     79.64   1.2      228    78.12   2.54
CSLOG23    15037      88     79.81   1.21     243    78.77   2.75
CSLOG31    15702      86     79.94   1.25     243    77.60   2.73
CSLOG123   23111      84     80.02   1.7      228    78.91   4.18
Table: BAGGING on unordered trees.
Conclusions
XML tree stream classifier system.
Using Galois lattice theory, we present methods for mining closed trees
Incremental
Sliding Window
Adaptive: using ADWIN to monitor change
We use MOA data stream classifiers.