A Polynomial Time Matching Algorithm of Ordered Tree Patterns having Height-Constrained Variables
Kazuhide Aikou 1, Yusuke Suzuki 1,2, Takayoshi Shoudai 1, Tomoyuki Uchida 2, Tetsuhiro Miyahara 2
1. Department of Informatics, Kyushu University, Japan
2. Faculty of Information Sciences, Hiroshima City University, Japan
Contents
1. Backgrounds and Motivations
2. Preliminaries
- Ordered Term Trees
- Height-Constrained Variables
3. A Matching Algorithm of Ordered Term Trees having Height-Constrained Variables
4. Conclusions and Future Works
Increase of Tree-structured Data (Web Documents, HTML/XML, etc.)
Discovery of Tree-structured Patterns Common to Tree-structured Data
Application: Knowledge Discovery from Web Documents
<Salesperiod>
<Quarter>Winter1998</Quarter>
<Design>
<Designnumber>C365</Designnumber>
<Description>North Star Polo</Description>
<Unitssold>35500</Unitssold>
</Design>
</Salesperiod>
[Figure: the <Salesperiod> data and an HTML document (<HTML>, <Head>, <Body>, <Title>, <Table>) drawn as trees.]
Ordered Term Trees
Our Works:
• COLT for Term Trees
• Web Mining Systems Using Learning Algorithms for Term Trees
Backgrounds
Ordered trees express semi-structured data (HTML, XML, etc.).
<HTML>
<HEAD>text1</HEAD>
<BODY>
<DIV>text2</DIV>
<FONT>text3</FONT>
<FONT>text4</FONT>
</BODY>
</HTML>
[Figure: the HTML data drawn as an ordered tree in the Object Exchange Model, with TAG vertices (<HTML>, <HEAD>, <BODY>, <DIV>, <FONT>), TEXT vertices (text1 to text4), and the sibling order numbered 1, 2, 3.]
Preliminaries
[Figure: the same HTML data drawn as an ordered tree carrying the labels <HTML>, <HEAD>, <BODY>, <DIV>, <FONT>, text1 to text4.]
Ordered Trees with Edge Labels
x, y, ...: variable labels
An ordered term tree t = (V, E, H)
V: a vertex set
E: an edge set
H: a variable set
Ordered Tree Patterns with Internal Structured Variables
[Figure: an ordered term tree with vertices u1 to u8 and variables labeled x and y. Variable h1 has a parent port and one child port; variable h2 has a parent port and several child ports.]
Multi-child port variables: variables with at least one child port (such as variable h2).
Single-child port variables: variables with exactly one child port (such as variable h1).
A variable can be substituted with an arbitrary ordered tree.
Ordered Term Trees with Multi-Child Port Variables
[Figure: an ordered tree T1, an ordered tree T2, and an ordered term tree t with variables x and y. Replacing the variables of t with T1 and T2 yields a new ordered tree T.]
Identify the root of T1 with the parent port.
Identify the two leaves with the two child ports.
Identify the root of T2 with the parent port.
Choose one of the leaves of T2 and identify it with the child port.
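The replacement of a single-child port variable described above can be sketched in Python. This is only a minimal sketch: the classes `Node` and `Var`, the function `substitute`, and the `leaf_path` argument (child indices leading from the root of the replacement tree to the chosen leaf) are hypothetical names introduced for illustration and do not come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    # Ordered children, left to right; an entry is a Node or a Var.
    children: list = field(default_factory=list)


@dataclass
class Var:
    label: str   # variable label such as "x" or "y"
    child: Node  # the single child port


def substitute(parent, var, repl_root, leaf_path):
    """Replace `var`, an entry of parent.children, by the tree `repl_root`.

    The root of the replacement tree is identified with the parent port
    (`parent`) and the leaf reached by `leaf_path` is identified with the
    child port (`var.child`). The replacement tree is modified in place.
    """
    if not leaf_path:
        raise ValueError("the chosen leaf must differ from the root")
    node = repl_root
    for idx in leaf_path[:-1]:
        node = node.children[idx]
    leaf = node.children[leaf_path[-1]]
    if leaf.children:
        raise ValueError("leaf_path must end at a leaf")
    # The chosen leaf is identified with the child port.
    node.children[leaf_path[-1]] = var.child
    # The root is identified with the parent port: its children take the
    # place of the variable among the children of `parent`.
    pos = next(k for k, c in enumerate(parent.children) if c is var)
    parent.children[pos:pos + 1] = repl_root.children
```

For a term tree in which u1 has the child u2 followed by a variable y with child port u3, substituting a tree r with children a and b (where b has a single leaf child) along `leaf_path=[1, 0]` leaves u1 with children u2, a, b and hangs u3 below b.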
Substitutions
[Figure: a linear ordered term tree with variables x and y matches an ordered tree by a substitution.]
Linear Ordered Term Trees:
All variables have mutually distinct variable labels.
All variable replacements are decided independently.
INPUT T: an ordered tree; t: a linear ordered term tree with multi-child port variables.
PROBLEM Does t match T?
This matching problem can be solved in O(nN) time, where n is the number of vertices in t and N is the number of vertices in T [Suzuki et al., ILP 02].
Matching Problem for Linear Ordered Term Trees with Multi-Child Port Variables
<HTML>
<HEAD>text1</HEAD>
<BODY>
<DIV>text2</DIV>
<FONT>text3</FONT>
<FONT>text4</FONT>
</BODY>
</HTML>
An HTML file
[Figure: the HTML file drawn as an ordered tree, with its height indicated.]
Observation: Most ordered trees obtained from HTML files have low height.
Trees of large height are rare, so a long branch is a distinguishing feature.
[Figure: relationship between the size of the tree representing an HTML file (the number of vertices, axis from 0 to 2000) and its height (axis from 0 to 40).]
Height-constrained single-child port variables
A variable carries a pair (i, j) with 0 < i ≤ j: i constrains the trunk length and j constrains the height.
Trunk Length: The path length between the root and the leaf which are identified with the ports.
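A small Python sketch of checking a replacement tree against a pair (i, j) follows. It assumes that (i, j) requires a trunk length of at least i and a height of at most j; the slides state 0 < i ≤ j and later speak of the "lowest trunk lengths", but they do not spell out the direction of the two bounds, so this reading is an assumption. The nested-list encoding (a vertex is the list of its child subtrees, a leaf is the empty list) and the names `height`, `satisfies` and `leaf_path` are hypothetical.

```python
def height(tree):
    """Height of a tree given as nested lists; a leaf ([]) has height 0."""
    return 0 if not tree else 1 + max(height(child) for child in tree)


def satisfies(tree, leaf_path, i, j):
    """Check a replacement tree against an (i, j) constraint.

    `leaf_path` lists the child indices from the root to the leaf that is
    identified with the child port, so its length is the trunk length.
    Assumed reading: trunk length >= i and height <= j.
    """
    node = tree
    for idx in leaf_path:
        node = node[idx]
    if node:
        raise ValueError("leaf_path must end at a leaf")
    return len(leaf_path) >= i and height(tree) <= j


# A root with two children: a leaf, and a chain of two further vertices.
example = [[], [[[]]]]
```

With this example tree, the trunk along `[1, 0, 0]` has length 3 and the tree has height 3, so the check passes for (2, 4) and fails for (2, 2); the trunk along `[0]` has length 1 and fails for (2, 4).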
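Example.
[Figure: an ordered term tree t with variables constrained by (2,2) and (2,4), and an ordered tree T; one candidate is marked O.K. and another N.G.]
MATCHING PROBLEM for Linear Ordered Term Trees with Height-Constrained Single-Child Port Variables
INPUT: An ordered tree T and a linear ordered term tree t with height-constrained single-child port variables (in the figure, constrained by (1,2) and (4,6)).
PROBLEM: Does t match T?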
Main Theorem
MATCHING PROBLEM for Linear Ordered Term Trees with Height-Constrained Single-Child Port Variables can be solved in O(N max{nDmax, S}) time, where
n: the number of vertices of t,
N: the number of vertices of T,
S: the sum of the lowest trunk lengths of all variables of t,
Dmax: the maximum number of children of a vertex of T.
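Sub Term Tree and Subtree
[Figure: a linear ordered term tree t with variables constrained by (4,6), (1,1) and (1,2), showing the sub term tree t[u'] at a vertex u', and an ordered tree T with vertices u and v.]
T[u]: the subtree of T consisting of u and all descendants of u.
T[u]-T[v]: the part of T[u] consisting of the vertices which are not proper descendants of v.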
Idea: Corresponding Sets CS(u)
t = (Vt, Et, Ht): a term tree, T = (VT, ET): a tree.
CS(u) ⊆ Vt × ℕ × ℕ: a corresponding set of a vertex u ∈ VT.
(v', i, j) ∈ CS(u) shows that there is a descendant v of u such that
(1) t[v'] matches T[v],
(2) the length between u and v is i (if i < i'-1), and
(3) the height of T[u]-T[v] is j.
[Figure: in t, the vertex v' hangs below a variable constrained by (i', j'); in T, v is a descendant of u at distance i, and T[u]-T[v] has height j.]
Therefore, (v', 0, 0) ∈ CS(u) if and only if t[v'] matches T[u].
(the root of t, 0, 0) ∈ CS(the root of T) if and only if t matches T.
Algorithm Matching(t, T)
Initialization;
while there is an unmarked vertex u of T do begin
    Mark u;
    VID-Inheriting(u);
    C-Set-Attaching(u)
end
Initialization: Vertex Identifiers
[Figure: a linear ordered term tree t with variables constrained by (1,2) and (2,2), whose vertices carry the identifiers 1 to 9.]
Vertex identifiers are assigned in breadth-first search order.
The children of an internal vertex have consecutive vertex identifiers. This saves computation time in the main processes.
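Breadth-first numbering can be sketched as follows. The encoding of the term tree as a dictionary from a vertex to its ordered list of children, and the name `bfs_identifiers`, are assumptions made for this illustration; a variable is treated here like an ordinary link from its parent port to its child port.

```python
from collections import deque


def bfs_identifiers(children, root):
    """Assign identifiers 1, 2, 3, ... to the vertices in breadth-first order."""
    ids = {}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        ids[v] = len(ids) + 1
        # The children of v enter the queue one after another, so they
        # will later receive consecutive identifiers.
        queue.extend(children.get(v, []))
    return ids
```

Because the children of a vertex are enqueued together, their identifiers form one contiguous range, which is the property the slide relies on.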
Compute the corresponding set of each vertex from leaves to the root.
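The slides do not say which unmarked vertex the main loop picks next. One order that goes from the leaves to the root, given here only as a sketch with the hypothetical name `bottom_up_order` and a dictionary-of-children encoding of T, is the reverse of a root-first traversal:

```python
def bottom_up_order(children, root):
    """Return the vertices of T so that every vertex follows all of its descendants."""
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)          # a vertex is recorded before its descendants
        stack.extend(children.get(v, []))
    order.reverse()              # reversing puts descendants first
    return order
```

Processing the vertices in this order guarantees that the corresponding sets of all children of u are available when u is handled.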
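[Figure: a term tree t with vertex identifiers up to 9 and variables constrained by (1,2) and (3,6), and an ordered tree T whose vertices are labeled with letters from A to Q.]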
Initialization: For all leaves u of T,
Mark u;
CS(u) := {(u', 0, 0) | u' is a leaf of t};
height(u) := 0;
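The initialization step can be written out as a short Python sketch. Both trees are encoded as dictionaries from a vertex to its ordered list of children; this encoding and the names `vertices` and `initialize` are assumptions for illustration, and the computation of vertex identifiers is left out.

```python
def vertices(children, root):
    """List all vertices reachable from root."""
    found, stack = [], [root]
    while stack:
        v = stack.pop()
        found.append(v)
        stack.extend(children.get(v, []))
    return found


def initialize(T, T_root, t, t_root):
    """Mark every leaf u of T, set CS(u) and height(u) as on the slide."""
    t_leaves = [v for v in vertices(t, t_root) if not t.get(v)]
    cs, height, marked = {}, {}, set()
    for u in vertices(T, T_root):
        if not T.get(u):                       # u is a leaf of T
            marked.add(u)
            cs[u] = {(leaf, 0, 0) for leaf in t_leaves}
            height[u] = 0
    return cs, height, marked
```

Internal vertices of T stay unmarked and receive their corresponding sets later, in the main loop.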
VID-Inheriting (1/3): Let v' be the child port of an (i,j)-height-constrained variable. For an internal vertex u of T, if there is an element (v', i', j') in the CS of a child of u, add (v', min{i'+1, i-1}, *) to CS(u).
Example (vertex 7 is the child port of a variable constrained by (3,6)):
(7,0,0) ∈ CS(Q)
(7,0,0) ∈ CS(J)
Add (7,1,1) to CS(P)
Add (7,2,2) to CS(O)
Add (7,2,3) to CS(N)
Add (7,2,4) to CS(I)
If i' = i-1 then the parent of u can match the parent port u'.
VID-Inheriting (2/3): Case: At least two children have (v', i', *) for a vertex v' and an integer i'.
Choose the smallest height.
Example (vertex 7 is the child port of a variable constrained by (4,6); b and c are children of a in T):
(7,1,1) ∈ CS(b), height(b)=4
(7,1,3) ∈ CS(c), height(c)=3
The candidates are (7,2,4) and (7,2,5); (7,2,4) ∈ CS(a).
VID-Inheriting (3/3): Case: A child has (v', i', *) and another child has (v', i'', *) for distinct integers i' and i''.
Add all triplets to CS(u) (at most i triplets).
Example (vertex 7 is the child port of a variable constrained by (4,6); b and c are children of a in T):
(7,1,3) ∈ CS(b), height(b)=4
(7,2,2) ∈ CS(c), height(c)=3
(7,2,4), (7,3,5) ∈ CS(a)
• CS(a) contains at most S triplets.
• Then the total time complexity of Inheriting of a vertex a is O(S ma), where ma is the number of the children of a.
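The three inheriting cases can be combined in one Python sketch. The height component, written as * on the slides, is not defined there; the rule used below (the larger of j'+1 and one plus the greatest height among the other children) is inferred from the worked examples and should be read as an assumption. The name `inherit` and its arguments are hypothetical, and no pruning against the upper height bound of the variable is done because the slides do not show one.

```python
def inherit(child_cs, child_height, min_trunk):
    """Compute the triplets a vertex u inherits from its children.

    child_cs[k]     : set of triplets (v', i', j') in the CS of the k-th child
    child_height[k] : height of the subtree rooted at the k-th child
    min_trunk[v']   : the value i of the (i, j)-constrained variable whose
                      child port is v' (other vertices are not inherited)
    """
    best = {}
    for k, triples in enumerate(child_cs):
        others = [h for m, h in enumerate(child_height) if m != k]
        # Height contributed by the siblings of the k-th child (assumed rule).
        side = 1 + max(others) if others else 0
        for v, i1, j1 in triples:
            if v not in min_trunk:
                continue
            i_new = min(i1 + 1, min_trunk[v] - 1)
            j_new = max(j1 + 1, side)
            key = (v, i_new)
            # Case 2/3: for equal (v', i) keep the smallest height.
            if key not in best or j_new < best[key]:
                best[key] = j_new
    # Case 3/3: triplets with distinct i are all kept.
    return {(v, i, j) for (v, i), j in best.items()}
```

Under this assumed rule the sketch reproduces the slide examples: case 2/3 yields (7,2,4) for CS(a), and case 3/3 yields (7,2,4) and (7,3,5).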
C-Set-Attaching (Small Examples)
[Figure: in each example, t has a vertex 2 with children 4, 5, 6, and T has a vertex B with vertices among D, E, F, G, H below it.]
Example 1 (B has children D, F, H):
(4,0,0) ∈ CS(D), (5,0,0) ∈ CS(F), (6,0,0) ∈ CS(H)
(2,0,0) should be added to CS(B).
Example 2 (t has a variable constrained by (1,2)):
(4,0,0) ∈ CS(D), (5,0,0) ∈ CS(G), (6,0,0) ∈ CS(H)
height(F)=2, height(E)=1
(5,0,0) ∈ CS(G) covers [E,G].
(2,0,0) is added to CS(B).
Example 3 (t has a variable constrained by (1,2)):
(4,0,0) ∈ CS(D), (5,1,1) ∈ CS(F), (6,0,0) ∈ CS(H)
height(G)=2, height(E)=1
(5,1,1) ∈ CS(F) covers [E,G].
(2,0,0) is added to CS(B).
Example 4 (t has a variable constrained by (1,2)):
(4,0,0) ∈ CS(D), (5,1,1) ∈ CS(F), (6,0,0) ∈ CS(H)
height(G)=2, height(E)=3
(5,1,1) ∈ CS(F) covers [F,G] but cannot cover E.
(2,0,0) may not be added to CS(B).
C-Set-Attaching (A Big Example)
[Figure: an ordered term tree t whose vertex 11 has children 1 to 10, with variables constrained by (4,8), (3,4), (5,5) and (4,7), and an ordered tree whose vertex O has children A to N. Each child X of O is annotated with height(X) and CS(X).]
First, we prepare a virtual table for a new graph. Rows and columns represent vertices of T and t, respectively.
[Figure: the children D to I of O with their heights and corresponding sets, next to the children 7 and 8 of vertex 11, whose variables are constrained by (3,4) and (5,5).]
(7,2,3) ∈ CS(F) covers [E,F]. Add a vertex labeled with [E,F] to F7 in the table.
(8,4,4) ∈ CS(G) covers [E,G]. Add a vertex labeled with [E,G] to G8 in the table.
(8,4,4) ∈ CS(H) covers [H,H]. Add a vertex labeled with [H,H] to H8 in the table.
Add a directed edge from [E,F] at F7 to [E,G] at G8, because two consecutive variables cover all vertices from E to G.
[Figure: the completed table with rows A to N and columns 1 to 10, containing vertices labeled [B,K], [J,K], [K,N], [E,F], [E,G], [H,H] and [M,N], together with the vertices vstart and vgoal.]
• If there is a directed path from vstart to vgoal, (11,0,0) is added to CS(O).
• The total time complexity of C-Set-Attaching of a vertex u of T and a vertex u' of t is O(mu^2 m'u'), where mu and m'u' are the numbers of the children of u and u', respectively.
Total: O(N max{nDmax, S})
n: the number of vertices of t,
N: the number of vertices of T,
S: the sum of the lowest trunk lengths of all variables of t,
Dmax: the maximum number of children of a vertex of T.
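The final test of C-Set-Attaching, whether vgoal can be reached from vstart in the table graph, is an ordinary reachability question. The sketch below assumes the graph has already been built and is given as a dictionary from a node to its successors; the node encoding (pairs such as `("F", 7)` for table cell F7) and the name `has_path` are hypothetical, and the construction of the edges themselves is not covered.

```python
from collections import deque


def has_path(edges, start, goal):
    """Breadth-first search for a directed path from start to goal."""
    seen, queue = {start}, deque([start])
    while queue:
        v = queue.popleft()
        if v == goal:
            return True
        for w in edges.get(v, []):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return False


# A made-up graph, not the one from the slides.
demo_edges = {
    "vstart": [("F", 7)],
    ("F", 7): [("G", 8)],
    ("G", 8): ["vgoal"],
    ("H", 8): [],
}
```

In the made-up graph `demo_edges`, vgoal is reachable from vstart but not from the isolated cell H8.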
Conclusions
• An O(N max{nDmax, S}) Time Matching Algorithm for Ordered Term Trees with Height-Constrained Variables.
• [Our Related Works] Polynomial-Time Learning Algorithms for Ordered Term Trees with Height-Constrained Variables [Suzuki et al., PRICAI'04], [Matsumoto and Shoudai, ALT'04].
Future Works:
• An Efficient Matching Algorithm for Ordered Term Trees with Height-Constrained Multi-Child Port Variables.
• Polynomial-Time Learning Algorithms for Ordered Term Trees with Height-Constrained Multi-Child Port Variables.