Tree Inclusion, Signatures, and Evaluation of Path-Oriented Queries
Dr. Yangjun Chen
Dept. Applied Computer Science, University of Winnipeg, Canada
• Motivation
• Path-Oriented Queries and Tree Inclusion Problem
• Evaluation of Path-Oriented Queries
  - Top-down Algorithm for Tree Inclusion
  - Integration of Signatures into Top-down Tree Inclusion
• Experiment Results
• Summary and Future Work
Motivation
• Local Information Resource Management – document databases
• Internet – Distributed Document Databases
• Document Databases
  - Storage of documents in relational databases
    non-structured data, semi-structured data
  - Evaluation of path-oriented queries in document databases
    path-oriented languages: XQL, XPath, and XML-QL
    Query evaluation methods:
    • inverse-file based
    • signature based
    • string-matching based: suffix trees, Pat-trees
    • tree-inclusion based
• Integrating signatures into the top-down tree inclusion algorithm
Path-Oriented Queries and Tree Inclusion Problem
• XML Documents and Path-Oriented Queries
XML document:
<hotel-room-reservation filecod="1302">
<name>Travel-lodge</name>
<from>April 20, 2002</from>
<to>April 28, 2002</to>

Path-oriented queries:
Single-path query:
/letter//body [para $contains$ 'visited'].
Multiple-path query:
/hotel-room-reservation/name ?x
/hotel-room-reservation/location [city-or-district = 'Winnipeg']
/hotel-room-reservation/location/address [street = '510 Portage Ave.'].
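As a rough illustration of how such path-oriented queries behave, the sketch below evaluates comparable paths with Python's standard xml.etree.ElementTree module. The document content is hypothetical (only the element names come from the slides), and ElementTree's limited XPath subset stands in for the query languages named above; it does not support the ?x variable syntax, so the name is simply read out.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: element names follow the slides, values are illustrative.
DOC = """
<hotel-room-reservation filecod="1302">
  <name>Travel-lodge</name>
  <location>
    <city-or-district>Winnipeg</city-or-district>
    <address><number>515</number><street>Portage Ave.</street></address>
  </location>
</hotel-room-reservation>
""".strip()

root = ET.fromstring(DOC)  # root is the hotel-room-reservation element

# /hotel-room-reservation/name ?x : bind the text of <name>
name = root.findtext("name")

# /hotel-room-reservation/location [city-or-district = 'Winnipeg']
in_winnipeg = root.findall("location[city-or-district='Winnipeg']")

# /hotel-room-reservation/location/address [street = 'Portage Ave.']
on_portage = root.findall("location/address[street='Portage Ave.']")
```

All three paths must match the same document for the multiple-path query to succeed, which is what makes the query a small tree pattern rather than a single path.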
• Tree Inclusion Problem
Definition (tree embedding) Let T and P be two labeled trees. A mapping M from the nodes of P to the nodes of T is an embedding of P
into T if it preserves labels and ancestorship. That is, for all nodes u and v of P, we require that
a) M(u) = M(v) if and only if u = v,
b) label(u) = label(M(u)),
c) u is an ancestor of v in P if and only if M(u) is an ancestor of M(v) in T, and
d) v is to the left of u iff M(v) is to the left of M(u).
An embedding is root preserving if M(root(P)) = root(T). It can be shown that restricting attention to root-preserving embeddings does not lose generality.
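The four conditions can be checked mechanically for a given mapping. The following is a minimal sketch, assuming each tree is numbered in preorder and postorder so that "ancestor" and "to the left of" become number comparisons; the Node class and the function names are hypothetical and not part of the slides.

```python
class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.pre = self.post = -1  # filled in by number()

def number(root):
    """Assign preorder and postorder numbers to every node of one tree."""
    counter = {"pre": 0, "post": 0}
    def visit(n):
        n.pre = counter["pre"]; counter["pre"] += 1
        for c in n.children:
            visit(c)
        n.post = counter["post"]; counter["post"] += 1
    visit(root)

def is_ancestor(u, v):
    return u.pre < v.pre and u.post > v.post

def is_left_of(u, v):
    # u comes before v in the tree and is not an ancestor of v
    return u.pre < v.pre and u.post < v.post

def is_embedding(pattern_nodes, M):
    """Check conditions a) to d) for a mapping M (dict: pattern node -> target node)."""
    for u in pattern_nodes:
        if u.label != M[u].label:                                   # b) labels
            return False
    for u in pattern_nodes:
        for v in pattern_nodes:
            if (u is v) != (M[u] is M[v]):                          # a) one-to-one
                return False
            if is_ancestor(u, v) != is_ancestor(M[u], M[v]):        # c) ancestorship
                return False
            if is_left_of(v, u) != is_left_of(M[v], M[u]):          # d) left-to-right order
                return False
    return True
```

Both trees have to be numbered with number() before is_embedding is called. The check is quadratic in the size of P, which is fine for validating a candidate mapping but is not a search procedure.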
Example:
[Figure: target tree T and pattern tree P. T is rooted at hotel-room-reservation and contains the labels name, location, type, reservation, city-or-district, state, country, address, number, street, post-code, rooms, price, from and to, with the values Travel-lodge, Winnipeg, Manitoba, Canada, 515, Portage Ave., R3B 2E9, one-bed-room, $119.00, April 20, 2005 and April 28, 2005. P is rooted at hotel-room-reservation and contains name, location, ?x, city-or-district, address, number and street, with the values Winnipeg, 515 and Portage Ave.]

M(P.hotel-room-reservation) = T.hotel-room-reservation
M(P.name) = T.name
M(P.location) = T.location
M(P.?x) = T.Travel-lodge
M(P.city-or-district) = T.city-or-district
M(P.address) = T.address
M(P.Winnipeg) = T.Winnipeg
M(P.515) = T.515
M(P.'Portage Ave.') = T.'Portage Ave.'
- Algorithms for Tree Inclusion Problem
Bottom-up algorithms:
• Kilpeläinen-Mannila's Algorithm (Pekka Kilpeläinen and Heikki Mannila, Ordered and unordered tree inclusion, SIAM Journal on Computing, 24:340-356, 1995.)
O(|T| |P|) time
O(|T| |P|) space
• Chen's Algorithm (W. Chen, More efficient algorithm for ordered tree inclusion, Journal of Algorithms, 26:370-385, 1998.)
O(|T| |leaves(P)|) time
O(|leaves(P)| min{height(P), |leaves(T)|}) space
Top-down algorithms:
• Y. Chen and Y.B. Chen, An Efficient Top-down Algorithm for Tree Inclusion, in Proc. of 18th Intl. Symposium on High Performance Computing Systems and Applications, Winnipeg, Canada: IEEE, May 2004, pp. 183-187.
O(|T| |leaves(P)|) time, no extra space needed
• Y. Chen and Y.B. Chen, On the Top-down Tree Inclusion Algorithm, submitted to Information Processing Letters.
O(|T| height(P)) time, no extra space needed
• Advantages of top-down over bottom-up:
- better computational complexity
- trees can be checked page-wise (suitable for large data volumes)
- signatures can be integrated into tree inclusion to cut off useless subtree checks as early as possible
Evaluation of Path-Oriented Queries
- Top-down Algorithm
Target tree: T = <t; T1, ..., Tk>, where t = root(T) and each Ti (i = 1, …, k) is a subtree of t;
Pattern forest: G = <P1, ..., Pq>, where each Pj (j = 1, …, q) is a subtree.
• Main idea:
The algorithm attempts to find the number j (j ≥ 0) of subtrees within an ordered forest G = <P1, ..., Pq> (q ≥ 1) that are embedded in a target tree T. If j = q, we say that G is embedded in T. If j < q, then only the trees P1, ..., Pj are embedded in T. Let p1, ..., pq and t be the roots of P1, ..., Pq and T, respectively. Since a forest does not have a root, we use a virtual node pv to serve as a substitute for root(G). Thus, root(G) returns pv if G = <P1, ..., Pq> with q > 1, and returns p1 if q = 1.
Case 1: root(G) ≠ pv (i.e., G = <P> is a tree and root(G) = p), and label(p) ≠ label(t). If G is embedded in T, then there must exist a subtree Ti of t that contains the whole of G. The algorithm should return 1 if an embedding can be found and 0 if it cannot.
[Figure: Case 1, label(root(T)) ≠ label(root(G)); tree G is included in a subtree Ti of T.]
Case 2: root(G) ≠ pv (i.e., G = <P> and root(G) = p), and label(p) = label(t). Let <P1, ..., Pl> (l ≥ 0) be the forest of subtrees of p and <T1, ..., Tk> the forest of subtrees of t. If G is embedded in T, there must exist two sequences of integers k1, ..., kg and l1, ..., lg (g ≤ l) such that T_{k_i} includes <P_{l_{i-1}+1}, ..., P_{l_i}> (i = 1, ..., g, l0 = 0, lg = l), where <P_{l_{i-1}+1}, ..., P_{l_i}> represents a forest containing the subtrees P_{l_{i-1}+1}, ..., P_{l_i}. Thus, if lg = l, the algorithm should return 1 since we have a root-preserving inclusion of G in T. Otherwise, it should return 0.
[Figure: Case 2, label(root(T)) = label(root(G)); the subtrees T_{k_1}, ..., T_{k_g} of t include consecutive groups of the subtrees P1, ..., Pl of p.]
Case 3: root(G) = pv and there exists an integer j (0 ≤ j ≤ q) such that <P1, ..., Pj> is included in T. If j = q, then the whole of G is embedded in T. There are two possibilities to consider when looking for j. The first possibility is similar to Case 2: there are two sequences of integers k1, ..., kg and l1, ..., lg (g ≤ q) that represent the order in which the subtrees of root(G) are embedded in the subtrees of root(T). In this case, j = lg.
If j = 0, we check the second possibility to see whether there exists a root-preserving inclusion of P1 in T, i.e., label(p1) = label(t) and the subtrees of p1 are included in the subtrees of t. In this case, j = 1.
[Figure: the two possibilities in Case 3, where root(G) is a virtual node. Possibility 1: the subtrees T_{k_1}, ..., T_{k_g} of t include consecutive groups of the subtrees P1, ..., P_{l_g}. Possibility 2: label(root(T)) = label(root(P1)), and the subtrees of t include the subtrees of P1's root.]
function top-down-process(T, G)
input: T = <t; T1, ..., Tk>, G = <p; P1, ..., Pq> (* p may or may not be a virtual node. *)
output: if root(G) is virtual, returns j ≥ 0; else returns 1 if T includes G; otherwise returns 0.
begin
  if root(G) is virtual then
    if (|T| < |P1| + |P2| or p has only one child)
    then G := P1;
    else { j := bottom-up-process(T, G);
           if (j = 0 and label(t) = label(P1's root))   (* second possibility in Case 3 *)
           then { change P1's root to a virtual node;
                  x := bottom-up-process(T, P1);
                  if (x = the number of the children of P1's root)
                  then j := 1 else j := 0; }
           return j; }
  if |T| < |G| return 0;
  else { if (label(t) = label(p))                       (* handling Case 2 *)
         then { p := virtual node;
                j := bottom-up-process(T, G);
                (* l is the number of children of p *)
                if (j = l) then return 1 else return 0; }
         else { if t is a leaf then return 0;
                (* handling Case 1 *)
                i := 1;
                while (i ≤ k) do
                { if top-down-process(Ti, G) > 0 then return 1;
                  i := i + 1; }
                return 0; } }
end

function bottom-up-process(T, G)
input: T = <t; T1, ..., Tk>, G = <p; P1, ..., Pq>
output: j - an integer
begin
  j := 0; i := 1;
  while (j < q and i ≤ k) do
  { x := top-down-process(Ti, G);
    j := j + x; G := <p; Pj+1, ..., Pq>; i := i + 1; }
  return j;
end
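The three cases can be condensed into a short recursive sketch. The Python below is a simplified reading of the idea, not the authors' exact procedure: it omits the size-based pruning (the |T| < |G| tests) and the virtual-node bookkeeping, and the names Node, embedded_prefix, forest_in_forest and includes are hypothetical. It relies on the observation that matching pattern trees greedily from left to right is safe, because any suffix of an embedded forest is itself embedded.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)

def embedded_prefix(t: Node, forest: List[Node]) -> int:
    """Largest j such that <P1, ..., Pj> is embedded in the tree rooted at t."""
    # Possibility 1 (also Case 1 when the forest is a single tree):
    # spread the pattern trees over the subtrees of t, left to right.
    j = forest_in_forest(t.children, forest)
    # Possibility 2 (Case 2 for a single tree): root-preserving inclusion of P1.
    if j == 0 and forest and forest[0].label == t.label:
        p1 = forest[0]
        if forest_in_forest(t.children, p1.children) == len(p1.children):
            j = 1
    return j

def forest_in_forest(targets: List[Node], forest: List[Node]) -> int:
    """Counterpart of bottom-up-process: consume pattern trees subtree by subtree."""
    j = 0
    for ti in targets:
        if j == len(forest):
            break
        j += embedded_prefix(ti, forest[j:])
    return j

def includes(t: Node, p: Node) -> bool:
    """True if pattern tree p is included in target tree t (ordered inclusion)."""
    return embedded_prefix(t, [p]) == 1
```

For example, with T = a(b, c(d)), the pattern a(b, d) is included, while a(c, b) is not because the left-to-right order of b and c is reversed. The sketch may revisit subtrees and therefore does not claim the time bounds quoted for the published algorithms.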
Integration of Signatures into Top-down Inclusion
Definition: A signature for a key word or an attribute value is a hash-coded bit string.
- Example: (constructing a signature for a word with m = 4 and F = 12)
- Procedure for calculating the signature length
[Figure: signature length in bits (0 to 250) plotted against the number of key words (100 to 600), with curves for F = 6, m = 3; F = 8, m = 4; and F = 10, m = 5.]
In the figure, F stands for the initial length of the signatures and m for
the initial number of bits set to 1.
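To make the definition concrete, here is a minimal sketch of building an F-bit word signature with m bits set to 1, and of the superimposed (bitwise OR) test that lets a subtree be skipped early. The hash function (SHA-256), the way bit positions are derived from it, and the function names are assumptions for illustration; the slides do not specify them.

```python
import hashlib
from functools import reduce
from typing import Iterable

def word_signature(word: str, F: int = 12, m: int = 4) -> int:
    """Return an F-bit signature (as an int) with exactly m bits set. Requires m <= F."""
    positions = set()
    counter = 0
    while len(positions) < m:
        # Assumption: derive candidate bit positions from successive SHA-256 digests.
        digest = hashlib.sha256(f"{word}:{counter}".encode("utf-8")).digest()
        positions.add(int.from_bytes(digest[:4], "big") % F)
        counter += 1
    signature = 0
    for pos in positions:
        signature |= 1 << pos
    return signature

def superimpose(signatures: Iterable[int]) -> int:
    """Signature of a subtree: bitwise OR of the signatures of the words it contains."""
    return reduce(lambda a, b: a | b, signatures, 0)

def may_contain(subtree_sig: int, query_sig: int) -> bool:
    """False means the word is definitely absent; True may be a false drop."""
    return subtree_sig & query_sig == query_sig

winnipeg = word_signature("Winnipeg")
subtree = superimpose([winnipeg, word_signature("Manitoba"), word_signature("Canada")])
```

If may_contain returns False for a query word, the corresponding subtree cannot contain that word and the inclusion check for it can be skipped; a True result still has to be verified against the actual data.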
Experiment Results
- Test Platform
Computer - DELL desktop PC equipped with a Pentium III 864 MHz processor, 512 MB RAM and a 20 GB hard disk.
Database system - Oracle-9i Enterprise Edition. The default buffer cache of Oracle-9i is 4 MB.
Language - Oracle PL/SQL.
Data - all 37 of Shakespeare's plays, stored in a database
Relation name    Size
Element          12 MB
Text             8 MB
Attribute        < 64 KB
Signature        < 64 KB
- Storage of XML documents in databases
All the documents are stored in three tables.
The relation Element has the following structure:
{DocID: <integer>, ID: <integer>, Ename: <string>, firstChildID: <integer>, siblingID: <integer>, attributeID: <integer>}
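The firstChildID and siblingID columns suggest a first-child/next-sibling encoding of the document tree. The sketch below shows how the children of an element could be recovered from such rows; the Python field names mirror the columns, while the use of None for a missing child or sibling and the sample rows are assumptions, since the slides do not say how absent links are stored.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class ElementRow:
    doc_id: int
    id: int
    ename: str
    first_child_id: Optional[int]   # assumption: None when the element has no child
    sibling_id: Optional[int]       # assumption: None when there is no next sibling
    attribute_id: Optional[int]

def children(rows: Dict[int, ElementRow], node_id: int) -> List[ElementRow]:
    """Follow firstChildID once, then siblingID repeatedly, to list the children in order."""
    result = []
    child_id = rows[node_id].first_child_id
    while child_id is not None:
        row = rows[child_id]
        result.append(row)
        child_id = row.sibling_id
    return result

# Hypothetical rows for a fragment of the hotel-room-reservation example.
rows = {
    1: ElementRow(1, 1, "hotel-room-reservation", 2, None, None),
    2: ElementRow(1, 2, "name", None, 3, None),
    3: ElementRow(1, 3, "location", 4, None, None),
    4: ElementRow(1, 4, "city-or-district", None, None, None),
}
child_names = [row.ename for row in children(rows, 1)]
```

With this layout a top-down traversal touches only the rows it actually visits, which fits the page-wise checking mentioned among the advantages of the top-down approach.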