1
Introduction to XML Algebra
CS561
2
Data Model
Data model ~ core data structures and data types supported by a DBMS
A relational database uses a table (set-oriented) data model
The XML format is a tree-structured, hierarchical model
3
Why Query Algebra (for XML)?
It is common to translate a query language into an algebra.
First, the algebra is used to give a semantics for the query language.
Second, the algebra is used to support query optimization.
5
NIAGARA
Title: Following the Paths of XML Data: An Algebraic Framework for XML Query Evaluation
By: Leonidas Galanis, Efstratios Viglas, David J. DeWitt, Jeffrey F. Naughton, and David Maier
Univ. of Wisconsin
6
Outline
Concepts of Niagara Algebra
Operations
Optimization
7
Goals of Niagara Algebra
Be independent of schema information
Query on both structure and content
Generate simple, flexible, yet powerful algebraic expressions
Allow re-use of traditional optimization techniques
8
Example: XML Source Documents
Invoice.xml
<Invoice_Document>
<invoice No="1">
<account_number>2 </account_number>
<carrier>AT&T</carrier>
<total>$0.25</total>
</invoice>
<invoice>
<account_number>1 </account_number>
<carrier>Sprint</carrier>
<total>$1.20</total>
</invoice>
<invoice>
<account_number>1 </account_number>
<carrier>AT&T</carrier>
<total>$0.75</total>
</invoice>
</Invoice_Document>
Customer.xml
<Customer_Document>
<customer>
<account>1 </account>
<name>Tom </name>
</customer >
<customer>
<account>2 </account>
<name>George </name>
</customer >
</Customer_Document>
9
XML Data Model and Tree Graph
Example:
<Invoice_Document>
<invoice> <number>2</number> <carrier>AT&T</carrier> <total>$0.25</total> </invoice>
<invoice> <number>1</number> <carrier>Sprint</carrier> <total>$1.20</total> </invoice>
</Invoice_Document>
[Figure: tree graph with root Invoice_Document and invoice children; each invoice has the children number, carrier, and total, with leaf values (2, AT&T, $0.25) and (1, Sprint, $1.20), …]
Ordered tree graph, semi-structured data
10
XML Data Model (for Querying)
SQL: relations in, relation out.
Relational Algebra: relations in, relation out.
XQuery: XML doc in, XML docs out.
XML Algebra: ??
11
XML Data Model [GVDNM01]
Collection of bags of vertices.
Vertices in a bag have no order.
Example:
[Root "invoice.xml", invoice, invoice.account_number]
[Figure: a bag with three vertices: Root invoice.xml, invoice (<invoice> invoice-element-content </invoice>), and invoice.account_number (<account_number> element-content </account_number>)]
12
Data Model
Bag elements are reachable by path expressions.
A path expression consists of two parts:
An entry point
A relative forward part
Example: account_number:invoice
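The two-part structure can be sketched in a few lines of Python. This is only an illustration: `PathExpression`, `parse_path`, and `result_name` are hypothetical names, and the reading "forward part before the colon, entry point after it" is taken from the slides' own examples, where (carrier:invoice) applied to a bag containing invoice yields invoice.carrier.

```python
from typing import NamedTuple


class PathExpression(NamedTuple):
    forward: str      # relative forward part, e.g. "account_number"
    entry_point: str  # vertex already present in the bag, e.g. "invoice"


def parse_path(text: str) -> PathExpression:
    """Split 'forward:entry_point' into its two parts (hypothetical helper)."""
    forward, sep, entry = text.partition(":")
    if not sep or not forward.strip() or not entry.strip():
        raise ValueError(f"expected 'forward:entry_point', got {text!r}")
    return PathExpression(forward.strip(), entry.strip())


def result_name(path: PathExpression) -> str:
    """Name under which the reached vertex appears in the bag."""
    return f"{path.entry_point}.{path.forward}"


p = parse_path("account_number:invoice")
print(p.entry_point, p.forward, result_name(p))
# invoice account_number invoice.account_number
```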
13
Outline
Concepts of Niagara Algebra
Operations
Optimization
14
Operators
Source S , Follow , Expose , Vertex ,
Source S , Select , Join , Rename ,
Group , Union , Intersection , Difference - , Cartesian Product .
15
Source Operator S
Input: a list of documents
Output: a collection of singleton bags
Examples:
S (*): all known XML documents
S (invoice*.xml): all XML documents whose filename matches "invoice*.xml"
S (*,schema.dtd): all known XML documents that conform to schema.dtd
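A minimal sketch of the filename-pattern form of the source operator, assuming the "known documents" live in an in-memory dictionary. `KNOWN_DOCUMENTS` and `source` are hypothetical names, the file names are made up, and DTD conformance checking is left out.

```python
from fnmatch import fnmatch

# Hypothetical registry of "known" documents: filename -> XML text.
KNOWN_DOCUMENTS = {
    "invoice1.xml": "<Invoice_Document/>",
    "invoice2.xml": "<Invoice_Document/>",
    "customer.xml": "<Customer_Document/>",
}


def source(pattern, documents=KNOWN_DOCUMENTS):
    """Return a collection of singleton bags, one per matching document."""
    return [
        {f"Root {name}": text}          # a singleton bag: only the document root
        for name, text in sorted(documents.items())
        if fnmatch(name, pattern)
    ]


bags = source("invoice*.xml")
print([list(bag) for bag in bags])
# [['Root invoice1.xml'], ['Root invoice2.xml']]
```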
16
Follow operator
Input: a path expression in entry point notation
Functionality: extracts vertices reachable by the path expression
Output: a new bag that consists of the extracted vertex + all contents of the original bag (in case of unnesting follow)
17
Follow operator (Example*)
Operator: Φμ(carrier:invoice)
Input: {[Root invoice.xml , invoice]}
Output: {[Root invoice.xml , invoice, invoice.carrier]}
[Figure: the input bag holds the vertices Root invoice.xml and invoice (<invoice> invoice-element-content </invoice>); the output bag additionally holds invoice.carrier (<carrier> carrier-element-content </carrier>)]
* Unnesting follow
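The unnesting follow can be sketched with Python's standard `xml.etree.ElementTree`. Bags are modelled here as dictionaries from vertex name to element, and `follow` is a hypothetical helper, not Niagara's implementation. The sketch assumes that a bag whose entry point reaches no vertex produces no output bag.

```python
import xml.etree.ElementTree as ET

INVOICE_XML = """
<Invoice_Document>
  <invoice><account_number>2</account_number><carrier>AT&amp;T</carrier></invoice>
  <invoice><account_number>1</account_number><carrier>Sprint</carrier></invoice>
  <invoice><account_number>1</account_number></invoice>
</Invoice_Document>
"""


def follow(bags, path):
    """Unnesting follow: one output bag per vertex reached from the entry point."""
    forward, entry = path.split(":")
    out = []
    for bag in bags:
        for vertex in bag[entry].findall(forward):
            new_bag = dict(bag)                     # keep all contents of the original bag
            new_bag[f"{entry}.{forward}"] = vertex  # add the extracted vertex
            out.append(new_bag)
    return out


root = ET.fromstring(INVOICE_XML)
bags = [{"Root invoice.xml": root, "invoice": inv} for inv in root.findall("invoice")]
result = follow(bags, "carrier:invoice")
print([bag["invoice.carrier"].text for bag in result])
# ['AT&T', 'Sprint']
```

The third invoice has no carrier element, so under the stated assumption it contributes no bag to the result.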
18
Select operator
Input: a set of bags
Functionality: filters the bags of a collection using a predicate
Output: a set of bags that conform to the predicate
Predicate: logical operators (∧, ∨, ¬) or simple qualifications (>, <, ≥, ≤, =, ≠)
19
Select operator (Example)
Predicate: invoice.carrier = Sprint
Input: {[Root invoice.xml , invoice], [Root invoice.xml , invoice], …}
Output: {[Root invoice.xml , invoice], …}
[Figure: three input bags, each holding Root invoice.xml and invoice (<invoice> invoice-element-content </invoice>); only the bags whose invoice satisfies the predicate appear in the output]
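Selection is a plain filter over the collection of bags. The sketch below assumes dictionary-shaped bags whose invoice.carrier vertex has already been extracted; `select` and `text_of` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET


def select(bags, predicate):
    """Keep only the bags for which the predicate holds."""
    return [bag for bag in bags if predicate(bag)]


def text_of(bag, name):
    """Trimmed text of a vertex in the bag, or None if the vertex is missing."""
    vertex = bag.get(name)
    if vertex is None or vertex.text is None:
        return None
    return vertex.text.strip()


root = ET.fromstring(
    "<Invoice_Document>"
    "<invoice><carrier>AT&amp;T</carrier><total>$0.25</total></invoice>"
    "<invoice><carrier>Sprint</carrier><total>$1.20</total></invoice>"
    "</Invoice_Document>"
)
bags = [
    {"invoice": inv, "invoice.carrier": inv.find("carrier")}
    for inv in root.findall("invoice")
]

sprint = select(bags, lambda bag: text_of(bag, "invoice.carrier") == "Sprint")
print([bag["invoice"].findtext("total") for bag in sprint])
# ['$1.20']
```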
20
Join operator
Input: two collections of bags
Functionality: joins the two collections based on a predicate
Output: the concatenation of pairs of bags that satisfy the predicate
21
Join operator (Example)
Predicate: account_number:invoice = account:customer
Inputs: {[Root invoice.xml , invoice]} and {[Root customer.xml , customer]}
Output: {[Root invoice.xml , invoice, Root customer.xml , customer]}
[Figure: an invoice bag (<invoice> invoice-element-content </invoice>) and a customer bag (<customer> customer-element-content </customer>) are concatenated into one output bag]
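A nested-loop sketch of the join, assuming bags are dictionaries whose values have already been reduced to text. `join` is a hypothetical helper; the sample values come from the Invoice.xml and Customer.xml documents on the earlier slide.

```python
def join(left, right, predicate):
    """Concatenate every pair of bags (one from each side) that satisfies the predicate."""
    return [
        {**left_bag, **right_bag}
        for left_bag in left
        for right_bag in right
        if predicate(left_bag, right_bag)
    ]


invoices = [
    {"invoice.account_number": "2", "invoice.total": "$0.25"},
    {"invoice.account_number": "1", "invoice.total": "$1.20"},
]
customers = [
    {"customer.account": "1", "customer.name": "Tom"},
    {"customer.account": "2", "customer.name": "George"},
]

joined = join(
    invoices,
    customers,
    lambda inv, cust: inv["invoice.account_number"] == cust["customer.account"],
)
print([(bag["customer.name"], bag["invoice.total"]) for bag in joined])
# [('George', '$0.25'), ('Tom', '$1.20')]
```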
22
Expose operator
Input: a list of path expressions of vertices to be exposed
Output: a set of bags that contains vertices in the parameter list with the same order
23
Expose operator (Example)
Operator: (bill_period, carrier)
Input: {[Root invoice.xml , invoice, invoice.carrier, invoice.bill_period]}
Output: {[Root invoice.xml , invoice.bill_period, invoice.carrier]}
[Figure: the input bag holds Root invoice.xml, invoice, invoice.carrier, and invoice.bill_period with their element contents; the output bag holds only Root invoice.xml, invoice.bill_period, and invoice.carrier, in the order of the parameter list]
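Expose can be sketched as a projection that also fixes the output order. The sketch assumes dictionary-shaped bags, that the document root stays in the bag (as in the example output above), and made-up placeholder values; `expose` is a hypothetical helper.

```python
def expose(bags, names):
    """Keep the root entry plus the listed vertices, in parameter-list order."""
    out = []
    for bag in bags:
        kept = {key: value for key, value in bag.items() if key.startswith("Root ")}
        kept.update((name, bag[name]) for name in names)  # insertion order = list order
        out.append(kept)
    return out


# Placeholder values; only the vertex names matter for this illustration.
bags = [{
    "Root invoice.xml": "root-vertex",
    "invoice": "invoice-vertex",
    "invoice.carrier": "carrier-vertex",
    "invoice.bill_period": "bill-period-vertex",
}]

exposed = expose(bags, ["invoice.bill_period", "invoice.carrier"])
print([list(bag) for bag in exposed])
# [['Root invoice.xml', 'invoice.bill_period', 'invoice.carrier']]
```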
24
Vertex operator
Creates the actual XML vertex that will encompass everything created by an expose operator
Example:
(Customer_invoice)[((account)[invoice.account_number], (inv_total)[invoice.total])]
25
Other operators
Group: used for arbitrary grouping of elements based on their values
Aggregate functions can be used with the group operator (e.g. average)
Rename: changes the entry point annotation of elements of a bag
Example: (invoice.bill_period,date)
26
Example: XML Source Documents
Invoice.xml
<Invoice_Document>
<invoice>
<account_number>2 </account_number>
<carrier>AT&T</carrier>
<total>$0.25</total>
</invoice>
<invoice>
<account_number>1 </account_number>
<carrier>Sprint</carrier>
<total>$1.20</total>
</invoice>
<invoice>
<account_number>1 </account_number>
<total>$0.75</total>
</invoice>
<auditor> maria </auditor>
</Invoice_Document>
Customer.xml
<Customer_Document>
<customer>
<account>1 </account>
<name>Tom </name>
</customer >
<customer>
<account>2 </account>
<name>George </name>
</customer >
</Customer_Document>
27
XQuery Example
List account number, customer name, and invoice total for all invoices that have carrier = "Sprint".
for $i in doc("invoices.xml")//invoice,
    $c in doc("customers.xml")//customer
where $i/carrier = "Sprint" and
      $i/account_number = $c/account
return
<Sprint_Invoice>
{ $i/account_number,
  $c/name,
  $i/total }
</Sprint_Invoice>
28
Example: XQuery output
<Sprint_Invoice>
<account_number>1 </account_number>
<name>Tom </name>
<total>$1.20</total>
</Sprint_Invoice >
29
Algebra Tree Execution
(read bottom-up; each step is followed by the bags it produces)
Source (invoices.xml), Follow (*.invoice): invoice (1), invoice (2), invoice (3)
Select (carrier = "Sprint"): invoice (2)
Source (customers.xml), Follow (*.customer): customer (1), customer (2)
Join (*.invoice.account_number = *.customer.account): invoice (2), customer (1)
Expose (*.account_number, *.name, *.total): account_number, name, total
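The whole plan can be traced end to end on the two example documents. This is a sketch under simplifying assumptions, not the Niagara engine: the documents are inlined as strings, each operator is written as a list comprehension, and `text` is a hypothetical helper that trims the whitespace found in the sample data.

```python
import xml.etree.ElementTree as ET

INVOICES = """<Invoice_Document>
<invoice><account_number>2 </account_number><carrier>AT&amp;T</carrier><total>$0.25</total></invoice>
<invoice><account_number>1 </account_number><carrier>Sprint</carrier><total>$1.20</total></invoice>
<invoice><account_number>1 </account_number><total>$0.75</total></invoice>
<auditor> maria </auditor>
</Invoice_Document>"""

CUSTOMERS = """<Customer_Document>
<customer><account>1 </account><name>Tom </name></customer>
<customer><account>2 </account><name>George </name></customer>
</Customer_Document>"""


def text(element, tag):
    """Whitespace-trimmed text of a child element, or None if it is absent."""
    value = element.findtext(tag)
    return None if value is None else value.strip()


# Source + Follow: one bag per invoice / customer element.
invoices = [{"invoice": e} for e in ET.fromstring(INVOICES).findall("invoice")]
customers = [{"customer": e} for e in ET.fromstring(CUSTOMERS).findall("customer")]

# Select (carrier = "Sprint")
sprint = [bag for bag in invoices if text(bag["invoice"], "carrier") == "Sprint"]

# Join (invoice.account_number = customer.account)
joined = [
    {**inv, **cust}
    for inv in sprint
    for cust in customers
    if text(inv["invoice"], "account_number") == text(cust["customer"], "account")
]

# Expose (account_number, name, total)
rows = [
    (text(b["invoice"], "account_number"), text(b["customer"], "name"), text(b["invoice"], "total"))
    for b in joined
]
print(rows)
# [('1', 'Tom', '$1.20')]
```

The single row matches the XQuery output shown earlier: account 1, Tom, $1.20.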
30
Outline
Concepts of Niagara Algebra
Operations
Optimization
31
Optimization with Niagara
Optimizer based on Niagara algebra:
Use the operations more efficiently
Produce simpler expressions by combining operations
32
Language Convention
A and B are path expressions
A< B --- Path expression A is a prefix of B
AnB --- Common prefix of paths A and B
AńB --- Greatest common prefix of paths A and B
┴ --- Null path expression
33
Heuristics using Rewrite Rules
Allow optimization based on path selectivity
When applying un-nesting with operation Φμ
34
Interchangeability of Follow operation
Φμ(A)[Φμ(B)] = Φμ(B)[Φμ(A)]
TRUE or FALSE?
TRUE when
there exists C such that C < A && C < B and C = AńB,
or AnB = ┴
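The side condition of the rule can be checked mechanically. The sketch below makes two assumptions that the slides do not state: a path expression such as carrier:invoice is modelled as the step sequence (invoice, carrier), and "<" is read as proper prefix. All function names are hypothetical.

```python
def to_steps(path):
    """'carrier:invoice' -> ('invoice', 'carrier'): entry point first, then forward part."""
    forward, entry = path.split(":")
    return tuple(entry.split(".") + forward.split("."))


def greatest_common_prefix(a, b):
    """Longest shared leading run of steps; the empty tuple plays the role of the null path."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)


def is_proper_prefix(c, a):
    return len(c) < len(a) and a[:len(c)] == c


def follows_commute(path_a, path_b):
    """True when the rewrite rule's side condition holds for the two follow paths."""
    a, b = to_steps(path_a), to_steps(path_b)
    c = greatest_common_prefix(a, b)
    if not c:                      # no common prefix: the paths are unrelated
        return True
    return is_proper_prefix(c, a) and is_proper_prefix(c, b)


print(follows_commute("acc_Num:invoice", "carrier:invoice"))    # True
print(follows_commute("acc_Num:invoice", "acc_Num:customer"))   # True
print(follows_commute("carrier:invoice", "name:invoice.carrier"))  # False
```

The last call shows the case the condition rules out: one follow starts from the vertex that the other produces, so the two cannot be swapped.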
35
Application of Rule on Invoice
Φμ(acc_Num:invoice)[Φμ(carrier:invoice)]
==
Φμ(carrier:invoice)[Φμ(acc_Num:invoice)] ?
TRUE or FALSE?
36
Application of Rule on Invoice
Φμ(acc_Num:invoice)[Φμ(carrier:invoice)]
=Φμ(carrier:invoice)[Φμ(acc_Num:invoice)]
TRUE because both share common prefix “invoice”.
Case AńB = invoice
37
Benefit of Rule Application
NOTE: Assume acc_Num is required for each invoice element, while carrier is not.
THEN:
Φμ(acc_Num:invoice)[Φμ(carrier:invoice)]
==
Φμ(carrier:invoice)[Φμ(acc_Num:invoice)]
Then which algebra tree do we prefer?
38
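Discussion
Reduction of input size by the first sub-operation:
Φμ(carrier:invoice)
vs. Φμ(acc_Num:invoice)
Because carrier is optional, Φμ(carrier:invoice) can return fewer bags than Φμ(acc_Num:invoice), so applying it first shrinks the input of the second follow.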
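The effect on intermediate result sizes can be counted on the sample invoices. This is a sketch under the assumption that an unnesting follow drops bags in which the target element is missing; `follow` is a hypothetical helper, and the element is spelled account_number as in the sample document.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""<Invoice_Document>
<invoice><account_number>2</account_number><carrier>AT&amp;T</carrier></invoice>
<invoice><account_number>1</account_number><carrier>Sprint</carrier></invoice>
<invoice><account_number>1</account_number></invoice>
</Invoice_Document>""")


def follow(bags, child, entry="invoice"):
    """Unnesting follow: one output bag per child vertex found under the entry vertex."""
    return [
        {**bag, f"{entry}.{child}": vertex}
        for bag in bags
        for vertex in bag[entry].findall(child)
    ]


start = [{"invoice": inv} for inv in root.findall("invoice")]

carrier_first = follow(start, "carrier")          # optional element
account_first = follow(start, "account_number")   # required element
print(len(carrier_first), len(account_first))     # 2 3

# Both orders produce the same number of final bags; only the intermediate size differs.
plan_a = follow(carrier_first, "account_number")
plan_b = follow(account_first, "carrier")
print(len(plan_a), len(plan_b))                   # 2 2
```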
39
Can we apply the rule below?
Φμ(acc_Num:invoice)[Φμ(acc_Num:Customer)]
40
Example
“acc_Num:invoice” and
“acc_Num:customer”
are two totally different paths.
Case is: AnB = ┴
So yes, the rule is valid.
41
Summary
XML Algebra
Operations
Optimization