NTUA April 17, 2003 1 XML Query Reformulation Val Tannen University of Pennsylvania Joint work with Alin Deutsch, UC San Diego and in part with Lucian Popa, IBM Almaden
Page 1

XML Query Reformulation

Val Tannen

University of Pennsylvania

Joint work with Alin Deutsch, UC San Diego

and in part with Lucian Popa, IBM Almaden

Page 2

Data Exchange Between Businesses Using XML

[Diagram: a hospital, an insurance company, and a pharmaceutical company each keep proprietary data and exchange published data with one another as XML.]

Page 3

XML?

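<drug>
  <name>aspirin</name>
  <price>$4</price>
  <notes>
    <side-effects>upset stomach</side-effects>
    <maker>Bayer</maker>
  </notes>
</drug>

[Figure: the same document drawn as a tree. The root drug has children name (“aspirin”), price (“$4”), and notes; notes has children side-effects (“upset stomach”) and maker (“Bayer”). Callouts mark an opening tag, its matching closing tag, and text.]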

Page 4

A Simple Publishing Scenario

Proprietary data (relational):

prescription
usage    drug         name
2/day    aspirin      John
3/day    cortisone    Jane

patient
name    diagnosis
John    migraine
Jane    allergy

Published data (virtual data; patient name is hidden):

<study>
  <case>
    <diag>migraine</diag>
    <drug>aspirin</drug>
    <usage>2/day</usage>
  </case>
  <case>
    <diag>allergy</diag>
    <drug>cortisone</drug>
    <usage>3/day</usage>
  </case>
</study>

The client poses a client query (in XQuery, the draft XML query language standard) against the published data.

The correspondence between proprietary and published data is expressed by a publishing query (view).

View = query which, if executed, would produce the virtual data

The reformulation (SQL) runs against the proprietary data.

How to express the view?

How to “compose” the client query with the view, obtaining the reformulation?
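
The publishing view of this scenario can be sketched in Python as a join of the two proprietary tables that emits one case element per matching pair and leaves out the patient name. This is only an illustration under assumptions: the function name publish_study is hypothetical, and the talk expresses the view in XQuery rather than Python.

```python
import xml.etree.ElementTree as ET

# Proprietary relational data, as shown on the slide.
prescription = [  # (usage, drug, name)
    ("2/day", "aspirin", "John"),
    ("3/day", "cortisone", "Jane"),
]
patient = [  # (name, diagnosis)
    ("John", "migraine"),
    ("Jane", "allergy"),
]


def publish_study(prescription, patient):
    """Hypothetical publishing view: join on patient name, hide the name."""
    study = ET.Element("study")
    for usage, drug, pname in prescription:
        for name, diagnosis in patient:
            if pname == name:
                case = ET.SubElement(study, "case")
                ET.SubElement(case, "diag").text = diagnosis
                ET.SubElement(case, "drug").text = drug
                ET.SubElement(case, "usage").text = usage
    return study


published = ET.tostring(publish_study(prescription, patient), encoding="unicode")
print(published)
```

In the scenario of the slide this document is never materialized: the view only defines what the client is allowed to query.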

Page 5

The General Problem of Query Reformulation

[Diagram: the client poses query Q(P) against schema P; a schema correspondence relates P to schema S; the reformulated query X(S) runs against S.]

Given query Q(P), find query(ies) X(S) returning same answer (soundness),

whenever such X(S) exists (completeness)

Page 6

Applications of Query Reformulation

• data publishing (we just saw it): public schema P / storage schema S

• data integration: global schema P / local schema S

• schema evolution: old schema P / new schema S

• data security (illustrated next)

Page 7

An Application: Data Security

[Diagram: the client sees the public schema P; a schema correspondence relates P to the proprietary schema S.]

Query E(S) exposes a secret data correlation, e.g. (patient, ailment).

An intrusive query I(P) would obtain the same answer from public data, e.g. (patient, physician) + (physician, ailment).

Want to be sure that there is no I(P) returning same answer as E(S)

Only possible if Completeness Property holds!

Page 8

More Complicated Data Publishing: Mixed And Redundant Storage (MARS)

Initial configuration:

public schema: published XML (virtual), a view of proprietary data that may hide information

storage schema: proprietary XML data and proprietary relational data

schema correspondence between the two

After tuning, the storage also holds redundant data:

materialized views, indexes

cached queries

partial relational storage of XML

Page 9

An Example With Tuning

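[Diagram: storage after tuning. The proprietary relational DB holds the tables (drug, usage, name) and (name, diagnosis); the proprietary XML holds (drug, price, notes). The published XML consists of (drug, price, notes), obtained by an identity view, and (drug, usage, diagnosis), obtained by the simple publishing view. Redundant storage: a relational view (drug, price) and a cached query stored as XML (diagnosis, drug).]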

Page 10

Redundancy Enables Multiple Reformulations

[Diagram: the storage configuration of the previous slide, with three alternative reformulations R1, R2, R3 of the client query marked on it.]

client query: “find how much each treatment costs”

Some reformulations are potentially cheaper to execute than others. Want to find an “optimal” one!

Page 11

Schema Correspondence Expressible in XQuery

[Diagram: the relational DBs are encoded as XML, so that every part of the correspondence between stored and published data becomes an XQuery from XML to XML.]

The DB administrator must be able to specify the correspondence.

Can use XQuery, fixing any of the common encodings of relational tables in XML.

Page 12

XQuery?

for $d in document/drug, $m in $d//maker          (binding part)

return <producedBy>$m/text()</producedBy>          (tagging template)

[Figure: the drug document tree from the “XML?” slide.]

Result should contain

<producedBy>Bayer</producedBy>

// (descendant) is the transitive closure of / (child)
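
To make the two parts of the query concrete, here is a small Python sketch that mimics it with the standard library's ElementTree: nested loops play the role of the binding part, and building a new element plays the role of the tagging template. The function name produced_by is hypothetical, and Element.iter is used as a stand-in for the descendant step (it also visits the element itself, which is harmless here because a drug element is never tagged maker).

```python
import xml.etree.ElementTree as ET

DRUG_XML = """
<drug>
  <name>aspirin</name>
  <price>$4</price>
  <notes>
    <side-effects>upset stomach</side-effects>
    <maker>Bayer</maker>
  </notes>
</drug>
"""


def produced_by(xml_text):
    """Mimic: for $d in document/drug, $m in $d//maker return <producedBy>..."""
    top = ET.fromstring(xml_text)
    # Binding part: $d ranges over drug elements at the top of the document.
    drugs = [top] if top.tag == "drug" else top.findall("drug")
    results = []
    for d in drugs:
        # $m ranges over maker elements below $d (descendant navigation).
        for m in d.iter("maker"):
            # Tagging template: wrap the text of $m in a new element.
            out = ET.Element("producedBy")
            out.text = m.text
            results.append(ET.tostring(out, encoding="unicode"))
    return results


print(produced_by(DRUG_XML))
```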

Page 13

Approach: XQuery Reformulation Reduced to Relational Reformulation

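[Architecture diagram, read left to right:]

Inputs: the client XQuery; the schema correspondence, given by mappings written as XQueries; XML integrity constraints.

Compilation: the client XQuery compiles to relational queries; the mappings and the XML integrity constraints compile to relational constraints.

GReX (Generic Relational encoding of XML): built-in relational constraints that capture the XML data model.

C&B: takes the relational queries and the relational constraints and produces reformulated queries.

Output: reformulated queries (multiple solutions).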
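
A generic relational encoding of an XML document can be sketched in Python as follows. The relation names root, el, child, desc, and tag are the ones used in these slides; the text relation, the integer node ids, and the function name encode are assumptions made for this illustration only.

```python
import xml.etree.ElementTree as ET
from itertools import count


def encode(xml_text):
    """Encode an XML document as sets of tuples (a sketch, not the MARS code)."""
    ids = count(1)
    rel = {name: set() for name in ("root", "el", "child", "desc", "tag", "text")}
    rel["root"].add(0)          # node 0 stands for the document root
    rel["desc"].add((0, 0))     # desc is reflexive

    def visit(node, parent, ancestors):
        n = next(ids)
        rel["el"].add(n)
        rel["tag"].add((n, node.tag))
        rel["child"].add((parent, n))
        rel["desc"].add((n, n))
        for a in ancestors:
            rel["desc"].add((a, n))
        if node.text and node.text.strip():
            rel["text"].add((n, node.text.strip()))
        for c in node:
            visit(c, n, ancestors + [n])

    visit(ET.fromstring(xml_text), 0, [0])
    return rel


rel = encode(
    "<drug><name>aspirin</name><price>$4</price>"
    "<notes><side-effects>upset stomach</side-effects>"
    "<maker>Bayer</maker></notes></drug>"
)
print(sorted(rel["tag"]))
```

Node ids are assigned in document order, so in this example drug is node 1 and maker is node 6.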

Page 14

XQuery Semantics

XML data model is a tagged tree

<drug>
  <name>aspirin</name>
  <price>$4</price>
  <notes>
    <side-effects>upset stomach</side-effects>
    <maker>Bayer</maker>
  </notes>
</drug>

[Figure: the document drawn as a tree, with “$d” marking the drug node and “$m” marking the maker node.]

XQueries compute in two stages:

variable binding stage: navigation in the XML tree binds variables to nodes, text, tags, etc.

tagging stage: output of new XML, by filling in variable bindings into a tagging template

for $d in document/drug, $m in $d//maker          (variable binding stage)

return <producedBy>$m/text()</producedBy>          (tagging stage)

Page 15

Compiling the Binding Part of XQueries to Relational Queries

XBind query = binding part of XQuery (returns a relation: tuples of variable bindings)

Example:

for $d in document(“drugs.xml”)/drug, $m in $d//maker return “$d” “$m”

compiles to a relational “conjunctive” query over child(x,y), tag(x,t), desc(x,y), Root(r), etc.:

P($d,$m) :- Root(r), child(r,$d), tag($d,“drug”),
            desc($d,x), child(x,$m), tag($m,“maker”)

But not all models of this schema correspond to the intended model; need GReX!
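
The compiled conjunctive query can be evaluated directly over a relational encoding of the drug document. The sketch below is illustrative only: the integer node ids are hypothetical, the relations are written out by hand, and desc is computed here as the reflexive-transitive closure of child, which is the intended model that GReX can only approximate.

```python
from itertools import product

# Hand-written encoding of the drug document; node 0 is the document root.
tag = {(1, "drug"), (2, "name"), (3, "price"), (4, "notes"),
       (5, "side-effects"), (6, "maker")}
child = {(0, 1), (1, 2), (1, 3), (1, 4), (4, 5), (4, 6)}
root = {0}
nodes = {0} | {n for n, _ in tag}


def reflexive_transitive_closure(edges, nodes):
    """desc in the intended model: reflexive-transitive closure of child."""
    desc = {(n, n) for n in nodes} | set(edges)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so that desc can grow inside the loop.
        for (a, b), (c, d) in product(list(desc), repeat=2):
            if b == c and (a, d) not in desc:
                desc.add((a, d))
                changed = True
    return desc


desc = reflexive_transitive_closure(child, nodes)


def xbind():
    """P($d,$m) :- Root(r), child(r,$d), tag($d,"drug"),
                   desc($d,x), child(x,$m), tag($m,"maker")"""
    answers = set()
    for r in root:
        for (r2, d) in child:
            if r2 != r or (d, "drug") not in tag:
                continue
            for (d2, x) in desc:
                if d2 != d:
                    continue
                for (x2, m) in child:
                    if x2 == x and (m, "maker") in tag:
                        answers.add((d, m))
    return answers


print(xbind())
```

The only binding is the pair (drug node, maker node), reached through x = notes.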

Page 16

Sample Constraints from GReX

• Relationship between child and descendant navigation:

$\forall x \forall y\, [\, child(x,y) \rightarrow desc(x,y) \,]$    desc contains child

$\forall x\, [\, el(x) \rightarrow desc(x,x) \,]$    desc is reflexive

$\forall x \forall y \forall z\, [\, desc(x,y) \wedge desc(y,z) \rightarrow desc(x,z) \,]$    desc is transitive

• Tagged tree structure of XML:

$\forall r \forall x\, [\, root(r) \wedge desc(x,r) \rightarrow x = r \,]$    root has no ancestors

$\forall x \forall y \forall z\, [\, child(x,z) \wedge child(y,z) \rightarrow x = y \,]$    at most one parent

These do not capture transitive closure completely, nor is it possible to do it in first-order logic; STILL...

Page 17

More Constraints from GReX

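(someTag) $\forall x\, [\, el(x) \rightarrow \exists t\; tag(x,t) \,]$    every element has a tag

(oneTag) $\forall x \forall t_1 \forall t_2\, [\, tag(x,t_1) \wedge tag(x,t_2) \rightarrow t_1 = t_2 \,]$    one tag per element

(noLoop) $\forall x \forall y\, [\, desc(x,y) \wedge desc(y,x) \rightarrow x = y \,]$    no non-trivial cycles

(noShare) $\forall x \forall y \forall u \forall v\, [\, child(x,u) \wedge child(x,v) \wedge desc(u,y) \wedge desc(v,y) \rightarrow u = v \,]$    unique path between elements

(inLine) $\forall x \forall y \forall u\, [\, desc(x,u) \wedge desc(y,u) \rightarrow x = y \vee desc(x,y) \vee desc(y,x) \,]$    ancestors of an element are collinear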
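
A few of these constraints can be checked mechanically on a relational instance, which shows why they are needed: an instance of the schema child, desc, tag need not be a tree. The Python sketch below tests only some of the GReX constraints; the function name violated and the labels base, refl, and trans are hypothetical, while topRoot, oneParent, noLoop, someTag, and oneTag follow the names used in the slides.

```python
def violated(child, desc, tag, el, root):
    """Return the names of the checked constraints that fail on the instance."""
    bad = set()
    if any((x, y) not in desc for (x, y) in child):
        bad.add("base")        # desc contains child
    if any((x, x) not in desc for x in el):
        bad.add("refl")        # desc is reflexive
    if any((x, z) not in desc
           for (x, y) in desc for (y2, z) in desc if y == y2):
        bad.add("trans")       # desc is transitive
    if any(x != r for r in root for (x, r2) in desc if r2 == r):
        bad.add("topRoot")     # root has no ancestors
    if any(x != y for (x, z) in child for (y, z2) in child if z == z2):
        bad.add("oneParent")   # at most one parent
    if any(x != y and (y, x) in desc for (x, y) in desc):
        bad.add("noLoop")      # no non-trivial cycles
    if any(all(n != x for (n, _) in tag) for x in el):
        bad.add("someTag")     # every element has a tag
    if any(t1 != t2 for (x, t1) in tag for (y, t2) in tag if x == y):
        bad.add("oneTag")      # one tag per element
    return bad


el = {1, 2, 3}
root = {1}
tag = {(1, "a"), (2, "b"), (3, "c")}
child = {(1, 2), (1, 3)}
desc = {(1, 1), (2, 2), (3, 3), (1, 2), (1, 3)}
tree_ok = violated(child, desc, tag, el, root)

# Same instance, but node 2 now also hangs under node 3: not a tree.
child_dag = child | {(3, 2)}
desc_dag = desc | {(3, 2)}
dag_bad = violated(child_dag, desc_dag, tag, el, root)
print(tree_ok, dag_bad)
```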

Page 18

Which Reformulations Do We Find This Way?

[Architecture diagram, as before: the client XQuery compiles to relational queries; the schema correspondence (mappings as XQueries) and the XML integrity constraints compile to relational constraints; GReX adds built-in constraints that capture the XML data model; C&B produces the reformulated queries (multiple solutions).]

all of them?

Page 19

Restrictions on XQuery

Main restriction: no aggregates (to be investigated)

Leaving out aggregates, most common queries can be processed.

Minor restrictions:

no user-defined functions (of course!)

limited use of negation (or else the problem becomes undecidable)

limited use of document order (to be investigated)

no navigation to parent or wildcard child (of unspecified tag) (unintuitive, but we can show that this needs another algorithm, unless NP = $\Pi_2^p$)

Page 20

The Reduction is Sound and Complete

For the restricted XQuery fragment,

Given:

- XBind query B compiled to a relational query c(B)

- schema correspondence C given by XQueries compiled to set of constraints c(C)

Relative Completeness Theorem:

R is a minimal reformulation of B under C

iff c(R) is a minimal reformulation of c(B) under c(C) and GReX

All of them are found by C&B.

R can be computed from c(R)

Page 21

A Glimpse at the Chase: Transforming Queries Using Constraints

A query: ‘find data satisfying condition “A”’

Q: $A$

A constraint: ‘whenever the data satisfies condition “A”, it also satisfies “B”’

$A \rightarrow B$

A chase step:

Q: $A$    becomes, using $A \rightarrow B$,    Q1: $A \wedge B$

The chase: repeatedly applying chase steps until no new conditions can be added.

In general, Q and Q1 are not equivalent, but in all DBs satisfying the constraint, they are!

Theory of the chase: 20 years old, deep and rich, due to Beeri, Maier, Mendelson, Sagiv, Vardi, Yannakakis and others!
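
At the level of detail of this slide, where a query is just a set of named conditions, the chase is a fixpoint computation. The Python sketch below is a deliberately propositional simplification (no variables, no existential quantifiers), so it shows the idea rather than the real algorithm; the function name chase and the pair representation of constraints are assumptions.

```python
def chase(conditions, constraints):
    """Apply chase steps until no new conditions can be added.

    conditions:  set of condition names, read as their conjunction
    constraints: list of (premise, conclusion) pairs of sets, read as
                 'premise implies conclusion'
    """
    result = set(conditions)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in constraints:
            # A chase step applies when the premise holds and adds something new.
            if premise <= result and not conclusion <= result:
                result |= conclusion
                changed = True
    return result


Q = {"A"}
constraints = [({"A"}, {"B"})]
Q1 = chase(Q, constraints)
print(sorted(Q1))  # ['A', 'B']
```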

Page 22

How Do We Use the Chase? Capturing Relational Views With Constraints

Let the schema correspondence be the view:

‘retrieve the data satisfying conditions “A” and “B”’

V: $A \wedge B$

Capture the definition with constraints (first-order logic statements), where $V$ stands for the condition “data appears in result of V”:

$A \wedge B \rightarrow V$    all data satisfying “A” and “B” “appears in result of V”

$V \rightarrow A \wedge B$    all data “appearing in V” satisfies “A” and “B”

Page 23

Chase & Backchase

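First chase:

Q: $A$    becomes, using $A \rightarrow B$,    Q1: $A \wedge B$

Q1: $A \wedge B$    becomes, using $A \wedge B \rightarrow V$,    Q2: $A \wedge B \wedge V$

Next inspect all subqueries (“syntactic pieces”) of the chase result Q2, for instance

SQ: $V$

The equivalence is checked again using the chase (backwards):

SQ: $V$    becomes, using $V \rightarrow A \wedge B$,    Q2: $A \wedge B \wedge V$

It turns out that SQ is equivalent to Q

Presence of constraint $A \rightarrow B$ allows reformulation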

Page 24

General C&B Algorithm (joint work with Lucian Popa, IBM Almaden)

(public) schema P, (proprietary) schema S

Let C be a set of constraints (e.g., on P and/or P & S).

Chase: chase Q(P) with C to obtain the universal plan U(P + S). Assume some terminating chasing sequence.

Backchase: inspect the subqueries of U.

solutions X(S) = subqueries of U, posed against S, equivalent to Q

Completeness Theorem [Deutsch&T.]: Any scan-minimal reformulation of Q under C is a subquery of U
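
The two phases can be put together in a small Python sketch. As in the "glimpse" slides, it treats a query as a set of named conditions, so it is a propositional simplification and not the actual C&B implementation; the names chase, chase_and_backchase, and storage are assumptions. In this simplified setting a subquery of the universal plan is equivalent to Q exactly when chasing it brings back every condition of Q.

```python
from itertools import combinations


def chase(conditions, constraints):
    """Close a set of conditions under (premise, conclusion) constraints."""
    result = set(conditions)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in constraints:
            if premise <= result and not conclusion <= result:
                result |= conclusion
                changed = True
    return result


def chase_and_backchase(query, constraints, storage):
    """Return the universal plan and the minimal reformulations over storage."""
    universal_plan = chase(query, constraints)           # chase phase
    candidates = sorted(universal_plan & storage)        # conditions posed against S
    solutions = []
    for size in range(1, len(candidates) + 1):           # backchase, bottom-up
        for sub in combinations(candidates, size):
            sub = frozenset(sub)
            if any(s <= sub for s in solutions):
                continue                                  # a smaller solution exists
            if query <= chase(sub, constraints):          # equivalence check
                solutions.append(sub)
    return universal_plan, solutions


constraints = [
    ({"A"}, {"B"}),          # integrity constraint A -> B
    ({"A", "B"}, {"V"}),     # view V: A and B -> V
    ({"V"}, {"A", "B"}),     # view V: V -> A and B
]
plan, solutions = chase_and_backchase({"A"}, constraints, storage={"V"})
print(sorted(plan), solutions)
```

Dropping the constraint A -> B from the list leaves the universal plan equal to Q itself, and no reformulation over the storage is found.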

Page 25

Two Sets of Experiments

• Synthetic queries: reformulation time as a function of query “complexity”

XML analog of relational “star” queries, increasing number of joins

can very complex queries still be reformulated in a practical amount of time ?

• “Realistic” queries from the XML Benchmark Project [http://monetdb.cwi.nl/xml]

The Queries: 20 queries designed to exercise interesting features of XQuery

The schema correspondence: views in both directions compile to about 200 constraints!

Much more than in typical relational schemas!

Page 26

Experiments with Synthetic Queries

[Plot omitted; horizontal axis: number of joins (number of corners in the star).]

Page 27

Experiments with Benchmark Queries

Reformulation times must be understood in conjunction with execution times (e.g., tens of seconds for Q10).

Page 28

Summary of Contributions

MARS, a system for XQuery reformulation,

- with mixed and redundant storage, under integrity constraints.

- complex schema correspondence (views in both directions)

Showed practical relevance of C&B method (feasible and worthwhile)

A completeness result for a significant fragment of XQuery and a large

class of schema correspondences. The method remains sound for the full language.

A reduction between minimal reformulation and query equivalence, and

we gave matching lower bounds showing our chase-based decision procedure is

asymptotically optimal for the fragment considered.


Page 30

Why XML?

The relational data model is still the dominant concept in databases.

All data can be coded into tables. (For that matter, into Gödel numbers too!)

Artificial coding makes life harder for query programmers. Result: less productivity, more bugs.

XML is much more flexible. It is also “self-describing”, i.e., there is no a priori need for types/schemas (but this is sometimes a bad idea).

It came from the document community (tagged text) and was cheered by industry gurus. So we have to live with it. (Although one can imagine better data models…)

Page 31

Making It Work

Chase: each chase step is similar to evaluation of a recursive Datalog rule on a symbolic database built from the query.

We borrowed classical query processing techniques:

• compiling constraints to join tree

• joins implemented as hash-joins

• pushing selections into joins

Backchase: size of search space is O(2^u), u = size of universal plan. We found criteria for pruning this space.

typical size reduction: from 2^100 to 300

1. Cost-independent: prune subqueries that

- do not correspond to legal XML queries (legal queries perform contiguous navigation steps starting from the root)

- contain redundant descendant navigation steps (e.g. x child-of y, y child-of z, x descendant-of z)

2. A cost-based pruning strategy parameterized by costing model

bottom-up exploration of subqueries: first all performing 1 navigation step, next all performing 2 navigation steps, etc.

- finds optimal reformulation for any monotonic cost model

- cost models for XML are still under research

- heuristic cost model: cost is number of table scans/XML navigation steps performed

- amenable to experimenting with other cost models

Page 32

Benefit of Reformulation For Execution Time

Benefit increases with increasing complexity of query and increasing database size.

saved time = original query execution - time to reformulate - execution of reformulation

[Plot omitted: saved time (s), on a scale from -100 to 600, versus number of major joins per query (3 to 7), with one series per document size (no. of elements in document: 60, 80, 90, 100, 150, 200).]

Page 33

More Results for Benchmark Queries

For redundancy: materialized the XBind query for each query (particular case of Access Support Relation)

[Plot omitted: reformulation times (with redundancy and optimization), time (s) on a scale from 0 to 5, for queries Q1 to Q20; each bar is split into time to first reformulation, delta to best reformulation, and delta to finish search.]

Time to find first reformulation is essentially the same as in the absence of redundancy.

Additional time spent only for finding optimal one.

Page 34

Related Work: Data Integration As Particular Case of MARS Applications

[Three diagrams, each relating P (global schema) and S (local schema) through a correspondence CR, with query Q on P and reformulation X on S.]

Global As View (GAV)

X = Q o CR

reformulation by composition-with-views

TSIMMIS, SilkRoute, XPeranto

Local As View (LAV)

Q = X o CR

rewriting-with-views

Information Manifold, STORED, Agora

QCR X = Q

MARS

combined effect of rewriting + composition

[with Fernandez and Suciu in SIGMOD’99]

Page 35

Future Work Directions

• Short-Term:

- tuning of C&B implementation for further speedup

- XML-specific strategies for pruning the backchase stage

- in particular, finding a good cost model to perform cost-based pruning

• Medium-Term:

- Applying C&B to Data Security

- Applications to Adaptive Distributed Query Optimization

• Long Term:

- a unified framework for integrating data from various heterogeneous sources, going beyond classical databases (XML/relational/LDAP + web forms + web services)

Page 36

Application 3: Schema Evolution (e.g. Caching)

[Diagram: the client poses the old query Q(O) against the old schema O; a schema correspondence relates O to the new schema N, which could be O extended with cached results; the reformulated query X(N) runs against N.]

Find X(N) returning same answer as Q(O)

Goal: support existing client applications even after changing the schema

Page 37

A Source of Redundancy: Relational Storage of XML

Public data (XML, highly unstructured): a catalog with two drug elements, each with name, price, and notes children: (“aspirin”, “$4”) and (“cortisone”, “$50”).

Redundant storage: relational view (lossy)

Drugs
name         price
aspirin      $4
cortisone    $50

Page 38

Containment Under Integrity Constraints

Decision procedure for containment is based on chasing with constraints from GReX.

Natural extension to XML integrity constraints.

Some results:

• Containment of well-behaved XPath/XBind queries under bounded simple XML integrity constraints (SXICs) is decidable (used in relative completeness theorem).

• Even modest use of unboundedness makes the problem undecidable.

• Corollary: containment under bounded SXICs and DTDs is undecidable.

• Containment under DTDs only is an open problem, but we have a PSPACE lower bound.

See proposal for details.

Page 39

LDAP


Page 41

The Architecture of Our Solution

[Architecture diagram: the client XQuery is split into XBind queries and a tagging template (callouts: “defined next”, “not shown here”). The XBind queries compile to relational queries. The schema correspondence (mappings as XQueries, together with rel/XML encodings) and the XML integrity constraints compile to relational constraints. GReX adds the built-in XML data model constraints. C&B takes the relational queries and constraints and produces reformulated queries (multiple solutions).]

GReX: Generic Relational encoding of XML, used internally to partially capture the intended model

Page 42

Problem:

• XML/MARS XQuery Reformulation

• schema correspondence given by views in both directions

• multiple solutions

Tool: Algorithm for reformulation

of relational queries under relational constraints

Chase & Backchase (C&B)

introduced in [VLDB’99 with L. Popa and V. Tannen]

evaluated in [SIGMOD’00 with L. Popa, A. Sahuguet and V. Tannen]

Page 43

Capturing Relational Views With Constraints

Let the schema correspondence be a view defined as the relational conjunctive query

V(x,z) :- A(x,y), B(y,z)

Capture the definition with constraints:

(cV) $\forall x \forall y \forall z\, [\, A(x,y) \wedge B(y,z) \rightarrow V(x,z) \,]$    result of query defining the view is included in V

(bV) $\forall x \forall z\, [\, V(x,z) \rightarrow \exists y\; A(x,y) \wedge B(y,z) \,]$    V is included in result of query defining view
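
The roles of the two constraints can be seen on a tiny instance: together they hold exactly when V contains the result of the defining query and nothing else. The Python sketch below is only an illustration; the sample tuples and the function names define_view, holds_cV, and holds_bV are made up for the example.

```python
A = {(1, 10), (2, 20)}
B = {(10, "x"), (20, "y"), (30, "z")}


def define_view(A, B):
    """V(x,z) :- A(x,y), B(y,z)"""
    return {(x, z) for (x, y) in A for (y2, z) in B if y == y2}


def holds_cV(A, B, V):
    """(cV): every join result of A and B appears in V."""
    return all((x, z) in V for (x, y) in A for (y2, z) in B if y == y2)


def holds_bV(A, B, V):
    """(bV): every tuple of V is witnessed by some y with A(x,y) and B(y,z)."""
    ys = {y for (_, y) in A}
    return all(any((x, y) in A and (y, z) in B for y in ys) for (x, z) in V)


V = define_view(A, B)
too_big = V | {(3, "z")}       # extra tuple: violates (bV) only
too_small = V - {(1, "x")}     # missing tuple: violates (cV) only
print(sorted(V))
```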

Page 44

Partially capturing the XML model

Partially, because some features cannot fully be captured with constraints:

• descendant is the transitive closure of child, but this is not FO-definable

• neither is the “treeness” property

our solution:

add a set of constraints GReX to approximate intended models

it turns out that capturing descendant helps in capturing treeness

then, we define a significant XQuery fragment (we call it well-behaved)

that cannot distinguish between intended and approximate models

Page 45

Constraints in GReX (2): the tagged tree structure of XML

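(topRoot) $\forall r \forall x\, [\, root(r) \wedge desc(x,r) \rightarrow x = r \,]$    root has no ancestors

(oneTag) $\forall x \forall t_1 \forall t_2\, [\, tag(x,t_1) \wedge tag(x,t_2) \rightarrow t_1 = t_2 \,]$    one tag per element

(noLoop) $\forall x \forall y\, [\, desc(x,y) \wedge desc(y,x) \rightarrow x = y \,]$    no non-trivial cycles

(oneParent) $\forall x \forall y \forall z\, [\, child(x,z) \wedge child(y,z) \rightarrow x = y \,]$    at most one parent

(noShare) $\forall x \forall y \forall u \forall v\, [\, child(x,u) \wedge child(x,v) \wedge desc(u,y) \wedge desc(v,y) \rightarrow u = v \,]$    unique path between elements

(inLine) $\forall x \forall y \forall u\, [\, desc(x,u) \wedge desc(y,u) \rightarrow x = y \vee desc(x,y) \vee desc(y,x) \,]$    ancestors of an element are collinear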

Page 46

XQuery Restrictions

What it allows:

composition of navigation steps,

navigation axes: self, (named)child, descendant, ancestor, idrefs

qualifiers: path, string path, “and”, “or”, path equality/inequality

where clause: disjunction, path equality/inequality,

existential quantification

What it rules out:

user-defined functions,

range, before predicates,

aggregates, arbitrary negation, universal quantification,

concatenation (,)

navigation to parent (..) or to child of unspecified name (*)

Page 47

C&B Completeness

Let C be a set of constraints (relates public schema P and proprietary schema S)

• C-minimal query:

removing any of its relational atoms produces a non-equivalent query under C

• Q1 is a subquery of Q2:

Q1 is isomorphic to a “piece” of Q2

Completeness Theorem: Any C-minimal reformulation of Q is a subquery of U

[Diagram: chase Q(P) to obtain the universal plan U(P + S); backchase over the subqueries of U.]

solutions X(S) = subqueries of U, posed against S, equivalent to Q

Page 48

A Completeness Result for Our Solution

Given:

- well-behaved XBind query B

compiled to a relational query c(B)

- schema correspondence M given by well-behaved XQueries (in both directions),

compiled to set of relational constraints c(M)

- bounded XML integrity constraints XIC (a class of XML integrity constraints, see [KRDB’01]),

compiled to set of relational constraints c(XIC)

Relative Completeness Theorem: for any R,

R is a (M+XIC)-minimal reformulation of B

iff

c(R) is a (GReX $\cup$ c(M) $\cup$ c(XIC))-minimal reformulation of c(B)

All of them are found by C&B. R can be computed from c(R).

Corollary: completeness of reformulation algorithm for XBind queries

Page 49

Capturing XML Semantics

[Architecture diagram, as before: the client XQuery compiles to relational queries; the schema correspondence (mappings as XQueries) and the XML integrity constraints compile to relational constraints; GReX adds built-in constraints that capture the XML data model; C&B produces the reformulated queries (multiple solutions).]

Page 50

Summary of Constraints Used in C&B Phase

• Built-in constraints in GReX

• Relational views compile to inclusion constraints

• XQuery views

– their XBind queries compile to inclusion constraints as for relational views

– their return clause compiles to several decorrelated queries, each captured with constraints

– the XML template in the return clause compiles to several Skolem and copy functions, each compiled to constraints

• Integrity constraints

– XML constraints compile to relational constraints

– relational schema constraints

Page 51

Are the Restrictions Justified?

Our completeness result holds for well-behaved XQueries, under bounded

XML integrity constraints.

What about reformulating

• XQueries with parent and wildcard child navigation?

• Under other XML integrity constraints?

• Even under full-fledged DTDs?

For such extensions, we make a deeper study of equivalence, a simpler problem that reformulation relies on:

the equivalence checker is invoked as a black-box algorithm during C&B.

Page 52

XBind (includes XPath) fragments and the complexity of equivalence:

path concatenation, attribute values; navigation axes: self, (named) child, descendant; qualifiers: path, string path, “and”: PTIME

+ join on attribute variables: NP-complete

+ any or all (!) of the following: disjunction, ancestor navigation, path equality, wildcard child (*) navigation: $\Pi_2^p$-complete

+ parent, preceding(following)-sibling: in $\Pi_2^p$

[Side labels in the original figure mark which rows form the “simple” and the “well-behaved” fragments.]

Page 53

Containment for the “well-behaved” fragment of XBind/XPath

Theorem

Let B1, B2 be XBind/XPath queries from our “well-behaved” fragment, and c(B1), c(B2) their relational compilation.

B1 is equivalent to B2 iff c(B1) is equivalent to c(B2) under GReX (decidable in $\Pi_2^p$ using chase).

This result about containment is used in the relative completeness theorem

Page 54

Extensions of the “NP” fragment: $\Pi_2^p$ fragments

any or all (!) of the following make equivalence $\Pi_2^p$-complete:

• disjunction

unsurprising: conjunctive queries+union already $\Pi_2^p$-complete [SY’80]

• ancestor navigation

translate ancestor away introducing union: /a/b/ancestor /[a/b] /a[b]

• path equality qualifier

can simulate ancestor: //.[.//.==/p]/s /p/ancestor/s

• wildcard child navigation

union introduced by interaction //: //a /a ///a

Not well-behaved, but we have a different decision procedure

Page 55: NTUA April 17, 2003 1 XML Query Reformulation Val Tannen University of Pennsylvania Joint work with Alin Deutsch, UC San Diego and in part with Lucian.

Experimental Setup: Started From the XML Benchmark

Used the official XML Benchmark Project [http://monetdb.cwi.nl/xml]

The application domain: an online auctioning application.

The published schema: a DTD given by the XML Benchmark Project

Data is partially nicely structured.

The Queries: 20 queries designed to exercise interesting features of XQuery

Page 56: NTUA April 17, 2003 1 XML Query Reformulation Val Tannen University of Pennsylvania Joint work with Alin Deutsch, UC San Diego and in part with Lucian.

What We Added to the XML Benchmark Setup

The mixed storage schema:

relationally: person, item, open auction, closed auction, etc.

unstructured part: annotations on items

The redundancy:

materialized the XBind query for each query

(particular case of Access Support Relation)

The mappings:

in both directions: relations → XML, XML → XML

It all compiles to about 200 constraints!

Much more than in typical relational schemas! Had to change original implementation [SIGMOD’00] to scale.

Page 57: NTUA April 17, 2003 1 XML Query Reformulation Val Tannen University of Pennsylvania Joint work with Alin Deutsch, UC San Diego and in part with Lucian.

Related Work

Publishing systems

Schema mapping proprietary relational → published XML: SilkRoute, Xperanto

reformulation by composition-with-views

Schema mapping published XML → proprietary relational: STORED, Agora

reformulation by rewriting-with-views

Information Integration

TSIMMIS (composition-w-views), Information Manifold (rewriting-w-views)

Containment

Miklau and Suciu, smaller fragment of XPath (they too find that * is “naughty”)

[FLS, CGLV] - conjunctive regular path queries

Amer-Yahia and Srivastava - minimization of tree pattern queries

Containment under integrity constraints: XML keys [BDFHT]; description logics [CGL]

Page 58: NTUA April 17, 2003 1 XML Query Reformulation Val Tannen University of Pennsylvania Joint work with Alin Deutsch, UC San Diego and in part with Lucian.

Query Reformulation in Data Publishing

public schema P (virtual data)

proprietary storage schema S (materialized data)

publishing query (may hide some proprietary data)

client query Q(P) (not directly executable)

partner/client

? reformulated query X(S)

Find X(S) returning same answer as Q(P)

schema = interface against which queries are formulated

Page 59: NTUA April 17, 2003 1 XML Query Reformulation Val Tannen University of Pennsylvania Joint work with Alin Deutsch, UC San Diego and in part with Lucian.

Compiling the Binding Part of XQueries to Relational Queries

XBind query = binding stage of XQuery (returns a relation: tuples of variable bindings)

Navigation in XQueries → relational join of tables child, tag, etc.

Relational query over child(x,y), tag(x,t), desc(x,y), Root(r), etc.

But, over arbitrary DBs with this schema, the relational translation of Root desc desc is not equivalent to that of Root desc:

we must communicate to the C&B that the desc table is transitive
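A minimal sketch of why the closure properties of desc have to be stated as constraints, written in Python for illustration only. The toy tables, node names, and helper functions are assumptions and not part of MARS. The sketch also assumes that desc means descendant-or-self (reflexive as well as transitive): with that closure the two translations agree, and over an arbitrary desc table they do not.

```python
def root_desc(root, desc):
    """Translation of 'Root desc': nodes reached from a root by one desc step."""
    return {y for r in root for (x, y) in desc if x == r}

def root_desc_desc(root, desc):
    """Translation of 'Root desc desc': two desc steps joined on the middle node."""
    return {z for y in root_desc(root, desc) for (y2, z) in desc if y2 == y}

def reflexive_transitive_closure(nodes, edges):
    """Smallest reflexive and transitive relation containing the given edges."""
    closure = {(n, n) for n in nodes} | set(edges)
    while True:
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c} - closure
        if not new:
            return closure
        closure |= new

# Hypothetical three-node document: n0 is the root, n1 its child, n2 its grandchild.
nodes = {"n0", "n1", "n2"}
root = {"n0"}
arbitrary_desc = {("n0", "n1"), ("n1", "n2")}   # an arbitrary table, not closed
closed_desc = reflexive_transitive_closure(nodes, arbitrary_desc)

print(root_desc(root, arbitrary_desc) == root_desc_desc(root, arbitrary_desc))  # False
print(root_desc(root, closed_desc) == root_desc_desc(root, closed_desc))        # True
```

A relational optimizer sees only the tables, so it cannot tell the two cases apart unless the closure properties are given to it as dependencies.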

Page 60: NTUA April 17, 2003 1 XML Query Reformulation Val Tannen University of Pennsylvania Joint work with Alin Deutsch, UC San Diego and in part with Lucian.

The Challenge for “Reformulation on MARS”

To find the reformulations efficiently, we need to

• reason with schema correspondence

• efficiently construct the search space for reformulations

- must contain all reformulations (for completeness)

• explore search space

- exhaustively (for security applications)

- maybe trading optimality of reformulation for search speed

(for optimization purposes)

Page 61: NTUA April 17, 2003 1 XML Query Reformulation Val Tannen University of Pennsylvania Joint work with Alin Deutsch, UC San Diego and in part with Lucian.

Contributions

• A novel algorithm for reformulation of relational queries under relational constraints

– Chase & Backchase

[VLDB’99 with Popa and Tannen][SIGMOD’00 with Popa, Sahuguet and Tannen]

• MARS: a system for XQuery reformulation over Mixed And Redundant Storage

– constructs and represents search space efficiently

– cost-based exploration strategy parameterized by traditional costing module

– finds first reformulation fast

• Experimental evaluation: time to first reformulation, simple cost

• A declarative semantics for most of XQuery

• A reformulation algorithm for XQuery

– uses this semantics and exploits C&B

– practical (feasible and worthwhile)

– complete for “most” of XQuery

– optimal (we show lower bounds for various XQuery fragments: KRDB’01, DBPL’01)

Page 62: NTUA April 17, 2003 1 XML Query Reformulation Val Tannen University of Pennsylvania Joint work with Alin Deutsch, UC San Diego and in part with Lucian.

Compiling Client XQueries

[Diagram: compilation pipeline]

client XQuery → (compilation) → relational queries

Mappings as XQueries (schema correspondence) and XML integrity constraints → (compilation) → relational constraints

GReX built-in constraints capture the XML data model

relational queries + relational constraints → C&B → reformulated queries (multiple solutions)

Page 63: NTUA April 17, 2003 1 XML Query Reformulation Val Tannen University of Pennsylvania Joint work with Alin Deutsch, UC San Diego and in part with Lucian.

Capturing the Schema Correspondence

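[Diagram: compilation pipeline, highlighting the schema correspondence]

client XQuery → (compilation) → relational queries

Mappings as XQueries (schema correspondence) and XML integrity constraints → (compilation) → relational constraints

GReX built-in constraints capture the XML data model

relational queries + relational constraints → C&B → reformulated queries (multiple solutions)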

Page 64: NTUA April 17, 2003 1 XML Query Reformulation Val Tannen University of Pennsylvania Joint work with Alin Deutsch, UC San Diego and in part with Lucian.

Major Obstacles in Compiling Schema Mappings to Constraints

Schema correspondence given by XQueries. As opposed to relational queries,

• XQueries have nested, correlated subqueries in return clause

• XQueries create new elements

• XQueries return deep, recursive copies of input XML trees

(solution not shown)

Page 65: NTUA April 17, 2003 1 XML Query Reformulation Val Tannen University of Pennsylvania Joint work with Alin Deutsch, UC San Diego and in part with Lucian.

Compiling Nested Subqueries: Decorrelation

the query

for $p in doc(“foo.xml”)//person

return <res>$p/phone/text()</res>

is short for the nested query

for $p in doc(“foo.xml”)//person

return <res>for $t in $p/phone/text()

return $t

</res>

compile XBind parts to two decorrelated relational queries (shown here in Datalog syntax):

Bouter(p) ← Root(r), desc(r,x), child(x,p), tag(p,“person”)

Binner(p,t) ← Bouter(p), child(p,n), tag(n,“phone”), text(n,t)

capture each with two inclusion constraints, as done in the original C&B method
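To make the two decorrelated queries concrete, here is a small Python sketch that evaluates Bouter and Binner over a hand-built relational encoding. The document content, node ids, and the helper descendant_or_self are all made up for illustration, and desc is assumed to be descendant-or-self. Only the two Datalog rules come from the slide.

```python
# Toy relational encoding of a hypothetical foo.xml (all node ids are invented).
Root = {"r"}
child = {("r", "people"), ("people", "p1"), ("people", "p2"),
         ("p1", "n1"), ("p2", "n2"), ("p2", "n3"), ("p2", "n4")}
tag = {("people", "people"), ("p1", "person"), ("p2", "person"),
       ("n1", "phone"), ("n2", "phone"), ("n3", "phone"), ("n4", "email")}
text = {("n1", "555-0100"), ("n2", "555-0101"), ("n3", "555-0102"),
        ("n4", "x@example.org")}

def descendant_or_self(nodes, child):
    """desc as the reflexive-transitive closure of child (an assumption)."""
    desc = {(n, n) for n in nodes} | set(child)
    while True:
        new = {(a, d) for (a, b) in desc for (c, d) in child if b == c} - desc
        if not new:
            return desc
        desc |= new

nodes = Root | {n for edge in child for n in edge}
desc = descendant_or_self(nodes, child)

# Bouter(p) <- Root(r), desc(r,x), child(x,p), tag(p,"person")
Bouter = {p for r in Root
            for (r2, x) in desc if r2 == r
            for (x2, p) in child if x2 == x and (p, "person") in tag}

# Binner(p,t) <- Bouter(p), child(p,n), tag(n,"phone"), text(n,t)
Binner = {(p, t) for p in Bouter
                 for (p2, n) in child if p2 == p and (n, "phone") in tag
                 for (n2, t) in text if n2 == n}

print(sorted(Bouter))
print(sorted(Binner))
```

Binner repeats the Bouter atom, which is what removes the correlation: the inner query no longer depends on a variable bound outside it.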

Page 66: NTUA April 17, 2003 1 XML Query Reformulation Val Tannen University of Pennsylvania Joint work with Alin Deutsch, UC San Diego and in part with Lucian.

Capturing Creation of New Elements

for $p in doc(“foo.xml”)//person

return <res>$p/phone/text()</res>

For each binding of $p, a distinct <res>-element is constructed.

F: set of bindings for $p (Bouter) → <res>-elements in result, an injective function

Capture F by the relation G representing its graph, and the constraints:

∀p ∀r1 ∀r2 [ G(p,r1) ∧ G(p,r2) → r1=r2 ] ( r = F(p) )

∀p1 ∀p2 ∀r [ G(p1,r) ∧ G(p2,r) → p1=p2 ] ( F is injective )

∀p ∀r [ G(p,r) → Bouter(p) ] ( F’s domain is included in Bouter )

∀p [ Bouter(p) → ∃r G(p,r) ] ( Bouter is included in F’s domain )

F is the Skolem function that validates the last constraint
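The four constraints on G can be checked directly on a finite instance. The following Python sketch is illustrative only: the function name, the sample bindings, and the invented element ids res_0, res_1 are assumptions. It tests each constraint as a separate condition.

```python
def check_graph_constraints(G, Bouter):
    """Evaluate the four constraints on G, given as a set of (p, r) pairs."""
    return {
        # G(p,r1) and G(p,r2) imply r1 = r2
        "functional": all(r1 == r2 for (p1, r1) in G for (p2, r2) in G if p1 == p2),
        # G(p1,r) and G(p2,r) imply p1 = p2
        "injective": all(p1 == p2 for (p1, r1) in G for (p2, r2) in G if r1 == r2),
        # G(p,r) implies Bouter(p)
        "domain_in_Bouter": all(p in Bouter for (p, _r) in G),
        # Bouter(p) implies that some r with G(p,r) exists
        "Bouter_in_domain": all(any(q == p for (q, _r) in G) for p in Bouter),
    }

Bouter = {"p1", "p2"}

# One fresh, distinct <res> id per binding: this plays the role of the Skolem function F.
G = {(p, f"res_{i}") for i, p in enumerate(sorted(Bouter))}

# A faulty graph in which two bindings share one <res> element.
bad_G = {("p1", "res_0"), ("p2", "res_0")}

print(check_graph_constraints(G, Bouter))
print(check_graph_constraints(bad_G, Bouter))
```

The second call fails only the injectivity check, which is the constraint that keeps distinct bindings from collapsing into one constructed element.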

Page 67: NTUA April 17, 2003 1 XML Query Reformulation Val Tannen University of Pennsylvania Joint work with Alin Deutsch, UC San Diego and in part with Lucian.

Stratified-Witness Constraints (with L.P.)

Full dependencies: no existential quantifier. The chase always terminates.

Beyond this? Given a set C of dependencies, define the chase flow graph:

Nodes correspond to relation components: an R of arity 3 produces 3 nodes.

Edges are drawn between the i’th component of R and the j’th component of S iff R appears on the left side and S appears on the right side of the implication of some dependency.

The edge is labeled ∃ if the corresponding variable in S is existentially quantified. C is stratified-witness if there is no cycle with an ∃-labeled edge.

Proposition

The chase with stratified-witness constraints always terminates.

Page 68: NTUA April 17, 2003 1 XML Query Reformulation Val Tannen University of Pennsylvania Joint work with Alin Deutsch, UC San Diego and in part with Lucian.

(Relational) Conjunctive Queries

Q(x,z) ← R(x,y,z) , R(y,x,u) , S(z,u)

select r1.A , s.A

from R r1 , R r2 , S s

where r1.A=r2.B and r1.B=r2.A and

r1.C=s.A and r2.C=s.B

notation: r stands for r1 , … , rn

queries: select O(r) from R r where C(r)
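As a quick check that the Datalog rule and the select-from-where form describe the same query, here is a Python sketch that evaluates it by nested iteration. The sample tuples are invented. Columns A, B, C of R and A, B of S are taken positionally, which is an assumption about the schema.

```python
def q(R, S):
    """Q(x,z) <- R(x,y,z), R(y,x,u), S(z,u), written as the join from the SQL form."""
    A, B, C = 0, 1, 2  # positional column names
    return {(r1[A], s[A])
            for r1 in R for r2 in R for s in S
            if r1[A] == r2[B] and r1[B] == r2[A]
            and r1[C] == s[A] and r2[C] == s[B]}

# Invented sample instance.
R = {(1, 2, 3), (2, 1, 4), (5, 6, 7)}
S = {(3, 4), (4, 3)}

print(sorted(q(R, S)))  # [(1, 3), (2, 4)]
```

Each repeated variable in the rule body becomes one equality in the where clause, and the head variables become the select list.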

Page 69: NTUA April 17, 2003 1 XML Query Reformulation Val Tannen University of Pennsylvania Joint work with Alin Deutsch, UC San Diego and in part with Lucian.

(Relational) Dependencies a.k.a. Integrity Constraints

(∀r∈R) [ B(r) → (∃s∈S) C(r,s) ]

B and C are conjunctions of equalities, as in where clause

example:

(∀r1∈R)(∀r2∈R) [ r1.E = r2.E →

(∃s∈R) s.D = r1.D ∧ s.E = r1.E ∧ s.F = r2.F ]

Page 70: NTUA April 17, 2003 1 XML Query Reformulation Val Tannen University of Pennsylvania Joint work with Alin Deutsch, UC San Diego and in part with Lucian.

Query Containment and Dependencies

Q1 = select O1(r1) from R1 r1 where C1(r1)

Q2 = select O2(r2) from R2 r2 where C2(r2)

define cont(Q1,Q2) as

(∀r1∈R1) [ C1(r1) →

(∃r2∈R2) C2(r2) ∧ O1(r1)=O2(r2) ]

we have, in each instance

Q1 ⊆ Q2 iff cont(Q1,Q2)

Page 71: NTUA April 17, 2003 1 XML Query Reformulation Val Tannen University of Pennsylvania Joint work with Alin Deutsch, UC San Diego and in part with Lucian.

And Vice Versa

d = (∀r∈R) [ B(r) → (∃s∈S) C(r,s) ]

front(d) = select r

from R r where B(r)

back(d) = select r

from R r , S s where B(r) and C(r,s)

we have, in each instance

d iff front(d) ⊆ back(d)

Page 72: NTUA April 17, 2003 1 XML Query Reformulation Val Tannen University of Pennsylvania Joint work with Alin Deutsch, UC San Diego and in part with Lucian.

Chase Step

d = (∀r∈R) [ B(r) → (∃s∈S) C(r,s) ]

Q: select O(r) from R r where B(r)

chases with d (written Q →d Q’) to

Q’: select O(r) from R r , S s where B(r) and C(r,s)

basic fact: Q →d Q’ implies Q =d Q’

the chase step is applicable if Q’ is not trivially equivalent to Q

(for example, we cannot chase Q’ with d ! )

Page 73: NTUA April 17, 2003 1 XML Query Reformulation Val Tannen University of Pennsylvania Joint work with Alin Deutsch, UC San Diego and in part with Lucian.

Using the Chase

basic fact: if chase step of Q with d is not applicable then Inst(Q) ⊨ d

( canonical instance Inst(Q) built from query Q )

Basic Theorem

D: set of dependencies

Q1 → . . . → chaseD(Q1): terminating chase sequence (no more applicable steps)

Then: Q1 ⊆D Q2 iff chaseD(Q1) ⊆ Q2
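The Basic Theorem suggests a simple procedure: chase Q1, then look for a containment mapping from Q2 into the result. The Python sketch below illustrates this on a toy representation and is not the MARS implementation. Queries are (head, body) pairs, atoms are tuples such as ("R", "x", "y"), every term is treated as a variable, and dependencies are (premise, conclusion) pairs of atom lists whose conclusion-only variables are existential. All of these encodings, and the step bound, are assumptions.

```python
from itertools import count

def homs(atoms, target, h=None):
    """Yield every variable mapping h such that h(atoms) is a subset of target."""
    h = dict(h or {})
    if not atoms:
        yield h
        return
    (rel, *args), rest = atoms[0], atoms[1:]
    for (trel, *targs) in target:
        if trel != rel or len(targs) != len(args):
            continue
        g = dict(h)
        if all(g.setdefault(a, t) == t for a, t in zip(args, targs)):
            yield from homs(rest, target, g)

def chase(body, deps, max_steps=100):
    """Apply chase steps until none is applicable (bounded, as a safeguard)."""
    body, fresh = list(body), count()
    for _ in range(max_steps):
        # A step applies when a premise match cannot be extended to the conclusion.
        step = next(((h, concl) for prem, concl in deps
                     for h in homs(prem, body)
                     if next(homs(concl, body, h), None) is None), None)
        if step is None:
            return body
        h, concl = dict(step[0]), step[1]
        for rel, *args in concl:
            # Existential variables receive fresh names of the form _n<k>.
            body.append((rel, *[h.setdefault(a, f"_n{next(fresh)}") for a in args]))
    raise RuntimeError("chase did not terminate within max_steps")

def contained(q1, q2, deps=()):
    """Q1 contained in Q2 under deps: map Q2 into chase(Q1), preserving the head."""
    (head1, body1), (head2, body2) = q1, q2
    start = dict(zip(head2, head1))
    return next(homs(body2, chase(body1, deps), start), None) is not None

# d: every R(x,y) has a matching S(y,z), with z existential.
d = ([("R", "x", "y")], [("S", "y", "z")])
Q1 = (("x",), [("R", "x", "y")])
Q2 = (("x",), [("R", "x", "y"), ("S", "y", "z")])

print(contained(Q1, Q2))        # False: no S atom to map onto
print(contained(Q1, Q2, [d]))   # True: the chase adds S(y, _n1)
```

The chased body plays the role of the canonical instance Inst(Q): once no step is applicable it satisfies every dependency, so an ordinary containment mapping into it decides containment under D.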

Page 74: NTUA April 17, 2003 1 XML Query Reformulation Val Tannen University of Pennsylvania Joint work with Alin Deutsch, UC San Diego and in part with Lucian.

Reformulation with Views

a view is just a query:

V = select O(r) from R r where C(r)

Reformulation of query Q(R) with view V :

finding X(R,V) such that Q(R) =V X(R,V)

Page 75: NTUA April 17, 2003 1 XML Query Reformulation Val Tannen University of Pennsylvania Joint work with Alin Deutsch, UC San Diego and in part with Lucian.

One View = Two Dependencies

V = select O(r) from R r where C(r)

the “chase-in” dependency:

cV = (∀r∈R) [ C(r) → (∃x∈V) x=O(r) ]

the “backchase” dependency:

bV = (∀x∈V) [ (∃r∈R) C(r) ∧ x=O(r) ]

It turns out that if rewritings of Q with V exist then such a rewriting can be obtained by chasing Q with cV

Page 76: NTUA April 17, 2003 1 XML Query Reformulation Val Tannen University of Pennsylvania Joint work with Alin Deutsch, UC San Diego and in part with Lucian.

The Chase and Backchase (C&B) Algorithm (joint work with Lucian Popa, IBM Almaden)

The chase with cV always terminates.

The search space for rewritings of Q with V consists of the subqueries of chasecV(Q).

( S is a subquery: there is an injective homomorphism from S to chasecV(Q) )

Keep only subqueries S such that S ≡V chasecV(Q)

This can be checked by (back!)chasing with cV, bV (also terminating)
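A toy end-to-end run of the chase and backchase idea, sketched in Python under simplifying assumptions: queries are (head, body) pairs of atom tuples, every term is a variable, subqueries are taken to be subsets of the atoms of the chased query, and only subset-minimal results are kept. The example query, the view, and all function names are invented for illustration. This is not the optimized search used in the actual system.

```python
from itertools import combinations, count

def homs(atoms, target, h=None):
    """Yield every variable mapping h such that h(atoms) is a subset of target."""
    h = dict(h or {})
    if not atoms:
        yield h
        return
    (rel, *args), rest = atoms[0], atoms[1:]
    for (trel, *targs) in target:
        if trel == rel and len(targs) == len(args):
            g = dict(h)
            if all(g.setdefault(a, t) == t for a, t in zip(args, targs)):
                yield from homs(rest, target, g)

def chase(body, deps, max_steps=100):
    """Add conclusion atoms until every dependency is satisfied (bounded)."""
    body, fresh = list(body), count()
    for _ in range(max_steps):
        step = next(((h, concl) for prem, concl in deps
                     for h in homs(prem, body)
                     if next(homs(concl, body, h), None) is None), None)
        if step is None:
            return body
        h, concl = dict(step[0]), step[1]
        for rel, *args in concl:
            body.append((rel, *[h.setdefault(a, f"_n{next(fresh)}") for a in args]))
    raise RuntimeError("chase did not terminate within max_steps")

def contained(q1, q2, deps):
    (head1, body1), (head2, body2) = q1, q2
    start = dict(zip(head2, head1))
    return next(homs(body2, chase(body1, deps), start), None) is not None

def chase_and_backchase(query, c_v, b_v):
    """Return the subset-minimal subqueries of chase(Q) equivalent to it under V."""
    head, body = query
    universal = chase(body, [c_v])            # chase phase: bring the view in
    deps, results = [c_v, b_v], []
    for k in range(1, len(universal) + 1):    # backchase phase: try subqueries
        for sub in map(list, combinations(universal, k)):
            if any(set(r) <= set(sub) for r in results):
                continue                      # a smaller solution is already inside
            if not set(head) <= {a for _, *args in sub for a in args}:
                continue                      # head variables must be bound
            if contained((head, sub), (head, universal), deps) and \
               contained((head, universal), (head, sub), deps):
                results.append(sub)
    return results

# Q(x,z) <- R(x,y), S(y,z)   and the view   V(a,c) <- R(a,b), S(b,c)
Q = (("x", "z"), [("R", "x", "y"), ("S", "y", "z")])
c_V = ([("R", "a", "b"), ("S", "b", "c")], [("V", "a", "c")])
b_V = ([("V", "a", "c")], [("R", "a", "b"), ("S", "b", "c")])

for plan in chase_and_backchase(Q, c_V, b_V):
    print(plan)
```

On this input the search returns the rewriting that scans only V and the original join of R and S. Enumerating all atom subsets is exponential, which is why constructing and exploring the search space efficiently matters in practice.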

Page 77: NTUA April 17, 2003 1 XML Query Reformulation Val Tannen University of Pennsylvania Joint work with Alin Deutsch, UC San Diego and in part with Lucian.

Preliminary Completeness Result for C&B (with L.P.)

Theorem: Any scan-minimal reformulation of Q with V is a subquery of chasecV(Q).

scan-minimal: no scan (item of the from clause) can be removed without compromising equivalence with Q.

Fewer scans means faster execution under most cost models.

Page 78: NTUA April 17, 2003 1 XML Query Reformulation Val Tannen University of Pennsylvania Joint work with Alin Deutsch, UC San Diego and in part with Lucian.

Additional Integrity Constraints

In general the storage schema contains integrity constraints that restrict its class of instances (models). This may extend the set of reformulation solutions!

Let C be a set of dependencies

Reformulating query Q(R) with view V under C :

finding X(R,V) such that Q(R) =V,C X(R,V).

That’s the same as reformulating Q under C + cV + bV

Can we still use the chase?