Semistructured data: from practice to theory. Serge Abiteboul, INRIA & Xyleme SA. June 2001. http://www-rocq.inria.fr/verso http://www.xyleme.com

Dec 20, 2015

Semistructured data: from practice to theory

Serge Abiteboul
INRIA & Xyleme SA

Serge.Abiteboul@inria.fr  Serge.Abiteboul@xyleme.com
http://www-rocq.inria.fr/verso  http://www.xyleme.com

June 2001


Organization

• Motivations
• XML
• Typing XML
• Querying XML
• XML and the Web
• Illustrations: 2 problems
  – Incomplete information
  – Xyleme
• Conclusion


Motivations


Motivation: Complex data

• Structure is irregular (missing/extra data…)

• Schema does not exist or is unknown

• Schema is rapidly evolving

• Relational and ODB models are too rigid

• Examples: BibTeX, HTML, SGML, XML, ASN.1, STEP/Express…


Complex data: mediation

[Figure: mediation architecture. A User queries a Mediator (with ontology and meta-data), which reaches six Sources, each through its own wrapper.]

Many data sources coming and going


Motivations: The Web today

• Terabytes of data

• Private web: not publicly available pages

• Deep web: data hidden behind forms

• A lot of public pages

• The standard is a document/hypertext language: HTML


The Web today

• Browsing

• Search engines
  – in: list of words
  – out: sorted list of URLs
• Applications: hand-made wrappers
  – Expensive
  – Incomplete
  – Short-lived, not adapted to the Web's constant changes

[Raghavan ’00]


A new standard: XML

• HTML is not appropriate for data exchange on the Web

• Standard database models are too constraining for the Web

• The solution: a semistructured data model, XML
  – Reminder: a data model consists of a type definition language, a query/update language + more


The most successful semistructured data model: XML


The origin of XML

• Parents
  – SGML
  – Relational and OO databases
• SGML: markup language for documents
• HTML and the Web: billions of pages
• Not appropriate for data exchange
• XML: eXtensible Markup Language
  – W3C and most industrial companies [B2B]
  – Main idea: separate content and presentation
  – Use tags to represent structure and semantics


XML: documents + databases

HTML                                          XML
– comes from SGML                             – also
– hypertext language                          – semistructured data
– fixed number of tags                        – not fixed
– content and presentation are mixed          – not mixed
– very difficult to extract data from a page  – much easier
– old standard for the Web                    – new standard


HTML = Hypertext Language

Information System

Ref   Name    Price
X23   Camera  359.99
R2D2  Robot   19350.00
Z25   PC      1299.99

HTML

The <b> X23 </b> new camera replaces the <b> X22 </b>. It comes equipped with a flash (worth by itself <i>53.99 $</i>) and provides great quality for only <i>359.99 $</i>.

Text + presentation
Where is the data?

hard


XML = Semistructured Data

Information System

Ref   Name    Price
X23   Camera  359.99
R2D2  Robot   19350.00
Z25   PC      1299.99
...

XML

<product-table>
  <product reference="X23">
    <designation> camera </designation>
    <price unit="Dollars"> 359.99 </price>
    <description> … </description>
  </product>
  <product reference="R2D2">
    <designation> Robot </designation>
    <price unit="Dollars"> 19350 </price>
    <description> … </description>
  </product>
  ...
</product-table>

Data + Structure
Semistructured: more flexible

easy
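
To make the "easy" direction concrete, here is a minimal sketch that rebuilds the (Ref, Name, Price) rows from the XML product table. It assumes the attribute values are quoted so the document is well-formed, and it leaves the elided descriptions empty; the function name `extract_rows` is illustrative only.

```python
import xml.etree.ElementTree as ET

# The product table from the slide, with quoted attribute values.
# The elided <description> contents are left empty here (an assumption).
XML = """
<product-table>
  <product reference="X23">
    <designation>camera</designation>
    <price unit="Dollars">359.99</price>
    <description/>
  </product>
  <product reference="R2D2">
    <designation>Robot</designation>
    <price unit="Dollars">19350</price>
    <description/>
  </product>
</product-table>
"""

def extract_rows(xml_text):
    """Rebuild the (Ref, Name, Price) rows of the information system."""
    root = ET.fromstring(xml_text)
    rows = []
    for product in root.findall("product"):
        ref = product.get("reference")                    # attribute
        name = product.findtext("designation").strip()    # element content
        price = float(product.findtext("price"))
        rows.append((ref, name, price))
    return rows

rows = extract_rows(XML)
```

The tags say which value is a reference, a name or a price, so no guessing is needed. With the HTML version, the same task would require heuristics over bold and italic runs.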


XML: example

<dealer>
  <UsedCars>
    <ad> <model>Honda</model> <year>96</year> </ad>
  </UsedCars>
  <NewCars>
    <ad> <model>Acura</model> </ad>
  </NewCars>
  <NewCars>
    <ad> <model>R406</model> </ad>
  </NewCars>
</dealer>

[Figure: the same document drawn as a tree. dealer has children UsedCars, NewCars, NewCars; each has one ad child; the used ad has model (Honda) and year (96); the new ads have model (Acura) and model (R406).]

It is just an unranked tagged ordered tree


XML

• Tree or graph
• Data and structure/semantics are mixed
  – Tags contain typing information
• Core constructor is list of tag/value pairs
• Details
  – Each node may have an arbitrary number of children, with distinct tags or not
  – Nodes also have attributes that are unordered and unique per node
  – Standard means to represent cyclic data: ID/IDREFS


XML

• Very active/noisy field: standards
  – types (DTD/XML Schema), style sheets (XSL), resource description (RDF...)
  – DOM, SAX…
  – WML (WAP), MathML, SMIL (multimedia), RSS (news), RDF (metadata)...
• How fast will XML conquer the web?
  – so far rather slow (about 1% of the visible web now; much more in intranets); accelerating (e.g., with Explorer 5.5)


Typing XML


Typing XML

• This is heresy for the freedom of the Web
• Essential for data management: query optimization, user interfaces, applications
• Differences with standard database typing
  – Collections are sequences instead of sets
  – Types may be very large (e.g., from integration)
  – Data is more irregular so types should be more permissive
  – New issues sometimes: you have the data and must extract its type, or an approximate type


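Intuition: the type is a tree

• Semantics and structure are in paths
  – dealer/UsedCars/ad
  – dealer/UsedCars/ad/model

[Figure: type tree. dealer has children UsedCars and NewCars; each has an ad child; the UsedCars ad has model and year, the NewCars ad has model; the leaves are text.]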
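
The idea that structure lives in paths can be sketched by collecting every root-to-node tag path of a document. This is only an illustration on the dealer example; the helper name `label_paths` is hypothetical.

```python
import xml.etree.ElementTree as ET

DEALER = """<dealer>
  <UsedCars><ad><model>Honda</model><year>96</year></ad></UsedCars>
  <NewCars><ad><model>Acura</model></ad></NewCars>
  <NewCars><ad><model>R406</model></ad></NewCars>
</dealer>"""

def label_paths(elem, prefix=""):
    """Return the set of root-to-node tag paths occurring in the document."""
    path = prefix + "/" + elem.tag if prefix else elem.tag
    paths = {path}
    for child in elem:
        paths |= label_paths(child, path)
    return paths

paths = label_paths(ET.fromstring(DEALER))
```

On this document the path dealer/UsedCars/ad/year occurs, while dealer/NewCars/ad/year does not, which is the kind of information a path-based type would record.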


DTD: a grammar

Catalog       → Product*
Product       → Name Price? Cat (Part Quantity)*
Part          → BasicPart + ComposedPart
BasicPart     → Name
ComposedPart  → Name (Part Quantity)*

• Nice and simple
• Shortcoming: type of an element is independent of its context
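
A DTD of this kind can be checked with ordinary regular expressions: each rule constrains the sequence of child tags of one element. The sketch below is one possible encoding, not a real DTD validator. It reads "+" in the Part rule as a choice, assumes that Name, Price, Cat and Quantity have no element children, and encodes each child tag as the tag followed by a space.

```python
import re
import xml.etree.ElementTree as ET

# Content models as regular expressions over child tags ("Tag " per child).
DTD = {
    "Catalog": r"(Product )*",
    "Product": r"Name (Price )?Cat (Part Quantity )*",
    "Part": r"(BasicPart |ComposedPart )",
    "BasicPart": r"Name ",
    "ComposedPart": r"Name (Part Quantity )*",
    # Assumed leaves: no element children allowed.
    "Name": r"", "Price": r"", "Cat": r"", "Quantity": r"",
}

def is_valid(elem):
    """Check an element and, recursively, all its descendants."""
    model = DTD.get(elem.tag)
    if model is None:
        return False  # tag not declared in the grammar
    word = "".join(child.tag + " " for child in elem)
    if re.fullmatch(model, word) is None:
        return False
    return all(is_valid(child) for child in elem)

good = ET.fromstring(
    "<Catalog><Product><Name>cam</Name><Cat>elec</Cat></Product></Catalog>")
bad = ET.fromstring(
    "<Catalog><Product><Cat>elec</Cat><Name>cam</Name></Product></Catalog>")
```

Note that the rule applied to an element depends only on its tag, which is exactly the context independence listed as a shortcoming.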


More complex: specialization

• Type of ad depends on its context

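[Figure: two trees. First, the specialized type: dealer has children UsedCars and NewCars; UsedCars has child adused (with model and year) and NewCars has child adnew (with model). Second, the document structure: dealer, UsedCars, NewCars, where both children are tagged ad (with model and year, and with model).]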

• One way to view it: homomorphism


Regular tree automata

• Set of accepted trees: regular tree languages

• Definable in monadic second-order logic

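[Figure: a run of a tree automaton on a dealer tree (dealer; Used, New; four ad nodes; leaves m, y, m, y, m, m). The root carries state q0, Used and New carry p and q, the ad nodes carry r, r, s, s, and the six leaves carry qf.]

Acceptance: there is a computation such that all leaves are labeled qf

• variants: top-down/bottom-up, nondeterminism, unranked trees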
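
As an illustration, here is a minimal deterministic top-down automaton in the spirit of this slide. The state names q0, p, q, r, s, qf follow the figure; the transition table is an assumption, and the sketch ignores the order and number of children, which a full automaton on unranked trees would also constrain.

```python
# DELTA[(state, tag)] maps each allowed child tag to the state passed down to it.
DELTA = {
    ("q0", "dealer"): {"Used": "p", "New": "q"},
    ("p", "Used"): {"ad": "r"},
    ("q", "New"): {"ad": "s"},
    ("r", "ad"): {"m": "qf", "y": "qf"},  # used ads may carry a year
    ("s", "ad"): {"m": "qf"},             # new ads may not
}

def accepts(tree, state="q0"):
    """A tree is a pair (tag, children). Accept if every leaf ends in qf."""
    tag, children = tree
    if not children:
        return state == "qf"
    rules = DELTA.get((state, tag))
    if rules is None:
        return False
    return all(child[0] in rules and accepts(child, rules[child[0]])
               for child in children)

ok = ("dealer", [("Used", [("ad", [("m", []), ("y", [])])]),
                 ("New", [("ad", [("m", [])])])])
bad = ("dealer", [("New", [("ad", [("m", []), ("y", [])])])])
```

The two states r and s give the same tag ad two different types depending on its context, which a plain DTD cannot express.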


DTDs+specialization

Result: DTDs+specialization = regular tree languages

• Closure (intersection, union, complement)

• Tests for validation, inclusion

• Static analysis


Situation today

• Many people are using DTDs
  – Nice and simple in spite of ugly syntax
• New proposal: XML Schema
  – More powerful but too complicated?
• Other proposals: RELAX, TREX
  – Usually based on some kind of regular tree automata
• From experience: one will win and not necessarily the best


Query languages for XML


Query Languages for XML

• Extensions of SQL: Lorel, XML-QL, XMAS…
  – first-order logic
  – information retrieval keyword search
  – navigation via regular expressions + pattern matching
• Structural recursion: UnQL, XSLT…
• No official winner; the leader is XQuery


Pattern matching

• Tree with variables and constraints

• Pattern matching between the query and the data

• Each match provides a valuation for X,Y,Z

[Figure: query pattern tree. catalog has a child product; product has children name (bound to X), price (bound to Y, with constraint < 200) and cat = elec; a subcategory node is bound to Z.]
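
A minimal sketch of such a pattern match, producing one valuation for X, Y, Z per matching product. The flat XML encoding of a product (name, price, cat, subcategory as sibling elements) is an assumption, and the products leica and chair are hypothetical data added to show non-matches.

```python
import xml.etree.ElementTree as ET

CATALOG = """
<catalog>
  <product><name>canon</name><price>120</price><cat>elec</cat><subcategory>camera</subcategory></product>
  <product><name>nikon</name><price>199</price><cat>elec</cat><subcategory>camera</subcategory></product>
  <product><name>sony</name><price>175</price><cat>elec</cat><subcategory>cdplayer</subcategory></product>
  <product><name>leica</name><price>900</price><cat>elec</cat><subcategory>camera</subcategory></product>
  <product><name>chair</name><price>50</price><cat>furniture</cat><subcategory>seat</subcategory></product>
</catalog>
"""

def match_pattern(root):
    """Return one valuation {X, Y, Z} per product matching the pattern."""
    valuations = []
    for product in root.findall("product"):
        name = product.findtext("name")
        price = product.findtext("price")
        cat = product.findtext("cat")
        subcat = product.findtext("subcategory")
        if None in (name, price, cat, subcat):
            continue  # the pattern requires all four children
        if cat == "elec" and float(price) < 200:
            valuations.append({"X": name, "Y": float(price), "Z": subcat})
    return valuations

valuations = match_pattern(ET.fromstring(CATALOG))
```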


Example in Lorel

select <offer> Z/name, P/name, P'/price </offer>
from P in catalog/product,
     Z in discount_stores/store,
     Z/storecatalog/product P'
where P/category="camera" and P/make="canon" and
      P'/id = P/id

• Joins like in relational databases
• Construction of complex results
• Regular expressions for paths (e.g., W/*/name = "Gates")
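
For readers more used to programs than to Lorel, the join in this query can be read as nested loops. The sketch below is only an analogy in Python over hypothetical in-memory data; the store names, product names and prices are invented, and it does not reproduce Lorel semantics such as path expressions or coercion.

```python
# Hypothetical data standing in for catalog/product and discount_stores/store.
catalog = [
    {"id": 1, "name": "EOS", "category": "camera", "make": "canon"},
    {"id": 2, "name": "F5", "category": "camera", "make": "nikon"},
]
discount_stores = [
    {"name": "ShopA", "storecatalog": [{"id": 1, "price": 450.0},
                                       {"id": 2, "price": 900.0}]},
    {"name": "ShopB", "storecatalog": [{"id": 1, "price": 430.0}]},
]

def offers(catalog, stores):
    """One <offer> per (store, store product) joining a canon camera on id."""
    result = []
    for p in catalog:                      # P in catalog/product
        if p["category"] != "camera" or p["make"] != "canon":
            continue
        for z in stores:                   # Z in discount_stores/store
            for p2 in z["storecatalog"]:   # Z/storecatalog/product P'
                if p2["id"] == p["id"]:    # join condition P'/id = P/id
                    result.append({"store": z["name"],
                                   "product": p["name"],
                                   "price": p2["price"]})
    return result

result = offers(catalog, discount_stores)
```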


What is new in XML queries

• A bit new: limited recursion (like in deductive databases)

• A bit new but no big deal: constructed answers (like in OODB)

• Very new: ordered data
• Bothersome
  – Theoretical base is a bit messy: FO, tree automata, bisimulation
  – No yardstick like relational calculus/algebra


Proposal: k-pebble transducers

[Figure: a k-pebble transducer with its stack]

[Milo, Suciu, Vianu]


k-pebble transducers: result

[Figure: example tree. root has children a and c; the next level holds b, a, a, b; the last level holds a, b.]


XML and the Web


Why it is the same old story

• Massive amounts of data

• Providers export data, users access data

• Query languages, indexing, optimization

• Database paradigm: still effective on the Web


Why it is not the same old story

Databases                                   The Web
• rigid structure                           • flexible, no schema
• transactions, concurrency control         • flexible protocols
• data independence                         • fuzzy separation
• controlled (e.g., known cost model)       • perfect mess (and that’s why people like it?)
• coherent system, very polished artifact   • closer to a natural ecosystem!


The principles of the Web

• The uncertainty principle: you can never be sure of anything or that the data is consistent

• The incompleteness principle: they do not give you all the data you want (but some you don’t want :-)

• The chaos principle: you can rarely assume the existence of some global schema

• The instability principle: everything keeps changing

Every piece of data you get is probably wrong, incomplete, does not conform to its expected type, and is probably already stale


What can be reused?

• Some technology? indexes, B-trees, distributed query processing (concurrency control and transactions not yet)

• Database theory? little
  – Algebra and rewrite rules for optimization
  – Dependency theory
  – First order and other logics
  – The ordering seems to open the gates to many more tools, such as regular/tree languages


Metaphor [AV]: the Web is infinite

• What are the pages pointing to my homepage?
  – Google solution: milliseconds, stale data
  – Freeze the Web: weeks to get exact answer
  – Exact answer: no means to get it
• Leads us to reconsider the notion of computation


Computability

• Finitely computable: give the answer in finite time
  – All pages reached from my HP in less than 3 links
• Eventually computable: each solution is given in finite time; computation may be infinite
  – All pages reached from my HP
• Not computable
  – Can my HP be reached starting from my HP?
• Also: approximate, partial, stale, pipelined answers
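
The first two notions can be sketched on a toy infinite Web. The `links` function below is a hypothetical stand-in for following hyperlinks (page n links to pages 2n+1 and 2n+2); the bounded search always terminates, while the generator emits every reachable page after finitely many steps but never stops on its own.

```python
from collections import deque
from itertools import islice

def links(page):
    """Hypothetical infinite Web: page n links to pages 2n+1 and 2n+2."""
    return [2 * page + 1, 2 * page + 2]

def within(start, max_links):
    """Finitely computable: pages reached in at most max_links links."""
    seen = {start}
    frontier = [start]
    for _ in range(max_links):
        next_frontier = []
        for page in frontier:
            for target in links(page):
                if target not in seen:
                    seen.add(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return seen

def reachable(start):
    """Eventually computable: each page is produced in finite time,
    but the enumeration itself may run forever."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        yield page
        for target in links(page):
            if target not in seen:
                seen.add(target)
                queue.append(target)

near = within(0, 2)                       # "less than 3 links"
first_five = list(islice(reachable(0), 5))
```

Both functions count the start page itself as reached with zero links. A consumer of `reachable` has to decide when to stop, which is where partial and pipelined answers come in.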


Tough life: the Web is huge

• Relational calculus/algebra: logspace data complexity (also AC0)

• What is the data complexity of an XQuery query over the Web?

• Complexity of computing on the Web
  – Logspace in the Web?
  – Need to trade quality for performance


The Web keeps changing

• Classical: versions, temporal queries

• Less classical: monitoring of the Web [Xyleme]
  – Smart crawling of the Web: flow of docs
  – Query subscription: query on this flow
  – Continuous queries

• What is the underlying theory?
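
One way to picture query subscription is as a set of standing predicates evaluated against each document as the crawler delivers it. The sketch below is an illustration only, not Xyleme's design; the document format, URLs and subscriber names are hypothetical.

```python
def monitor(doc_stream, subscriptions):
    """Yield (subscriber, url) whenever a crawled document satisfies
    a subscribed query. Queries are plain predicates on a document."""
    for doc in doc_stream:
        for subscriber, query in subscriptions.items():
            if query(doc):
                yield subscriber, doc["url"]

# Hypothetical flow of crawled documents, summarized by their tag sets.
crawl = [
    {"url": "http://example.org/a.xml", "tags": {"movie", "title"}},
    {"url": "http://example.org/b.xml", "tags": {"product", "price"}},
]
subscriptions = {
    "alice": lambda doc: "movie" in doc["tags"],
    "bob": lambda doc: "price" in doc["tags"],
}
notifications = list(monitor(iter(crawl), subscriptions))
```

With millions of subscriptions, testing every query against every document as done here would not scale; indexing the subscriptions themselves would be needed.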


Illustration: incomplete information

Work with Victor Vianu


Example

Access to an electronic catalog

Q1: name, subcat, price of electronic products with price less than $200

Q2: name, pictures of cameras pictured at least once


Q1: name, subcat, price of electronic products with price less than 200

[Figure: the data tree known after Q1. catalog has three known product children: (canon, 120, elec, camera), (nikon, 199, elec, camera) and (sony, 175, elec, cdplayer). It also has children product1* and product2*, marked missing.]


Missing data after Q1

[Figure: types of the data still missing after Q1. product1 has children name, price, cat, picture and subcategory, with cat != elec. product2 has the same children, with cat = elec and price > 200.]


Q2: name, pictures of cameras pictured at least once

[Figure: the data tree known after Q2. Known: the products canon (120, elec, camera), nikon (199, elec, camera) and sony (175, elec, cdplayer) from Q1, a picture c.jpg, and a new product akai (a.jpg, elec, camera). Missing: product1*, product2a, product2b*, product2c.]


[Figure: types of the known and missing data after Q2. Node labels: product1, product2a, product2b, product2c, product3, product +. Children shown: name, price, cat, picture, subcategory. Constraints shown: cat != elec; cat = elec with price > 200; cat = elec; subcategory != camera; no picture. The figure separates a "Known data" region from a "Missing data" region.]


After two queries

• Known information
  – Prefix of the real data tree
• Missing information
  – Complex type
• Q3: name, price, pictures of cameras costing less than $100 and pictured at least once
  – can be completely answered using A1, A2
• Q4: list all cameras
  – can be partially answered using A1, A2
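
Why Q3 is complete and Q4 only partial can be seen on a small example. The sketch assumes that A1 and A2 are the answers to Q1 and Q2, that a product is identified by its name, and that cameras are electronic products. The kodak entry, its picture k.jpg, and the attachment of c.jpg to canon are hypothetical.

```python
# A1: (name, subcategory, price) of electronic products under $200.
A1 = [("canon", "camera", 120), ("nikon", "camera", 199),
      ("sony", "cdplayer", 175), ("kodak", "camera", 80)]
# A2: name -> pictures, for cameras pictured at least once.
A2 = {"canon": ["c.jpg"], "akai": ["a.jpg"], "kodak": ["k.jpg"]}

def answer_q3(a1, a2):
    """Complete: any camera under $100 with a picture is in both A1 and A2."""
    return [(name, price, a2[name])
            for name, subcat, price in a1
            if subcat == "camera" and price < 100 and name in a2]

def partial_q4(a1, a2):
    """Partial: cameras at $200 or more without a picture are in neither."""
    from_a1 = {name for name, subcat, _ in a1 if subcat == "camera"}
    return sorted(from_a1 | set(a2))

q3 = answer_q3(A1, A2)
q4_known = partial_q4(A1, A2)
```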


Illustration: Xyleme


A dynamic warehouse of Web data

• Warehouse
  – Xyleme stores huge quantities of data (terabytes)
  – Xyleme is not a search engine (only index) or a mediator (only virtual data)
• XML
  – Xyleme is focused on XML
• Dynamic
  – Xyleme is interested in data evolution/changes


Technical Challenges

1. Data Acquisition and Maintenance
   discover data of interest and maintain it up to date
2. Repository
   store and index this data
3. Efficient Query Processing
4. Semantic Integration
   provide a simple view of each semantic domain
5. Change Control
   monitor the web and offer services such as Query Subscription


Technical challenges

• Scale to the web

• Size of data: billions of pages

• Size of index: terabytes

• Number of customers
  – thousands of simultaneous queries
  – millions of subscriptions


Web Heterogeneity

• Semantic domains, e.g., cinema

• Many possible types for data in this domain, many DTDs

• Semantic Integration
  – one abstract DTD for the domain
  – gives the illusion that the system maintains a homogeneous database for this domain

1 domain = 1 abstract DTD


Discover the Domains

Cluster DTDs sharing similar "tags" using data mining techniques (frequent item sets) and linguistic tools (e.g., thesaurus, heuristics to extract words from composite words or abbreviations, etc.) to obtain domains

[Figure: many concrete DTDs (cdtd1 to cdtd10) clustered into fewer abstract DTDs (adtd1, adtd2, adtd4)]


Answering queries

• Choose an abstract DTD (ADTD)
  – Automatically, manually, hybrid
• For each concrete DTD in a domain
  – Find how it relates to the abstract DTD
  – Mappings between paths in both
• Distributed query processing (cluster of PCs)
  – Many concrete DTDs; often not possible to compute a static execution plan
  – Dynamic generation of execution plans [Cluet et al]
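
The path mappings can be pictured as a lookup table from abstract paths to concrete ones, consulted when a query over the abstract DTD is translated. This is a sketch under invented names: the cinema paths and the identifiers cdtd1 and cdtd2 are hypothetical, and real mappings would be discovered rather than written by hand.

```python
# Hypothetical mappings for a "cinema" domain:
# concrete DTD -> {abstract path: concrete path}
MAPPINGS = {
    "cdtd1": {"movie/title": "film/name",
              "movie/director": "film/credits/director"},
    "cdtd2": {"movie/title": "picture/title"},
}

def rewrite(abstract_path):
    """Translate one abstract path into the concrete paths that can answer it.
    Concrete DTDs with no mapping for the path are simply left out."""
    return {cdtd: paths[abstract_path]
            for cdtd, paths in MAPPINGS.items()
            if abstract_path in paths}

titles = rewrite("movie/title")
directors = rewrite("movie/director")
```

Because the set of relevant concrete DTDs differs from one abstract path to the next, the plan for a query depends on the query itself, which fits the slide's remark that static plans are often not possible.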


Conclusion


One Question Only

• The web is turning from a large collection of documents into a huge knowledge base

When will I be able to get the precise knowledge I need?

Database + Knowledge Base + Linguistic + ...