Database Management Systems, R. Ramakrishnan: Introduction to Semistructured Data and XML, Chapter 27
Introduction to Semistructured Data and XML

Mar 15, 2016

Bruno Stokes

Page 1: Introduction to Semistructured Data and XML

Database Management Systems, R. Ramakrishnan 1

Introduction to Semistructured Data and XML

Chapter 27

Page 2: Introduction to Semistructured Data and XML

How the Web is Today

HTML documents
• often generated by applications
• consumed by humans only
• easy access: across platforms, across organizations

No application interoperability:
• HTML not understood by applications
• Database technology: client-server

Page 3: Introduction to Semistructured Data and XML

New Universal Data Exchange Format: XML

A recommendation from the W3C
XML = data
XML generated by applications
XML consumed by applications
Easy access: across platforms, organizations

Page 4: Introduction to Semistructured Data and XML

Paradigm Shift on the Web

From documents (HTML) to data (XML)
From information retrieval to data management
For databases, also a paradigm shift:
• from relational model to semistructured data
• from data processing to data/query translation
• from storage to transport

Page 5: Introduction to Semistructured Data and XML

HTML

HTML is widely used for formatting and structuring Web documents.
Designed to describe how a Web browser should arrange text, images and push-buttons on a page.
Easy to learn, but does not convey the structure and meaning of the data in Web pages.
Fixed tag set.

<HTML>
  <HEAD><TITLE>Welcome to the XML course</TITLE></HEAD>
  <BODY>
    <H1>Introduction</H1>
    <IMG SRC="dragon.jpeg" WIDTH="200" HEIGHT="150">
  </BODY>
</HTML>

[Figure callouts: opening tag, text (PCDATA), closing tag, “bachelor” tag (a tag with no closing tag), attribute name, attribute value]

Page 6: Introduction to Semistructured Data and XML

Semistructured Data

1. Information integration: important new application that motivates what follows.

2. Semistructured data: a new data model designed to cope with problems of information integration.

3. XML (Extensible Markup Language): a new Web standard that is essentially semistructured data.

4. XQuery: an emerging standard query language for XML data.

Page 7: Introduction to Semistructured Data and XML

Information Integration

Problem: related data exists in many places. The sources talk about the same things, but differ in model, schema, and conventions (e.g., terminology).

Example: In the real world, every bar has its own database.
• Some may have relations like beer-price; others have a Microsoft Word file from which the menu is printed.
• Some keep phones of manufacturers but not addresses.
• Some distinguish beers and ales; others do not.

Page 8: Introduction to Semistructured Data and XML

The Semistructured Data Model

[Figure: an Object Exchange Model (OEM) graph. The edge Bib leads to the root object &o1; further objects include &o12, &o24, &o29, &o43, &96, &243, &206 and &25. Edges carry the labels paper, book, references, author, title, year, http, publisher, page, firstname, lastname, first and last. Complex objects have outgoing edges; atomic objects hold values such as “Serge”, “Abiteboul”, 1997, “Victor”, “Vianu”, 122 and 133.]

Page 9: Introduction to Semistructured Data and XML

Characteristics of Semistructured Data

Missing or additional attributes
Multiple attributes
Different types in different objects
Heterogeneous collections

Self-describing, irregular data, no a priori structure

Page 10: Introduction to Semistructured Data and XML

Comparison with Relational Data

Relation:

name   phone
John   3634
Sue    6343
Dick   6363

Semistructured encoding:

{ row: { name: “John”, phone: 3634 },
  row: { name: “Sue”, phone: 6343 },
  row: { name: “Dick”, phone: 6363 }
}

[Figure: the same data as a tree. Three edges labeled row leave the root; each row node has a name edge and a phone edge leading to the atomic values “John” 3634, “Sue” 6343 and “Dick” 6363.]
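The relation-to-tree correspondence can be sketched in Python, assuming each row is a plain dictionary. The helper name `relation_to_tree` and the extra "Ann" row are illustrative assumptions, not part of the slides.

```python
def relation_to_tree(rows):
    """Encode a relation as a list of ("row", {...}) edges leaving one root."""
    return [("row", dict(r)) for r in rows]

rows = [
    {"name": "John", "phone": 3634},
    {"name": "Sue", "phone": 6343},
    {"name": "Dick", "phone": 6363},
]
tree = relation_to_tree(rows)

# Unlike a relation, the tree tolerates irregular rows: this hypothetical
# row carries an attribute that no other row has.
tree.append(("row", {"name": "Ann", "phone": 1111, "email": "ann@example.org"}))
```

Every row of the relation becomes one row edge, and nothing forces all rows to share the same attributes.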

Page 11: Introduction to Semistructured Data and XML

XML (Extensible Markup Language)

A W3C standard to complement HTML
Origins: structured text (SGML)
• Large-scale electronic publishing
• Data exchange on the web
Motivation:
• HTML describes presentation
• XML describes content

Page 12: Introduction to Semistructured Data and XML

From HTML to XML

HTML describes the presentation

Page 13: Introduction to Semistructured Data and XML

HTML

<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i>
    Abiteboul, Hull, Vianu
    <br> Addison Wesley, 1995
<p> <i> Data on the Web </i>
    Abiteboul, Buneman, Suciu
    <br> Morgan Kaufmann, 1999

Page 14: Introduction to Semistructured Data and XML

XML

<bibliography>
  <book>
    <title> Foundations… </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
  …
</bibliography>

XML describes the content

Page 15: Introduction to Semistructured Data and XML

Why are we DB’ers interested?

It’s data. That’s us.
Database issues:
• How are we going to model XML? (graphs)
• How are we going to query XML? (XQuery)
• How are we going to store XML? (in a relational database? object-oriented? native?)
• How are we going to process XML efficiently? (many interesting research questions!)

Page 16: Introduction to Semistructured Data and XML

XML Terminology

Tags: book, title, author, …
• start tag: <book>, end tag: </book>
Elements: <book>…</book>, <author>…</author>
• elements can be nested
• empty element: <red></red> (can be abbreviated <red/>)
XML document: has a single root element
Well-formed XML document: has matching tags
Valid XML document: conforms to a schema

Page 17: Introduction to Semistructured Data and XML

Well-Formed XML

1. Declaration = <? ... ?>.
• Normal declaration is <?xml version="1.0" standalone="yes"?>
• “Standalone” means that the document does not depend on an external DTD.
2. Root tag surrounds the entire balance of the document. <FOO> is balanced by </FOO>, as in HTML.
3. Any balanced structure of tags is OK.
• Unlike HTML’s <P>, no tag may be left unbalanced; an empty element may be abbreviated as a self-closing tag such as <P/>.

Page 18: Introduction to Semistructured Data and XML

XML: An Example

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<BOOKLIST>
  <BOOK genre="Science" format="Hardcover">
    <AUTHOR>
      <FIRSTNAME>Richard</FIRSTNAME><LASTNAME>Feynman</LASTNAME>
    </AUTHOR>
    <TITLE>The Character of Physical Law</TITLE>
    <PUBLISHED>1980</PUBLISHED>
  </BOOK>
  <BOOK genre="Fiction">
    <AUTHOR>
      <FIRSTNAME>R.K.</FIRSTNAME><LASTNAME>Narayan</LASTNAME>
    </AUTHOR>
    <TITLE>Waiting for the Mahatma</TITLE>
    <PUBLISHED>1981</PUBLISHED>
  </BOOK>
  <BOOK genre="Fiction">
    <AUTHOR>
      <FIRSTNAME>R.K.</FIRSTNAME><LASTNAME>Narayan</LASTNAME>
    </AUTHOR>
    <TITLE>The English Teacher</TITLE>
    <PUBLISHED>1980</PUBLISHED>
  </BOOK>
</BOOKLIST>
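As an illustration of how an application might consume such a document, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. It uses a shortened copy of the book list, and the dictionary layout is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<BOOKLIST>
  <BOOK genre="Science" format="Hardcover">
    <AUTHOR><FIRSTNAME>Richard</FIRSTNAME><LASTNAME>Feynman</LASTNAME></AUTHOR>
    <TITLE>The Character of Physical Law</TITLE>
    <PUBLISHED>1980</PUBLISHED>
  </BOOK>
  <BOOK genre="Fiction">
    <AUTHOR><FIRSTNAME>R.K.</FIRSTNAME><LASTNAME>Narayan</LASTNAME></AUTHOR>
    <TITLE>Waiting for the Mahatma</TITLE>
    <PUBLISHED>1981</PUBLISHED>
  </BOOK>
</BOOKLIST>"""

# Parse from bytes so the parser honours the declared encoding.
root = ET.fromstring(DOC.encode("utf-8"))

books = [
    {
        "title": book.findtext("TITLE"),
        "author": book.findtext("AUTHOR/LASTNAME"),
        "genre": book.get("genre"),
        "format": book.get("format"),  # None when the attribute is absent
    }
    for book in root.findall("BOOK")
]
```

The second book has no format attribute, so its entry holds None: missing attributes are a normal case in semistructured data.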

Page 19: Introduction to Semistructured Data and XML

XML – Elements

<BOOK genre="Science" format="Hardcover">…</BOOK>

[Figure callouts: open tag, element name, attribute, attribute value, data, closing tag]

XML is case and space sensitive
Element opening and closing tag names must be identical
Opening tags: “<” + element name + “>”
Closing tags: “</” + element name + “>”
Empty elements have no data and no closing tag:
• They begin with a “<” and end with a “/>”, e.g. <BOOK/>

Page 20: Introduction to Semistructured Data and XML

XML – Attributes

<BOOK genre="Science" format="Hardcover">…</BOOK>

Attributes provide additional information for element tags.
There can be zero or more attributes in every element; each one has the form:
attribute_name='attribute_value'
- By convention there is no space between the name and the “=”
- Attribute values must be surrounded by " or ' characters
Multiple attributes are separated by white space (one or more spaces, tabs or line breaks).

Page 21: Introduction to Semistructured Data and XML

Elements

The segment of an XML document between an opening and a corresponding closing tag is called an element.

<person>
  <name> Malcolm Atchison </name>
  <tel> (215) 898 4321 </tel>
  <tel> (215) 898 4321 </tel>
  <email> [email protected] </email>
</person>

[Figure callouts: element; element, a sub-element of the enclosing element; not an element]

Page 22: Introduction to Semistructured Data and XML

XML – Data and Comments

<BOOK genre="Science" format="Hardcover">…</BOOK>

XML data is any information between an opening and closing tag
XML data must not contain the “<” or “&” characters literally; write them as &lt; and &amp; (“>” may be written as &gt;)
Comments: <!-- comment -->
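A short sketch of the escaping rule, using `escape` from Python's standard `xml.sax.saxutils`; the element name `note` and the sample text are made up for the example.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

raw = "price < 10 & weight > 2"

# escape() rewrites &, < and > as entity references so the text can sit
# safely between an opening and a closing tag.
encoded = escape(raw)

doc = f"<note>{encoded}<!-- the default parser drops comments --></note>"
note = ET.fromstring(doc)
```

After parsing, `note.text` holds the original unescaped string again, and the comment leaves no trace in the element's data.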

Page 23: Introduction to Semistructured Data and XML

XML text

XML has only one “basic” type: text.
It is bounded by tags, e.g.
<title> The Big Sleep </title>
<year> 1935 </year>   (1935 is still text)
XML text is called PCDATA (for parsed character data). It is Unicode text; every XML parser must accept the UTF-8 and UTF-16 encodings.

Page 24: Introduction to Semistructured Data and XML

XML – Nesting & Hierarchy

XML tags can be nested in a tree hierarchy
XML documents can have only one root tag
Between an opening and closing tag you can insert:
1. Data
2. More elements
3. A combination of data and elements

<root>
  <tag1>
    Some Text
    <tag2>More</tag2>
  </tag1>
</root>
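A minimal sketch of how mixed content (data combined with elements) surfaces in Python's `xml.etree.ElementTree`. The `depth` helper is an illustrative addition; the document is the one from the slide, written on one line.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<root><tag1> Some Text <tag2>More</tag2> </tag1></root>")
tag1 = root.find("tag1")
tag2 = tag1.find("tag2")

# In ElementTree, text before the first child is .text of the parent;
# text following a child element is stored as that child's .tail.
leading_text = tag1.text
nested_text = tag2.text
trailing_text = tag2.tail

def depth(elem):
    """Height of the element tree rooted at elem (a lone element has depth 1)."""
    return 1 + max((depth(child) for child in elem), default=0)
```

Here `depth(root)` counts three nesting levels: root, tag1 and tag2.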

Page 25: Introduction to Semistructured Data and XML

Representing relational DBs: Two ways

projects:   title   budget   managedBy
employees:  name    ssn      age

Page 26: Introduction to Semistructured Data and XML

Project and Employee relations in XML

<db>
  <project>
    <title> Pattern recognition </title>
    <budget> 10000 </budget>
    <managedBy> Joe </managedBy>
  </project>
  <employee>
    <name> Joe </name>
    <ssn> 344556 </ssn>
    <age> 34 </age>
  </employee>
  <employee>
    <name> Sandra </name>
    <ssn> 2234 </ssn>
    <age> 35 </age>
  </employee>
  <project>
    <title> Auto guided vehicle </title>
    <budget> 70000 </budget>
    <managedBy> Sandra </managedBy>
  </project>
  :
</db>

Projects and employees are intermixed

Page 27: Introduction to Semistructured Data and XML

Project and Employee relations in XML (cont’d)

<db>
  <projects>
    <project>
      <title> Pattern recognition </title>
      <budget> 10000 </budget>
      <managedBy> Joe </managedBy>
    </project>
    <project>
      <title> Auto guided vehicles </title>
      <budget> 70000 </budget>
      <managedBy> Sandra </managedBy>
    </project>
    :
  </projects>
  <employees>
    <employee>
      <name> Joe </name>
      <ssn> 344556 </ssn>
      <age> 34 </age>
    </employee>
    <employee>
      <name> Sandra </name>
      <ssn> 2234 </ssn>
      <age> 35 </age>
    </employee>
    :
  </employees>
</db>

Employees follow projects
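Both encodings can be generated from the same tuples. The sketch below uses Python's `xml.etree.ElementTree`; the helper names `add_rows`, `grouped_db` and `flat_db` are hypothetical and chosen only for this example.

```python
import xml.etree.ElementTree as ET

PROJECT_COLS = ("title", "budget", "managedBy")
EMPLOYEE_COLS = ("name", "ssn", "age")
projects = [("Pattern recognition", 10000, "Joe"),
            ("Auto guided vehicle", 70000, "Sandra")]
employees = [("Joe", 344556, 34), ("Sandra", 2234, 35)]

def add_rows(parent, tag, columns, rows):
    """Append one <tag> element per tuple, with one child per column."""
    for row in rows:
        elem = ET.SubElement(parent, tag)
        for column, value in zip(columns, row):
            ET.SubElement(elem, column).text = str(value)

def grouped_db():
    """Second encoding: <projects> and <employees> wrappers under <db>."""
    db = ET.Element("db")
    add_rows(ET.SubElement(db, "projects"), "project", PROJECT_COLS, projects)
    add_rows(ET.SubElement(db, "employees"), "employee", EMPLOYEE_COLS, employees)
    return db

def flat_db():
    """First encoding: tuples of both relations sit directly under <db>."""
    db = ET.Element("db")
    add_rows(db, "project", PROJECT_COLS, projects)
    add_rows(db, "employee", EMPLOYEE_COLS, employees)
    return db
```

Calling `ET.tostring(grouped_db(), encoding="unicode")` serializes the result; in the flat encoding the order of tuples under db is arbitrary.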

Page 28: Introduction to Semistructured Data and XML

More XML: Oids and References

<person id="o555">
  <name> Jane </name>
</person>
<person id="o456">
  <name> Mary </name>
  <children idref="o123 o555"/>
</person>
<person id="o123" mother="o456">
  <name> John </name>
</person>

oids and references in XML are just syntax
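Because oids and references are just syntax, following a reference is left to the application. A minimal sketch in Python, assuming the three person elements are wrapped in a root element named `people` (the slide shows only a fragment, so the wrapper is an assumption):

```python
import xml.etree.ElementTree as ET

DOC = """<people>
  <person id="o555"><name>Jane</name></person>
  <person id="o456"><name>Mary</name><children idref="o123 o555"/></person>
  <person id="o123" mother="o456"><name>John</name></person>
</people>"""

root = ET.fromstring(DOC)

# Index every person by its id attribute so references can be looked up.
by_id = {person.get("id"): person for person in root.iter("person")}

def children_of(oid):
    """Names of the persons listed in the idref attribute of <children>."""
    refs = by_id[oid].find("children")
    if refs is None:
        return []
    return [by_id[ref].findtext("name") for ref in refs.get("idref").split()]

def mother_of(oid):
    """Name of the person referenced by the mother attribute, if any."""
    ref = by_id[oid].get("mother")
    return by_id[ref].findtext("name") if ref else None
```

The parser itself attaches no meaning to id, idref or mother; the dictionary lookup is what turns the tree into a graph.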

Page 29: Introduction to Semistructured Data and XML

XML Data Model (Graph)

[Figure: an XML document drawn as a graph whose nodes are numbered #0 to #7. Labels visible in the figure: db, book (with ids b1 and b2), title, author, pub, publisher (with id mkp), name, state and pcdata. The pcdata leaves hold values such as “Complete...”, “Principles...”, “Chamberlin”, “Bernstein”, “Newcomer”, “Morgan...” and “CA”.]

Page 30: Introduction to Semistructured Data and XML

Document Type Definitions (DTDs)

<!ELEMENT Book (title, author*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (name, address, age?)>
<!ATTLIST Book id ID #REQUIRED>
<!ATTLIST Book pub IDREF #IMPLIED>

Sort of like a schema but not really.

Inherited from SGML DTD standard
BNF grammar establishing constraints on element structure and content
Definitions of entities

Page 31: Introduction to Semistructured Data and XML

DTD – An Example

<?xml version='1.0'?>
<!ELEMENT Basket (Cherry+, (Apple | Orange)*)>
<!ELEMENT Cherry EMPTY>
<!ATTLIST Cherry flavor CDATA #REQUIRED>
<!ELEMENT Apple EMPTY>
<!ATTLIST Apple color CDATA #REQUIRED>
<!ELEMENT Orange EMPTY>
<!ATTLIST Orange location CDATA 'Florida'>
--------------------------------------------------------------------------------
<Basket>
  <Apple/>
  <Cherry flavor='good'/>
  <Orange/>
</Basket>

<Basket>
  <Cherry flavor='good'/>
  <Apple color='red'/>
  <Apple color='green'/>
</Basket>

The first basket violates the DTD (Apple precedes Cherry and lacks its required color attribute); the second conforms to it.

Page 32: Introduction to Semistructured Data and XML

DTD - !ELEMENT

<!ELEMENT Basket (Cherry+, (Apple | Orange)*)>

Here Basket is the name and (Cherry+, (Apple | Orange)*) describes the children.

!ELEMENT declares an element name, and what child elements it should have

Content types:
• Other elements
• #PCDATA (parsed character data)
• EMPTY (no content)
• ANY (no checking inside this structure)
• A regular expression

Page 33: Introduction to Semistructured Data and XML

DTD - !ELEMENT (Contd.)

A regular expression has the following structure:
• exp1, exp2, exp3, …, expk: A list of regular expressions
• exp*: An optional expression with zero or more occurrences
• exp+: A required expression with one or more occurrences
• exp1 | exp2 | … | expk: A disjunction of expressions

Page 34: Introduction to Semistructured Data and XML

DTD - !ATTLIST

<!ATTLIST Cherry flavor CDATA #REQUIRED>

<!ATTLIST Orange location CDATA #REQUIRED
                 color    CDATA 'orange'>

In the first declaration, Cherry is the element, flavor the attribute, CDATA the type and #REQUIRED the flag.

!ATTLIST defines a list of attributes for an element
Attributes can be of different types, can be required or not required, and they can have default values.

Page 35: Introduction to Semistructured Data and XML

DTD – Well-Formed and Valid

<?xml version='1.0'?>
<!ELEMENT Basket (Cherry+)>
<!ELEMENT Cherry EMPTY>
<!ATTLIST Cherry flavor CDATA #REQUIRED>
--------------------------------------------------------------------------------
Well-Formed and Valid
<Basket>
  <Cherry flavor='good'/>
</Basket>

Not Well-Formed
<basket>
  <Cherry flavor=good>
</Basket>

Well-Formed but Invalid
<Job>
  <Location>Home</Location>
</Job>
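The three cases can be told apart with a small sketch. Python's standard `xml.etree.ElementTree` checks well-formedness only and does not validate against a DTD, so the validity test below is hand-written for this one Basket DTD; the function name `classify` is an assumption.

```python
import xml.etree.ElementTree as ET

def classify(text):
    """Classify a document against the Basket (Cherry+) DTD from the slide."""
    try:
        root = ET.fromstring(text)      # fails on any well-formedness error
    except ET.ParseError:
        return "not well-formed"
    cherries_ok = all(
        child.tag == "Cherry"                  # Basket may contain only Cherry
        and len(child) == 0                    # Cherry is EMPTY: no sub-elements
        and child.text is None                 # ... and no character data
        and set(child.attrib) == {"flavor"}    # flavor is required; nothing else is declared
        for child in root
    )
    valid = root.tag == "Basket" and len(root) >= 1 and cherries_ok
    return "well-formed and valid" if valid else "well-formed but invalid"
```

A validating parser would derive these checks from the DTD itself; this sketch hard-codes them to keep the distinction between the two notions visible.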

Page 36: Introduction to Semistructured Data and XML

Example: An Address Book

<person>
  <name> MacNiel, John </name>
  <greet> Dr. John MacNiel </greet>
  <addr> 1234 Huron Street </addr>
  <addr> Rome, OH 98765 </addr>
  <tel> (321) 786 2543 </tel>
  <fax> (321) 786 2543 </fax>
  <tel> (321) 786 2543 </tel>
  <email> [email protected] </email>
</person>

• name: exactly one name
• greet: at most one greeting
• addr: as many address lines as needed (in order)
• tel, fax: mixed telephones and faxes
• email: as many as needed

Page 37: Introduction to Semistructured Data and XML

Specifying the structure

name            to specify a name element
greet?          to specify an optional (0 or 1) greet element
name, greet?    to specify a name followed by an optional greet

Page 38: Introduction to Semistructured Data and XML

Specifying the structure (cont)

addr*           to specify 0 or more address lines
tel | fax       a tel or a fax element
(tel | fax)*    0 or more repeats of tel or fax
email*          0 or more email elements

Page 39: Introduction to Semistructured Data and XML

A DTD for the address book

<!DOCTYPE addressbook [
  <!ELEMENT addressbook (person*)>
  <!ELEMENT person (name, greet?, addr*, (fax | tel)*, email*)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT greet (#PCDATA)>
  <!ELEMENT addr (#PCDATA)>
  <!ELEMENT tel (#PCDATA)>
  <!ELEMENT fax (#PCDATA)>
  <!ELEMENT email (#PCDATA)>
]>
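A DTD content model is a regular expression over child element names, so it can be checked with an ordinary regex engine. The sketch below translates the person content model by hand into Python's `re` syntax, using the `addr` tag of the example document; the function name is an assumption, and a real validating parser would do this from the DTD.

```python
import re
import xml.etree.ElementTree as ET

# (name, greet?, addr*, (fax | tel)*, email*) over a sequence of child tag
# names, each followed by one space so that names cannot run together.
PERSON_MODEL = re.compile(r"name (greet )?(addr )*((fax|tel) )*(email )*")

def matches_person_model(person):
    """True if the children of <person> appear in an order the model allows."""
    child_tags = "".join(child.tag + " " for child in person)
    return PERSON_MODEL.fullmatch(child_tags) is not None

ok = ET.fromstring(
    "<person><name>n</name><addr>a1</addr><addr>a2</addr>"
    "<tel>1</tel><fax>2</fax><tel>3</tel><email>e</email></person>"
)
bad = ET.fromstring("<person><greet>g</greet><name>n</name></person>")
```

The first person matches; the second does not, because greet appears before name.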

Page 40: Introduction to Semistructured Data and XML

DTD for the example relational DB

<!DOCTYPE db [
  <!ELEMENT db (projects, employees)>
  <!ELEMENT projects (project*)>
  <!ELEMENT employees (employee*)>
  <!ELEMENT project (title, budget, managedBy)>
  <!ELEMENT employee (name, ssn, age)>
  ...
]>

Page 41: Introduction to Semistructured Data and XML

Summary of XML regular expressions

Each element name is a tag. Its components are the tags that appear nested within, in the order specified.

A          the tag A occurs
e1,e2      the expression e1 followed by e2
e*         0 or more occurrences of e
e?         optional: 0 or 1 occurrences
e+         1 or more occurrences
e1 | e2    either e1 or e2
(e)        grouping

Page 42: Introduction to Semistructured Data and XML

XML Querying

Path Expressions:
Bib.paper
Bib.book.publisher
Bib.paper.author.lastname

Given an OEM instance, the value of a path expression p is a set of objects

Page 43: Introduction to Semistructured Data and XML

Path Expressions

Examples:

DB = [Figure: the OEM bibliography graph, entered through the edge Bib at the root &o1, with the object identifiers &o12, &o24, &o29, &o43 to &o52, &o70, &o71, &96, &243, &206 and &25 shown on its nodes and the edge labels paper, book, references, author, title, year, http, publisher, page, firstname, lastname, first and last.]

Bib.paper = {&o12, &o29}
Bib.book.publisher = {&o51}
Bib.paper.author.lastname = {&o71, &206}
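Path evaluation can be sketched as repeated edge-following over sets of objects. The figure did not survive extraction, so the edge table below is an assumed fragment, wired only so that it reproduces the three results stated on the slide; the start node `root` and the function name `eval_path` are also assumptions.

```python
# Each complex object maps to its outgoing (label, target) edges.
# ASSUMPTION: this wiring is a partial reconstruction, not the full figure.
EDGES = {
    "root": [("Bib", "&o1")],
    "&o1": [("paper", "&o12"), ("book", "&o24"), ("paper", "&o29")],
    "&o12": [("author", "&o43")],
    "&o24": [("publisher", "&o51")],
    "&o29": [("author", "&96")],
    "&o43": [("firstname", "&o70"), ("lastname", "&o71")],
    "&96": [("firstname", "&243"), ("lastname", "&206")],
}

def eval_path(path, start="root"):
    """Return the set of objects reached by following the labels in order."""
    current = {start}
    for label in path.split("."):
        current = {
            target
            for obj in current
            for edge_label, target in EDGES.get(obj, [])
            if edge_label == label
        }
    return current
```

Each step maps a set of objects to the set of their children along one label, which is why the value of a path expression is a set.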

Page 44: Introduction to Semistructured Data and XML

XQuery

Emerging standard for querying XML documents. Basic form:

FOR <variables ranging over sets of elements>
WHERE <condition>
RETURN <set of elements>;

Sets of elements are described by paths, consisting of:
1. URL, if necessary.
2. Element names forming a path in the semistructured data graph, e.g., //BAR/NAME = “start at any BAR node and go to a NAME child.”
3. Ending condition of the form [<condition about subelements, @attributes, and values>]

Page 45: Introduction to Semistructured Data and XML

XQuery

Overview: FOR-LET-WHERE-ORDERBY-RETURN = FLWOR

[Figure: processing pipeline. The FOR/LET clauses produce a list of tuples; the WHERE clause filters it into a list of tuples; the ORDERBY/RETURN clause produces an instance of the XQuery data model.]

Page 46: Introduction to Semistructured Data and XML

XQuery

FOR $x IN expr -- binds $x to each value in the list expr
LET $x := expr -- binds $x to the entire list expr
• Useful for common subexpressions and for aggregations

Page 47: Introduction to Semistructured Data and XML

FOR vs. LET

FOR $x IN document("bib.xml")/bib/book
RETURN <result> $x </result>

Returns:
<result> <book>...</book> </result>
<result> <book>...</book> </result>
<result> <book>...</book> </result>
...

LET $x := document("bib.xml")/bib/book
RETURN <result> $x </result>

Returns:
<result> <book>...</book> <book>...</book> <book>...</book> ... </result>
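The difference between the two bindings can be mimicked in Python: FOR corresponds to iterating over the list, LET to naming the whole list once. The tiny bib document and the `wrap` helper are stand-ins invented for the example; this is an analogy, not an XQuery engine.

```python
import xml.etree.ElementTree as ET

bib = ET.fromstring(
    "<bib><book><title>abc</title></book>"
    "<book><title>def</title></book>"
    "<book><title>ghi</title></book></bib>"
)
books = bib.findall("book")

def wrap(children):
    """Build a <result> element containing the given elements."""
    result = ET.Element("result")
    result.extend(children)
    return result

# FOR: $x is bound to each book in turn, so RETURN runs once per book.
for_results = [wrap([book]) for book in books]

# LET: $x is bound to the entire list, so RETURN runs exactly once.
let_result = wrap(books)
```

FOR yields three result elements with one book each; LET yields a single result element holding all three books.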

Page 48: Introduction to Semistructured Data and XML

XQuery

Find all book titles published after 1995:

FOR $x IN document("bib.xml")/bib/book

WHERE $x/year > 1995

RETURN $x/title

Result: <title> abc </title> <title> def </title> <title> ghi </title>
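For comparison, the same FOR/WHERE/RETURN shape maps onto a Python list comprehension. The sample data below is hypothetical and merely stands in for bib.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for bib.xml.
BIB = ET.fromstring("""<bib>
  <book><title>abc</title><year>1999</year></book>
  <book><title>old</title><year>1995</year></book>
  <book><title>def</title><year>2001</year></book>
</bib>""")

titles = [
    book.findtext("title")                   # RETURN $x/title
    for book in BIB.findall("book")          # FOR $x IN .../bib/book
    if int(book.findtext("year")) > 1995     # WHERE $x/year > 1995
]
```

The book published in 1995 is excluded because the comparison is strict.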

Page 49: Introduction to Semistructured Data and XML

XQuery

For each author of a book by Morgan Kaufmann, list all books s/he published:

FOR $a IN distinct(document("bib.xml")/bib/book[publisher="Morgan Kaufmann"]/author)
RETURN <result>
         $a,
         FOR $t IN /bib/book[author=$a]/title
         RETURN $t
       </result>

distinct = a function that eliminates duplicates

Page 50: Introduction to Semistructured Data and XML

XQuery

Result:
<result>
  <author> Jones </author>
  <title> abc </title>
  <title> def </title>
</result>
<result>
  <author> Smith </author>
  <title> ghi </title>
</result>
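The nested query can be traced in Python on a small hypothetical bib document, chosen here so that it yields the result shown: the outer loop collects the distinct Morgan Kaufmann authors, and the inner comprehension gathers every title by each of them.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for bib.xml.
BIB = ET.fromstring("""<bib>
  <book><title>abc</title><author>Jones</author><publisher>Morgan Kaufmann</publisher></book>
  <book><title>def</title><author>Jones</author><publisher>Addison Wesley</publisher></book>
  <book><title>ghi</title><author>Smith</author><publisher>Morgan Kaufmann</publisher></book>
</bib>""")
books = BIB.findall("book")

# distinct(...): authors of Morgan Kaufmann books, duplicates removed,
# first-seen order kept.
authors = []
for book in books:
    if book.findtext("publisher") == "Morgan Kaufmann":
        for author in book.findall("author"):
            if author.text not in authors:
                authors.append(author.text)

# Inner FOR: all titles of books written by that author, from any publisher.
results = {
    name: [
        book.findtext("title")
        for book in books
        if name in [a.text for a in book.findall("author")]
    ]
    for name in authors
}
```

Jones is listed with both titles even though only one of them is a Morgan Kaufmann book, because the inner query does not filter on publisher.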

Page 51: Introduction to Semistructured Data and XML

XQuery

count = an aggregate function that returns the number of elements

<big_publishers>
  FOR $p IN distinct(document("bib.xml")//publisher)
  LET $b := document("bib.xml")//book[publisher = $p]
  WHERE count($b) > 100
  RETURN $p
</big_publishers>

Page 52: Introduction to Semistructured Data and XML

XQuery

Find books whose price is larger than average:

LET $a := avg(document("bib.xml")/bib/book/price)
FOR $b IN document("bib.xml")/bib/book
WHERE $b/price > $a
RETURN $b
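Aggregates such as avg and count act on a whole list, which is what LET provides. A Python sketch on hypothetical data; the publisher names, the prices, and the threshold of more than one book (instead of the 100 used with count in the deck) are assumptions made to keep the example small.

```python
from collections import Counter
from statistics import mean
import xml.etree.ElementTree as ET

# Hypothetical stand-in for bib.xml.
BIB = ET.fromstring("""<bib>
  <book><title>abc</title><publisher>P1</publisher><price>10</price></book>
  <book><title>def</title><publisher>P1</publisher><price>20</price></book>
  <book><title>ghi</title><publisher>P2</publisher><price>60</price></book>
</bib>""")
books = BIB.findall("book")

# LET $a := avg(.../price): one value computed over the entire list.
average = mean(float(book.findtext("price")) for book in books)

# FOR $b ... WHERE $b/price > $a RETURN $b (titles kept for brevity).
above_average = [
    book.findtext("title")
    for book in books
    if float(book.findtext("price")) > average
]

# count($b) per distinct publisher, then a WHERE-style filter on the count.
per_publisher = Counter(book.findtext("publisher") for book in books)
big_publishers = [p for p, n in per_publisher.items() if n > 1]
```

The average is computed once before the loop, mirroring how the LET clause precedes the FOR clause in the query.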

Page 53: Introduction to Semistructured Data and XML

Examples for XQuery queries

FOR $x IN doc("www.company.com/info.xml")//employee[employeeSalary gt 70000]/employeeName
RETURN <res> $x/firstName, $x/lastName </res>

FOR $x IN doc("www.company.com/info.xml")/company/employee
WHERE $x/employeeSalary gt 70000
RETURN <res> $x/employeeName/firstName, $x/employeeName/lastName </res>

FOR $x IN doc("www.company.com/info.xml")/company/project[projectNumber = 5]/projectWorker,
    $y IN doc("www.company.com/info.xml")/company/employee
WHERE $x/hours gt 20.0 AND $y/ssn = $x/ssn
RETURN <res> $y/employeeName/firstName, $y/employeeName/lastName, $x/hours </res>