1 XML – Extensible Markup Language DBI – Representation and Management of Data on the Internet

Dec 21, 2015


XML – Extensible Markup Language

DBI – Representation and Management of Data on the Internet

Part I: Background

• What’s the difference between
  – The world of documents and information retrieval, and
  – Databases and query interfaces?

Documents vs. Databases

Document World
• Plenty of small documents
• Usually static
• Implicit structure: section, paragraph, table of contents
• Tagging

Database World
• A few large databases
• Usually dynamic
• Explicit structure: schema
• Records

Documents vs. Databases (cont’d)

Document World
• Human friendly
• Content: form/layout, annotation
• Paradigms: “Save as”, WYSIWYG
• Meta-data: author name, date, subject

Database World
• Machine friendly
• Content: schema, data, methods
• Paradigms: Atomicity, Consistency, Isolation, Durability
• Meta-data: schema description

What can be Done with Them

Documents / Database

[Figure: operations placed between the two worlds: editing, printing, spell-checking, counting words, retrieving (IR), searching, updating, clustering, cleaning, querying, adjusting, transforming]

HTML

• Hypertext Markup Language
• Used for publishing hypertext on the World-Wide Web
• Designed to describe how a Web browser should arrange text, images and push-buttons on a page
• Easy to learn, but does not convey structure
• Fixed tag set

HTML Example

<HTML>
<HEAD><TITLE>Welcome to the DBI course</TITLE></HEAD>
<BODY>
  <H1>Introduction</H1>
  <IMG SRC="dragon.gif" WIDTH="200" HEIGHT="150">
</BODY>
</HTML>

Callouts on the slide: opening tag, text (PCDATA), closing tag, “bachelor” tag, attribute name, attribute value

HTML

• The World-Wide Web is constructed from HTML documents
• We can apply information-retrieval techniques to a set of documents
  – For example, clustering as Google does
• How can we apply database techniques to the Web?

HTML Pages

• We can
  – Edit (and put on the Web)
  – Print (or view with a browser)
  – Spell-check
  – Count words
  – Retrieve (again, with a browser)
  – Search (with a search engine, for example)
  – Cluster

How can we Ask Queries?

• How can we automatically find the cheapest flight from Israel to Micronesia, knowing the Web sites of all airlines that have flights to Micronesia?
• How can we automatically find the phone numbers of people who advertised on the Web that they want to sell a car for a price that is not greater than 30,000 IS?
• It can be useful to query data as we do in databases


Thin Red Line

• The line between the document world and the database world is not clear

• In some cases, both approaches are legitimate

• An interesting middle ground is data formats – of which XML is an example

The Structure of XML

• XML consists of tags and text
• Tags come in pairs: <date> ... </date>
• They must be properly nested
  – good: <date> ... <day> ... </day> ... </date>
  – bad: <date> ... <day> ... </date> ... </day>

(You can’t do <i> ... <b> ... </i> ... </b> in HTML)
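The nesting rule can be tried out with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample text inside the tags are assumptions made for illustration, not part of the course material.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # The parser rejects improperly nested tags.
        return False

good = "<date> 21 <day> Monday </day> Dec </date>"
bad = "<date> 21 <day> Monday </date> Dec </day>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

The first call succeeds because day closes before date; the second fails because the closing tags are interleaved.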

XML Text

• XML has only one “basic” type – text
• It is bounded by tags, e.g., <title> The Big Sleep </title> <year> 1935 </year> – 1935 is still text
• XML text is called PCDATA (for parsed character data)
• It uses Unicode; a character can also be written as a numeric reference, e.g., &#x05DE; for the Hebrew letter Mem

XML Structure

• Nesting tags can be used to express various structures, e.g., a tuple (record):

<person>
  <name> Lisa Simpson</name>
  <tel> 02-828-1234 </tel>
  <tel> 054-470-777 </tel>
  <email> [email protected] </email>
</person>
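To see how such a record can be consumed by a program, here is a small sketch that reads the name and the repeated tel elements with Python's standard xml.etree.ElementTree. The email element is left out of the sample because its value is not recoverable from the transcript; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

record = """
<person>
  <name> Lisa Simpson</name>
  <tel> 02-828-1234 </tel>
  <tel> 054-470-777 </tel>
</person>
""".strip()

person = ET.fromstring(record)

# A field that occurs once is read with findtext; a repeated field with findall.
name = person.findtext("name").strip()
tels = [tel.text.strip() for tel in person.findall("tel")]

print(name, tels)
```

Note that every value comes back as a string, which matches the point that text is the only basic type.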

XML Structure (cont’d)

• We can represent a list by using the same tag repeatedly:

<addresses>
  <person> … </person>
  <person> … </person>
  <person> … </person>
  <person> … </person>
  …
</addresses>

XML Structure (cont’d)

<addresses>
  <person>
    <name> Donald Duck</name>
    <tel> 04-828-1345 </tel>
    <email> [email protected] </email>
  </person>
  <person>
    <name> Miki Mouse</name>
    <tel> 03-426-1142 </tel>
    <email>[email protected]</email>
  </person>
</addresses>

Terminology

The segment of an XML document between an opening and a corresponding closing tag is called an element

<person>
  <name> Bart Simpson </name>
  <tel> 02 – 444 7777 </tel>
  <tel> 051 – 011 022 </tel>
  <email> [email protected] </email>
</person>

Callouts on the slide: “element”; “element, a sub-element of”; “not an element”

XML Document is a Tree

person
  name: Bart Simpson
  tel: 02 – 444 7777
  tel: 051 – 011 022
  email: [email protected]

Semistructured data models typically put the labels on the edges
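The tree view can be made concrete by walking a parsed document. This is a rough sketch with Python's standard xml.etree.ElementTree; the function name outline and the simplified phone numbers are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

def outline(elem, depth=0):
    """Return one indented line per element: the tag, plus its text if any."""
    text = (elem.text or "").strip()
    label = elem.tag + (": " + text if text else "")
    lines = ["  " * depth + label]
    for child in elem:  # iterating an element yields its sub-elements in order
        lines.extend(outline(child, depth + 1))
    return lines

doc = ("<person><name> Bart Simpson </name>"
       "<tel> 02 444 7777 </tel><tel> 051 011 022 </tel></person>")
tree_lines = outline(ET.fromstring(doc))
print("\n".join(tree_lines))
```

Here the labels sit on the nodes, as in the slide; an edge-labeled model would attach each tag to the edge leading into the node instead.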

Mixed Content

An element may contain a mixture of sub-elements and PCDATA

<airline>
  <name> British Airways </name>
  <motto> World’s <dubious> favorite</dubious> airline </motto>
</airline>
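How a program sees mixed content depends on the API. As one illustration, Python's standard xml.etree.ElementTree stores the text before the first sub-element in text and the text after a sub-element's closing tag in that sub-element's tail. The sketch below uses a plain apostrophe in the sample string for simplicity.

```python
import xml.etree.ElementTree as ET

doc = "<motto> World's <dubious> favorite</dubious> airline </motto>"
motto = ET.fromstring(doc)
dubious = motto.find("dubious")

before = motto.text      # PCDATA before the sub-element
inside = dubious.text    # PCDATA inside the sub-element
after = dubious.tail     # PCDATA after the sub-element's closing tag

# itertext() yields all text pieces in document order.
full_text = " ".join("".join(motto.itertext()).split())
print(repr(before), repr(inside), repr(after))
print(full_text)
```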

Needs for Mixed Content

• Mixed-content data is not typically generated from databases
• It is needed for consistency with HTML
• For example:

<html>
<head></head>
<body>
  Why can’t you find <i>dragons</i> in a restaurant?
  Because <b>smoking</b> is not allowed
</body>
</html>

A Complete XML Document

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE addresses SYSTEM "http://www.cs.huji.ac.il/~dbi/dbi-addresses.dtd">
<addresses>
  <person>
    <name>Lisa Simpson</name>
    <tel> 02-828-1234 </tel>
    <tel> 054-470-777 </tel>
    <email> [email protected] </email>
  </person>
</addresses>

The Header Tag

• <?xml version="1.0" encoding="UTF-8" standalone="yes/no"?>
• The attributes must appear in this order: version, encoding, standalone
• You can leave out the encoding attribute and the processor will use the UTF-8 default

Processing Instructions

<?xml version="1.0"?>
<?xml-stylesheet   href="doc.xsl" type="text/xsl"   ?>
<!DOCTYPE doc SYSTEM "doc.dtd">
<doc>Hello, world!<!-- Comment 1 --></doc>
<?pi-without-data     ?>
<!-- Comment 2 -->
<!-- Comment 3 -->

Two Ways of Representing a Relational Database in XML

projects: title, budget, managedBy
employees: name, ssn, age

Project and Employee relations in XML

<db>
  <project> <title> Pattern recognition </title> <budget> 10000 </budget> <managedBy> Joe </managedBy> </project>
  <employee> <name> Joe </name> <ssn> 344556 </ssn> <age> 34 </age> </employee>
  <employee> <name> Sandra </name> <ssn> 2234 </ssn> <age> 35 </age> </employee>
  <project> <title> Auto guided vehicle </title> <budget> 70000 </budget> <managedBy> Sandra </managedBy> </project>
  :
</db>

Projects and employees are intermixed

Employees follow projects

<db>
  <projects>
    <project> <title> Pattern recognition </title> <budget> 10000 </budget> <managedBy> Joe </managedBy> </project>
    <project> <title> Auto guided vehicles </title> <budget> 70000 </budget> <managedBy> Sandra </managedBy> </project>
    :
  </projects>
  <employees>
    <employee> <name> Joe </name> <ssn> 344556 </ssn> <age> 34 </age> </employee>
    <employee> <name> Sandra </name> <ssn> 2234 </ssn> <age> 35 </age> </employee>
    :
  </employees>
</db>

Callouts on the slide: “Projects” (the projects block) and “Employees” (the employees block)

Or without “separator” tags …

<db>
  <projects>
    <title> Pattern recognition </title> <budget> 10000 </budget> <managedBy> Joe </managedBy>
    <title> Auto guided vehicles </title> <budget> 70000 </budget> <managedBy> Sandra </managedBy>
    :
  </projects>
  <employees>
    <name> Joe </name> <ssn> 344556 </ssn> <age> 34 </age>
    <name> Sandra </name> <ssn> 2234 </ssn> <age> 35 </age>
    :
  </employees>
</db>

Can be done if it is clear where each employee and each project starts

Attributes

• An (opening) tag may contain attributes
• These are typically used to describe the contents of an element

<entry>
  <word language="en"> cheese </word>
  <word language="fr"> fromage </word>
  <word language="ro"> branza </word>
  <meaning> A food made … </meaning>
</entry>

Attributes (cont’d)

Another common use for attributes is to express dimension or type

<picture>
  <height dim="cm"> 2400 </height>
  <width dim="in"> 96 </width>
  <data encoding="gif" compression="zip"> M05-.+C$@02!G96YE&lt;FEC ... </data>
</picture>

Well-Formed Documents

A document that
  – obeys the “nested-tags” rule, and
  – does not repeat an attribute within a tag
is said to be well-formed
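The attribute rule is also enforced by parsers. A minimal sketch, again with Python's standard xml.etree.ElementTree; the attribute name type and the helper well_formedness_error are hypothetical and chosen only for this illustration.

```python
import xml.etree.ElementTree as ET

def well_formedness_error(text):
    """Return None if `text` is well-formed, otherwise the parser's message."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as err:
        return str(err)

ok = '<tel type="home"> 02-828-1234 </tel>'
repeated = '<tel type="home" type="mobile"> 02-828-1234 </tel>'

print(well_formedness_error(ok))
print(well_formedness_error(repeated))
```

The second document is rejected even though its tags are properly nested, because type appears twice in the same tag.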

Using Attributes

<addresses>
  <person friend="yes">
    <name> Jeff Cohen</name>
    <tel> 04-828-1345 </tel>
    <tel> 054-470-778 </tel>
    <email> [email protected] </email>
  </person>
  <person friend="no">
    <name> Irma Levy</name>
    <tel> 03-426-1142 </tel>
    <email>[email protected]</email>
  </person>
</addresses>

When to Use Attributes

• It’s not always clear when to use attributes

<person ssno="123 4589">
  <name> L. Simpson </name>
  <email> [email protected] </email>
  ...
</person>

<person>
  <ssno> 123 4589 </ssno>
  <name> L. Simpson </name>
  <email> [email protected] </email>
  ...
</person>

Using IDs

<person id="jeff" friend="yes" knows="irma">
  <name> Jeff Cohen</name>
  <tel> 04-828-1345 </tel>
  <tel> 054-470-778 </tel>
  <email> [email protected] </email>
</person>
<person id="irma" friend="no" knows="jeff">
  <name> Irma Levy</name>
  <tel> 03-426-1142 </tel>
  <email>[email protected]</email>
</person>

Callout on the slide: “ID attributes”

Using IDs

<family>
  <person id="lisa" mother="marge" father="homer"> <name> Lisa Simpson </name> </person>
  <person id="bart" mother="marge" father="homer"> <name> Bart Simpson </name> </person>
  <person id="marge" children="bart lisa"> <name> Marge Simpson </name> </person>
  <person id="homer" children="bart lisa"> <name> Homer Simpson </name> </person>
</family>
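One way to follow such references in a program is to index the elements by their id attribute first. The sketch below assumes, as in the family example, that id holds the identifier, mother holds a single reference, and children holds a space-separated list; the helper name_of is illustrative.

```python
import xml.etree.ElementTree as ET

family = ET.fromstring("""
<family>
  <person id="lisa" mother="marge" father="homer"><name> Lisa Simpson </name></person>
  <person id="bart" mother="marge" father="homer"><name> Bart Simpson </name></person>
  <person id="marge" children="bart lisa"><name> Marge Simpson </name></person>
  <person id="homer" children="bart lisa"><name> Homer Simpson </name></person>
</family>
""".strip())

# Index every person by its id so that references can be resolved by lookup.
by_id = {p.get("id"): p for p in family.findall("person")}

def name_of(person_id):
    return by_id[person_id].findtext("name").strip()

mother_name = name_of(by_id["lisa"].get("mother"))
children_of_homer = [name_of(i) for i in by_id["homer"].get("children").split()]

print(mother_name, children_of_homer)
```

A plain parser treats these attributes as ordinary strings; it is the application (or a validating parser with a DTD) that gives them the meaning of references.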

ODL Schema

class Movie
( extent Movies, key title )
{
  attribute string title;
  attribute string director;
  relationship set<Actor> casts
    inverse Actor::acted_In;
  attribute int budget;
};

class Actor
( extent Actors, key name )
{
  attribute string name;
  relationship set<Movie> acted_In
    inverse Movie::casts;
  attribute int age;
  attribute set<string> directed;
};

<db>
  <movie id="m1">
    <title>Waking Ned Divine</title>
    <director>Kirk Jones III</director>
    <cast idrefs="a1 a3"></cast>
    <budget>100,000</budget>
  </movie>
  <movie id="m2">
    <title>Dragonheart</title>
    <director>Rob Cohen</director>
    <cast idrefs="a2 a9 a21"></cast>
    <budget>110,000</budget>
  </movie>
  <movie id="m3">
    <title>Moondance</title>
    <director>Dagmar Hirtz</director>
    <cast idrefs="a1 a8"></cast>
    <budget>90,000</budget>
  </movie>
  :

class Movie
( extent Movies, key title )
{
  attribute string title;
  attribute string director;
  relationship set<Actor> casts
    inverse Actor::acted_In;
  attribute int budget;
};

class Actor
( extent Actors, key name )
{
  attribute string name;
  relationship set<Movie> acted_In
    inverse Movie::casts;
  attribute int age;
  attribute set<string> directed;
};

<db>
  :
  <actor id="a1">
    <name>David Kelly</name>
    <acted_In idrefs="m1 m3 m78"> </acted_In>
  </actor>
  <actor id="a2">
    <name>Sean Connery</name>
    <acted_In idrefs="m2 m9 m11"> </acted_In>
    <age>68</age>
  </actor>
  <actor id="a3">
    <name>Ian Bannen</name>
    <acted_In idrefs="m1 m35"> </acted_In>
  </actor>
  :
</db>

<db>
  <movie id="m1">
    <title>Waking Ned Divine</title>
    <director>Kirk Jones III</director>
    <cast idrefs="a1 a3"></cast>
    <budget>100,000</budget>
  </movie>
  <movie id="m2">
    <title>Dragonheart</title>
    <director>Rob Cohen</director>
    <cast idrefs="a2 a9 a21"></cast>
    <budget>110,000</budget>
  </movie>
  <movie id="m3">
    <title>Moondance</title>
    <director>Dagmar Hirtz</director>
    <cast idrefs="a1 a8"></cast>
    <budget>90,000</budget>
  </movie>
  :
  <actor id="a1">
    <name>David Kelly</name>
    <acted_In idrefs="m1 m3 m78"> </acted_In>
  </actor>
  <actor id="a2">
    <name>Sean Connery</name>
    <acted_In idrefs="m2 m9 m11"> </acted_In>
    <age>68</age>
  </actor>
  <actor id="a3">
    <name>Ian Bannen</name>
    <acted_In idrefs="m1 m35"> </acted_In>
  </actor>
  :
</db>

Part II: Document Type Definitions

Imposing Structure on XML Documents

Document Type Definitions

• Document Type Definitions (DTDs) impose structure on an XML document
• There is some relationship between a DTD and a schema, but it is not close – hence the need for additional “typing” systems (XML schemas)
• The DTD is a syntactic specification

Example: An Address Book

<person>
  <name> Homer Simpson </name>
  <greet> Dr. H. Simpson </greet>
  <addr>1234 Springwater Road </addr>
  <addr> Springfield USA, 98765 </addr>
  <tel> (321) 786 2543 </tel>
  <fax> (321) 786 2544 </fax>
  <tel> (321) 786 2544 </tel>
  <email> [email protected] </email>
</person>

Callouts on the slide:
• name: exactly one name
• greet: at most one greeting
• addr: as many address lines as needed (in order)
• tel, fax: mixed telephones and faxes
• email: as many as needed

Specifying the Structure

• name – to specify a name element
• greet? – to specify an optional (0 or 1) greet element
• name, greet? – to specify a name followed by an optional greet

Specifying the Structure (cont’d)

• addr* – to specify 0 or more address lines
• tel | fax – a tel or a fax element
• (tel | fax)* – 0 or more repeats of tel or fax
• email* – 0 or more email elements

Specifying the Structure (cont’d)

• So the whole structure of a person entry is specified by

  name, greet?, addr*, (tel | fax)*, email*

• This is known as a regular expression
• Why is it important?
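Because the content model is a regular expression over tag names, an ordinary regex engine can check it. The sketch below is one possible encoding, not how validating parsers are necessarily implemented: each child tag is written followed by a space, and the model name, greet?, addr*, (tel | fax)*, email* is translated by hand into a Python pattern. The names PERSON_MODEL and matches_person_model are made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of: name, greet?, addr*, (tel | fax)*, email*
# Every tag name is followed by one space so that names cannot run together.
PERSON_MODEL = re.compile(r"name (greet )?(addr )*((tel|fax) )*(email )*")

def matches_person_model(person):
    """Check the sequence of child tags of `person` against the model."""
    child_tags = "".join(child.tag + " " for child in person)
    return PERSON_MODEL.fullmatch(child_tags) is not None

valid = ET.fromstring(
    "<person><name/><greet/><addr/><addr/><tel/><fax/><tel/><email/></person>")
invalid = ET.fromstring("<person><greet/><name/></person>")

print(matches_person_model(valid))
print(matches_person_model(invalid))
```

The second person is rejected because greet appears before name, which the model does not allow.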

Regular Expressions

• Each regular expression determines a corresponding finite state automaton
• Let’s start with a simpler example:

  name, addr*, email

[Figure: finite state automaton with transitions labeled name, addr and email]

This suggests a simple parsing program
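Such a parsing program could look roughly like the following table-driven automaton for name, addr*, email. The state numbering is an assumption, since the original diagram is not available in this transcript: state 0 is the start, state 1 is reached after name and loops on addr, and state 2 is the accepting state reached after email.

```python
# Transition table for: name, addr*, email
TRANSITIONS = {
    (0, "name"): 1,   # the mandatory name
    (1, "addr"): 1,   # zero or more address lines
    (1, "email"): 2,  # the mandatory email ends the entry
}
ACCEPTING = {2}

def accepts(tags):
    """Run the automaton over a sequence of tag names."""
    state = 0
    for tag in tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:  # no transition for this tag: reject
            return False
    return state in ACCEPTING

print(accepts(["name", "addr", "addr", "email"]))
print(accepts(["name", "addr"]))
```

Each input tag is consumed with a single table lookup, so checking an element takes time linear in the number of its children.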

Another Example

name, address*, (tel | fax)*, email*

[Figure: finite state automaton for this expression, with transitions labeled name, address, tel, fax and email]

Adding in the optional greet further complicates things

Internal DTD For the Address Book

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE addressbook [
  <!ELEMENT addressbook (person*)>
  <!ELEMENT person (name, greet?, address*, (fax | tel)*, email*)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT greet (#PCDATA)>
  <!ELEMENT address (#PCDATA)>
  <!ELEMENT tel (#PCDATA)>
  <!ELEMENT fax (#PCDATA)>
  <!ELEMENT email (#PCDATA)>
]>

• The name of the DTD is addressbook
• “Internal” means that the DTD and the XML document are in the same file

Rest of the Address Book

<addressbook>
  <person>
    <name> Jeff Cohen </name>
    <greet> Dr. Cohen </greet>
    <email> [email protected] </email>
  </person>
</addressbook>

Our Relational DB Revisited

projects: title, budget, managedBy
employees: name, ssn, age

Two DTDs for the Relational DB

<!DOCTYPE db [
  <!ELEMENT db (projects, employees)>
  <!ELEMENT projects (project*)>
  <!ELEMENT employees (employee*)>
  <!ELEMENT project (title, budget, managedBy)>
  <!ELEMENT employee (name, ssn, age)>
  ...
]>

<!DOCTYPE db [
  <!ELEMENT db (project | employee)*>
  <!ELEMENT project (title, budget, managedBy)>
  <!ELEMENT employee (name, ssn, age)>
  ...
]>

Recursive DTDs

<!DOCTYPE genealogy [
  <!ELEMENT genealogy (person*)>
  <!ELEMENT person (name,
                    dateOfBirth,
                    person,      -- mother
                    person )>    -- father
  ...
]>

What is the problem with this? A parser does not notice it!

Each person should have a father and a mother. This leads to either infinite data or a person that is a descendant of himself.

Recursive DTDs (cont’d)

<!DOCTYPE genealogy [
  <!ELEMENT genealogy (person*)>
  <!ELEMENT person (name,
                    dateOfBirth,
                    person?,     -- mother
                    person? )>   -- father
  ...
]>

What is now the problem with this?

If a person has only a father, how can you tell that he has a father and does not have a mother?

Some Things are Hard to Specify

Each employee element is to contain name, age and ssn elements in some order

<!ELEMENT employee ( (name, age, ssn) | (age, ssn, name) | (ssn, name, age) | ... )>

Suppose there were many more fields!

Some Things are Hard to Specify (cont’d)

<!ELEMENT employee ( (name, age, ssn) | (age, ssn, name) | (ssn, name, age) | ... )>

Suppose there were many more fields!

There are n! different orders of n elements

It is not even polynomial
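To get a feeling for how fast n! grows, the following sketch generates the "any order" content model mechanically and counts its alternatives. The function any_order_model is a hypothetical helper written for this illustration.

```python
from itertools import permutations
from math import factorial

def any_order_model(fields):
    """Build a DTD content model that lists every ordering of `fields`."""
    alternatives = ["(" + ", ".join(p) + ")" for p in permutations(fields)]
    return "(" + " | ".join(alternatives) + ")"

model = any_order_model(["name", "age", "ssn"])
print(f"<!ELEMENT employee {model}>")

# Number of alternatives needed for 3 fields and for 10 fields.
print(factorial(3), factorial(10))
```

Three fields need 6 alternatives, while ten fields already need 3,628,800, so writing the declaration by hand quickly becomes impractical.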

General Definitions of Elements

• ANY – tells that the element can have any content
• EMPTY – tells that the element has no content

Summary of XML regular expressions

• A – the tag A occurs
• e1,e2 – the expression e1 followed by e2
• e* – 0 or more occurrences of e
• e? – optional: 0 or 1 occurrences
• e+ – 1 or more occurrences
• e1 | e2 – either e1 or e2
• (e) – grouping

Deterministic Requirement

• Deterministic element-type declarations are easier to process: each child element can be matched against the content model without looking ahead
• Formally, the Glushkov automaton is deterministic
• The states of this automaton are the positions of the regular expression (semantic actions)
• The transitions are based on the “follows set”

Deterministic Requirement (cont’d)

• The associated automata are succinct
• A regular language may not have an associated deterministic grammar, e.g.,

  <!ELEMENT ndeter ((movie|director)*, movie, (movie|director))>

Specifying Attributes in the DTD

<!ELEMENT height (#PCDATA)>
<!ATTLIST height
  dimension CDATA #REQUIRED
  accuracy  CDATA #IMPLIED>

• The dimension attribute is required
• The accuracy attribute is optional
• CDATA is the “type” of the attribute – it means string, and may take any literal string as a value

Specifying ID and IDREF Attributes

<!DOCTYPE family [
  <!ELEMENT family (person)*>
  <!ELEMENT person (name)>
  <!ELEMENT name (#PCDATA)>
  <!ATTLIST person
    id       ID     #REQUIRED
    mother   IDREF  #IMPLIED
    father   IDREF  #IMPLIED
    children IDREFS #IMPLIED>
]>

Specifying ID and IDREF Attributes (cont’d)

• The attributes mother and father are references to IDs of other elements
• However, those are not necessarily person elements!
• The mother attribute is not necessarily a reference to a female person

References to IDs have no type

Some Conforming Data

<family>
  <person id="lisa" mother="marge" father="homer"> <name> Lisa Simpson </name> </person>
  <person id="bart" mother="marge" father="homer"> <name> Bart Simpson </name> </person>
  <person id="marge" children="bart lisa"> <name> Marge Simpson </name> </person>
  <person id="homer" children="bart lisa"> <name> Homer Simpson </name> </person>
</family>

Consistency of ID and IDREF Attribute Values

• If an attribute is declared as ID
  – the associated values must all be distinct (no confusion)
• If an attribute is declared as IDREF
  – the associated value must exist as the value of some ID attribute (no dangling “pointers”)
• Similarly for all the values of an IDREFS attribute
• ID and IDREF attributes are not typed
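These two consistency rules are normally checked by a validating parser. Python's standard xml.etree.ElementTree does not validate against a DTD, so the sketch below checks the rules by hand. It hard-codes which attributes play the ID, IDREF and IDREFS roles (taken from the family DTD); that hard-coding and the function check_references are assumptions for illustration, since a real validator would read the roles from the ATTLIST declarations.

```python
import xml.etree.ElementTree as ET

# Attribute roles, copied by hand from the family DTD.
ID_ATTRS = {"id"}
IDREF_ATTRS = {"mother", "father"}
IDREFS_ATTRS = {"children"}

def check_references(root):
    """Return a list of ID/IDREF consistency problems found under `root`."""
    problems = []
    seen = set()
    # First pass: all ID values must be distinct.
    for elem in root.iter():
        for attr, value in elem.attrib.items():
            if attr in ID_ATTRS:
                if value in seen:
                    problems.append(f"duplicate ID: {value}")
                seen.add(value)
    # Second pass: every reference must point at an existing ID.
    for elem in root.iter():
        for attr, value in elem.attrib.items():
            if attr in IDREF_ATTRS:
                refs = [value]
            elif attr in IDREFS_ATTRS:
                refs = value.split()
            else:
                continue
            for ref in refs:
                if ref not in seen:
                    problems.append(f"dangling reference: {ref}")
    return problems

good = ET.fromstring(
    '<family><person id="bart" mother="marge"/>'
    '<person id="marge" children="bart"/></family>')
bad = ET.fromstring(
    '<family><person id="bart" mother="marge"/><person id="bart"/></family>')

print(check_references(good))
print(check_references(bad))
```

Note that the check never looks at what kind of element a reference points to, which mirrors the remark that ID and IDREF attributes are not typed.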

A Useful Abbreviation

When an element has empty content we can use
• <br/> for <br></br>
• <hr width="10"/> for <hr width="10"></hr>

For example:

<family>
  <person id="lisa">
    <name> Lisa Simpson </name>
    <mother idref="marge"/>
    <father idref="homer"/>
  </person>
  ...
</family>

An Alternative Specification

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE family [
  <!ELEMENT family (person)*>
  <!ELEMENT person (name, mother?, father?, children?)>
  <!ATTLIST person id ID #REQUIRED>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT mother EMPTY>
  <!ATTLIST mother idref IDREF #REQUIRED>
  <!ELEMENT father EMPTY>
  <!ATTLIST father idref IDREF #REQUIRED>
  <!ELEMENT children EMPTY>
  <!ATTLIST children idrefs IDREFS #REQUIRED>
]>

The Revised Data

<family>
  <person id="marge">
    <name> Marge Simpson </name>
    <children idrefs="bart lisa"/>
  </person>
  <person id="homer">
    <name> Homer Simpson </name>
    <children idrefs="bart lisa"/>
  </person>
  <person id="bart">
    <name> Bart Simpson </name>
    <mother idref="marge"/>
    <father idref="homer"/>
  </person>
  <person id="lisa">
    <name> Lisa Simpson </name>
  </person>
</family>

ODL Schema

class Movie
( extent Movies, key title )
{
  attribute string title;
  attribute string director;
  relationship set<Actor> cast
    inverse Actor::acted_In;
  attribute int budget;
};

class Actor
( extent Actors, key name )
{
  attribute string name;
  relationship set<Movie> acted_In
    inverse Movie::cast;
  attribute int age;
  attribute set<string> directed;
};

Schema.dtd

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE db [
  <!ELEMENT db (movie+, actor+)>
  <!ELEMENT movie (title, director, cast, budget)>
  <!ATTLIST movie id ID #REQUIRED>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT director (#PCDATA)>
  <!ELEMENT cast EMPTY>
  <!ATTLIST cast idrefs IDREFS #REQUIRED>
  <!ELEMENT budget (#PCDATA)>

The DTD continues in the next slide

Schema.dtd (cont’d)

  <!ELEMENT actor (name, acted_In, age?, directed*)>
  <!ATTLIST actor id ID #REQUIRED>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT acted_In EMPTY>
  <!ATTLIST acted_In idrefs IDREFS #REQUIRED>
  <!ELEMENT age (#PCDATA)>
  <!ELEMENT directed (#PCDATA)>
]>

Data

<db>
  <movie id="ohgod">
    <title> Oh God!</title>
    <director> Woody Allen </director>
    <cast idrefs="burns"></cast>
    <budget> $2M </budget>
  </movie>
  <actor id="burns">
    <name> George Burns </name>
    <acted_In idrefs="ohgod" />
  </actor>
</db>

Constraints on IDs and IDREFs

• ID stands for identifier
  – No two ID attributes may have the same value (the value must be an XML name, not an arbitrary CDATA string)
• IDREF stands for identifier reference
  – Every value associated with an IDREF attribute must exist as an ID attribute value
• IDREFS specifies several (1 or more) identifiers, separated by white space

Adding a DTD to the Document

• A DTD can be internal
  – The DTD is part of the document file
• or external
  – The DTD and the document are in separate files
  – An external DTD may reside
    • In the local file system (where the document is)
    • In a remote file system

Connecting a Document with its DTD

• An internal DTD:
  <?xml version="1.0"?>
  <!DOCTYPE db [<!ELEMENT ...> … ]>
  <db> ... </db>
• A DTD from the local file system:
  <!DOCTYPE db SYSTEM "schema.dtd">
• A DTD from a remote file system:
  <!DOCTYPE db SYSTEM "http://www.schemaauthority.com/schema.dtd">

Well-formed and Valid Documents

• A document (with or without a DTD) is well-formed if it has
  – proper nesting of tags and unique attributes
• A valid document conforms to the DTD, i.e.,
  – the document conforms to the regular-expression grammar,
  – types of attributes are correct, and
  – constraints on references are satisfied

DTDs vs. Schemas (or Types)

• DTDs are rather weak specifications by DB & programming-language standards
  – Only one base type – PCDATA
  – No useful “abstractions”, e.g., sets
  – IDREFs are untyped – the type of the object being referenced is not known
  – No constraints, e.g., child is inverse of parent
  – No methods
  – Tag definitions are global
• Some extensions of XML impose a schema or types on an XML document

We may see these later


Part III: Entities

To Take Storage into Account

What are Entities?

• An entity is a shortcut to a set of information
• You might think of an entity as being a bit like a macro
• Entities allow a document to be divided across several storage units (for example, separate files)

Why Use Entities

• Entities allow sharing data between documents
• Entities save typing
• Entities can reduce errors
• Entities are easy to update
• Entities can act as placeholders for TBD (to be determined) information

Defining Entities

• Entities can be defined
  – in the local document as part of the DOCTYPE definition
  – with a link to external files that contain the entity data (this, too, is done through the DOCTYPE definition)
  – in an external DTD
• Define locally when the entity is being used only in one particular document
• Define by a link to an external file when the entity is being used in many documents

Kinds of Entities

There are two kinds of entities:
• General entities
  – For usage in documents
• Parameter entities
  – For usage in declarations

General entities

• A general entity is defined in the DTD by

  <!ENTITY Name EntityDefinition >

• The entity is used in the document by writing

  &Name;


Example

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mdb [
  <!ENTITY bm "bad movie">
  <!ELEMENT mdb (movie+)>
  <!ELEMENT movie (title,director,cast?,budget)>
]>
<mdb>
  <movie id="ohgod" opinion="&bm;">
    <title> Oh God!</title>
    <director> Woody Allen </director>
    <budget> $2M </budget>
  </movie>
</mdb>
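To see how a parser expands the general entity above, a minimal Python sketch using the standard library's xml.etree.ElementTree may help. The document is a trimmed copy of the slide's example, and the function name opinion_of is hypothetical, introduced here only for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the slide's document: the internal DTD subset declares
# the general entity "bm", and the attribute value refers to it as &bm;.
DOC = """<!DOCTYPE mdb [
  <!ENTITY bm "bad movie">
]>
<mdb>
  <movie id="ohgod" opinion="&bm;">
    <title>Oh God!</title>
  </movie>
</mdb>"""


def opinion_of(xml_text, movie_id):
    """Return the expanded 'opinion' attribute of the movie with the given id.

    Hypothetical helper: returns None when no movie has that id.
    """
    root = ET.fromstring(xml_text)
    for movie in root.findall("movie"):
        if movie.get("id") == movie_id:
            # The parser has already replaced &bm; with its definition.
            return movie.get("opinion")
    return None


if __name__ == "__main__":
    print(opinion_of(DOC, "ohgod"))
```

The application never sees the reference &bm; itself: the parser substitutes the entity's replacement text before handing the attribute value over.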


Browser View


Unparsed Entities

<!DOCTYPE mdb [
  <!NOTATION gif SYSTEM "c:\Program Files\Netscape\Communicator\Program\Netscape.exe">
  <!ENTITY starpicture SYSTEM "http://www.cs.huji.ac.il/~dbi/figures/star.gif" NDATA gif>
  <!ENTITY bm "bad movie">
  <!ELEMENT mdb (movie+)>
  <!ELEMENT movie (title,director, budget)>
  <!ATTLIST movie id ID #REQUIRED
                  opinion CDATA #IMPLIED
                  starimage ENTITY #IMPLIED>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT director (#PCDATA)>
  <!ELEMENT budget (#PCDATA)>
]>

Slide callouts: “Entities are defined”; “Types are defined”


Data

<mdb>

<movie id="ohgod" opinion="&bm;" starimage="starpicture">

<title> Oh God!</title>

<director> Woody Allen </director>

<budget> $2M </budget>

</movie>

</mdb>
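An unparsed entity is not expanded by the parser; it is only reported, together with its notation, so that the application can decide what to do with it. A minimal Python sketch using the standard library's xml.parsers.expat illustrates this. The function name unparsed_entities is hypothetical, and the sketch assumes a non-validating parse, so the image itself is never fetched.

```python
import xml.parsers.expat

# The slide's DTD and data combined into one document (raw string so the
# backslashes in the notation's system identifier are kept literally).
DOC = r"""<!DOCTYPE mdb [
  <!NOTATION gif SYSTEM "c:\Program Files\Netscape\Communicator\Program\Netscape.exe">
  <!ENTITY starpicture SYSTEM "http://www.cs.huji.ac.il/~dbi/figures/star.gif" NDATA gif>
  <!ENTITY bm "bad movie">
  <!ELEMENT mdb (movie+)>
  <!ELEMENT movie (title, director, budget)>
  <!ATTLIST movie id ID #REQUIRED
                  opinion CDATA #IMPLIED
                  starimage ENTITY #IMPLIED>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT director (#PCDATA)>
  <!ELEMENT budget (#PCDATA)>
]>
<mdb>
  <movie id="ohgod" opinion="&bm;" starimage="starpicture">
    <title>Oh God!</title>
    <director>Woody Allen</director>
    <budget>$2M</budget>
  </movie>
</mdb>"""


def unparsed_entities(xml_text):
    """Map each unparsed entity name to (system identifier, notation name)."""
    found = {}
    parser = xml.parsers.expat.ParserCreate()

    def on_unparsed(name, base, system_id, public_id, notation_name):
        # Called once per <!ENTITY ... NDATA ...> declaration.
        found[name] = (system_id, notation_name)

    parser.UnparsedEntityDeclHandler = on_unparsed
    parser.Parse(xml_text, True)
    return found


if __name__ == "__main__":
    print(unparsed_entities(DOC))
```

The attribute starimage="starpicture" carries only the entity's name; the application looks that name up in a table like the one returned here.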


Parameter Entities

• Parameter entities are used only within DTDs
• They carry information for use in the markup declaration
– Internal entities: references are within the DTD
– External entities: references draw information from outside files

• Parameter entity declaration:

<!ENTITY % Name EntityDefinition >


Parameter Entity Example

<?xml version="1.0" encoding="UTF-8"?>
<!ENTITY % essential "name, tel*">
<!ELEMENT email (#PCDATA)>
<!ELEMENT tel (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT person (%essential;, email, advisor?)>
<!ATTLIST person friend (yes | no) #IMPLIED
                 id ID #REQUIRED
                 knows IDREFS #IMPLIED>
<!ELEMENT advisor (person)>
<!ELEMENT addresses (person)*>
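The effect of %essential; on the person declaration can be pictured as plain text substitution. The following Python sketch is a toy expansion over two of the declarations above, not a real DTD processor: it assumes double-quoted entity values and no nested references, and the names PE_DECL, PE_REF and expand_parameter_entities are hypothetical.

```python
import re

# Two declarations taken from the slide's DTD.
DTD = """<!ENTITY % essential "name, tel*">
<!ELEMENT person (%essential;, email, advisor?)>"""

# Matches a parameter entity declaration with a double-quoted value.
PE_DECL = re.compile(r'<!ENTITY\s+%\s+([\w.-]+)\s+"([^"]*)"\s*>')
# Matches a parameter entity reference such as %essential;
PE_REF = re.compile(r'%([\w.-]+);')


def expand_parameter_entities(dtd_text):
    """Return the DTD text with each %Name; replaced by its declared value.

    Toy illustration only: a reference to an undeclared entity raises
    KeyError, and nested references are not expanded.
    """
    values = dict(PE_DECL.findall(dtd_text))
    # Drop the entity declarations, then substitute the references.
    body = PE_DECL.sub("", dtd_text)
    return PE_REF.sub(lambda match: values[match.group(1)], body).strip()


if __name__ == "__main__":
    print(expand_parameter_entities(DTD))
```

After expansion the content model of person reads (name, tel*, email, advisor?), which is what a DTD processor effectively works with.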


Defining Entities

• Local Definition:

<!DOCTYPE PRESSRELEASE [
  <!ENTITY copyright "Copyright 2000, As The World Spins Corp. All
  rights reserved. Please do not copy or use without
  authorization. For authorization contact
  [email protected].">
]>

• Global Definition:

<!DOCTYPE PRESSRELEASE [
  <!ENTITY copyright SYSTEM "http://www.worldspins.com/legal/copyright.xml">
]>


Example

<?xml version="1.0"?>
<!DOCTYPE PRESSRELEASE [
  <!ENTITY copyright "Copyright 2000, As The World Spins Corp. All
  rights reserved. Please do not copy or use without authorization.
  For authorization contact [email protected].">
  <!ENTITY trademark SYSTEM "http://www.worldspins.com/legal/trademark.xml">
]>


Example (cont’d)

<PRESSRELEASE>
  <HEAD>Mini-globe revolutionizes keychain industry</HEAD>
  <LEAD>Today As The World Spins introduces a new approach to keychains. With the new MINI-GLOBE, keys can be kept inside a chain, called for upon demand, and stored safely. Never more will consumers lose a key or stand at a door flipping through a stack of keys seeking the right one.</LEAD>
  <LEGAL>&trademark;&copyright;</LEGAL>
</PRESSRELEASE>
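The difference between the internal entity &copyright; and the external entity &trademark; shows up in how a parser resolves them: the first is expanded from the DTD, while the second must be fetched by the application. The Python sketch below uses the standard library's xml.parsers.expat and resolves the external entity from an in-memory table rather than from the network. The table RESOURCES, its trademark text, the shortened copyright text and the function name expanded_text are all assumptions made for illustration.

```python
import xml.parsers.expat

# Shortened version of the slide's press release (copyright text abbreviated).
DOC = """<!DOCTYPE PRESSRELEASE [
  <!ENTITY copyright "Copyright 2000, As The World Spins Corp.">
  <!ENTITY trademark SYSTEM "http://www.worldspins.com/legal/trademark.xml">
]>
<PRESSRELEASE><LEGAL>&trademark;&copyright;</LEGAL></PRESSRELEASE>"""

# Hypothetical stand-in for the remote file; the real content is unknown.
RESOURCES = {
    "http://www.worldspins.com/legal/trademark.xml": "MINI-GLOBE is a trademark. ",
}


def expanded_text(xml_text, resources):
    """Return the document's character data with both entities resolved."""
    pieces = []
    parser = xml.parsers.expat.ParserCreate()

    def on_text(data):
        pieces.append(data)

    def on_external(context, base, system_id, public_id):
        # Parse the external entity's replacement text with a sub-parser.
        sub = parser.ExternalEntityParserCreate(context)
        sub.CharacterDataHandler = on_text
        sub.Parse(resources[system_id], True)
        return 1  # a non-zero result tells expat the entity was handled

    parser.CharacterDataHandler = on_text
    parser.ExternalEntityRefHandler = on_external
    parser.Parse(xml_text, True)
    return "".join(pieces)


if __name__ == "__main__":
    print(expanded_text(DOC, RESOURCES))
```

Updating the shared file behind the system identifier changes every document that references &trademark;, which is the sharing benefit the slides describe.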


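Namespaces

• A namespace is a standard vocabulary of element and attribute names (it plays the role of a standard DTD here, although a namespace does not itself define structure)

• More than one namespace can be used in the same XML document
– Different elements of a given document may conform to different namespaces

• Declaring the namespaces
– Each namespace is identified by a URI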


Example

• Defining the used namespace

<document xmlns:dbi='http://www.cs.huji.ac.il/dbi-schema'>

• Using a tag from the namespace

<dbi:A>This is the text of an element A according to dbi’s definition</dbi:A>

• Using a tag not from the namespace

<A>This will probably be understood as an anchor</A>
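A namespace-aware parser keeps the two A elements apart by attaching the namespace URI to the qualified one. The following Python sketch, using the standard library's xml.etree.ElementTree, shows this on the example above; the closing </document> tag is added so that the fragment is well-formed, and the function name child_tags is hypothetical.

```python
import xml.etree.ElementTree as ET

# The slide's fragments combined into one well-formed document.
DOC = """<document xmlns:dbi='http://www.cs.huji.ac.il/dbi-schema'>
  <dbi:A>This is the text of an element A according to dbi's definition</dbi:A>
  <A>This will probably be understood as an anchor</A>
</document>"""


def child_tags(xml_text):
    """Return the expanded tag names of the root element's children."""
    return [child.tag for child in ET.fromstring(xml_text)]


if __name__ == "__main__":
    for tag in child_tags(DOC):
        print(tag)
```

ElementTree writes a qualified name as {URI}local-name, so the prefix dbi itself is irrelevant; only the URI it is bound to identifies the namespace.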


DTD’s


The Data File


The Data File: shorthands


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<container xmlns:bi="www.cs.technion.ac.il/~oshmu/container.dtd">
  <bi:bdb xmlns:bi="www.cs.technion.ac.il/~oshmu/nss.dtd">
    <bi:book>
      <title> Godzila</title>
      <author>Jeff Cohen </author>
    </bi:book>
    <bk:book xmlns:bk="www.cs.technion.ac.il/~oshmu/namespaces.dtd">
      <title>A Suitable Boy</title>
      <price currency="US Dollar">22.95</price>
    </bk:book>
  </bi:bdb>
</container>


Using CDATA

<HEAD1>
Entering a Kennel Club Member
</HEAD1>

<DESCRIPTION>Enter the member by the name on his or her papers. Use the NAME tag. The NAME tag has two attributes. Common (all in lowercase, please!) is the dog's call name. Breed (also in all lowercase) is the dog's breed. Please see the breed reference guide for acceptable breeds. Your entry should look something like this:
</DESCRIPTION>

<EXAMPLE>
<![CDATA[<NAME common="freddy" breed="springer-spaniel">Sir Fredrick of Ledyard's End</NAME>]]>
</EXAMPLE>

Slide callout: we want to see the text as is, even though it includes tags
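The effect of a CDATA section can be checked with a short Python sketch using the standard library's xml.etree.ElementTree: the markup inside the section reaches the application as ordinary text, and no NAME element is created. The function name example_text is hypothetical, and the attribute quoting in the sample is written as breed="springer-spaniel".

```python
import xml.etree.ElementTree as ET

# The EXAMPLE element from the slide, with its CDATA section.
DOC = (
    "<EXAMPLE><![CDATA["
    '<NAME common="freddy" breed="springer-spaniel">'
    "Sir Fredrick of Ledyard's End</NAME>"
    "]]></EXAMPLE>"
)


def example_text(xml_text):
    """Return (text content, number of child elements) of the root element."""
    root = ET.fromstring(xml_text)
    return root.text, len(root)


if __name__ == "__main__":
    text, child_count = example_text(DOC)
    print(text)
    print(child_count)
```

Because the parser does not interpret anything between the CDATA delimiters, the text inside them does not even have to be well-formed markup.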


Summary

• XML is a new data format. Its main virtues:
– widespread acceptance
– the (important) ability to handle semistructured data (data without schema)

• DTDs provide some useful syntactic constraints on documents. As schemas they are weak

• How to store large XML documents?
• How to query them?
• How to map between XML and other representations?