Dec 19, 2015

Everything you ever wanted to know about DTDs but were afraid to ask

how to create your XML start-up, go public and make money.

Arnaud Sahuguet
Penn Database Research Group
University of Pennsylvania


!! Self Promotion !!
• XMill*: XML compression
• XDuce: XML programming language
• Silk Route*: XML views on top of relational data
• XML-QL*: XML query language
• W4F: HTML to XML translation


What this talk is all about

For a change, it would be nice, when speaking about XML, to look at the world as it is.

This on-going work simply looks at how people are expressing the structure of XML documents using DTDs.

There are two kinds of people.
Those who look at the world as it is and wonder “why?”.
And those who look at the world as it could be and wonder “why not?”.

(R. Kennedy)


The XML map

[Figure: XML users, spanning research and industry]


Everything you ever wanted to know about .... how to create your XML start-up, go public and make money.

$ how to create your XML start-up

$ raise funds

$ go public

$ and make money


1. Pick a cool name

www.mind-your-own-dtd.com

DTD

and a cool logo


2. Build the right team

• Sanjeev Khanna, PhD: VP Algorithms
• Byron Choi: Chief Technology Officer
• Peter Buneman, PhD: Chief Scientist, XML Evangelist
• Arnaud Sahuguet: Founder, CEO, XML Inquisitor

(XML knowledge is a +)


3. Next steps (left as an exercise)

• Raise funds
• Spend it all in advertising
• Raise funds again (second round)
• Go public
• Sell your stock
• Quit and repeat from step 1

single atomic transaction


More seriously: outline
• What are DTDs
• Why bother
• The survey
• What’s wrong with DTDs
• Replacements
• Future Work


SGML DTDs [ISO 8879]

A document type definition specifies:
• the generic identifiers (GIs) of elements that are permissible in a document of this type
• for each GI, the possible attributes, their range of values and defaults
• for each GI, the structure of its contents, including:
  – which elements can occur and in what order
  – whether text characters can occur
  – whether non-character data can occur

Bottom line: the purpose of a DTD is to make it possible to determine whether the mark-up of an individual document is correct, and also to supply mark-up that is missing because it can be inferred unambiguously from other mark-up present.


XML DTDs

Element declarations:
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT metadata EMPTY>
  <!ELEMENT bookmark (title?,info,desc?,(bookmark|folder))*>

Attribute declarations:
  <!ATTLIST alias ref IDREF #REQUIRED>
  <!ATTLIST folder folded (yes|no) #FIXED "yes">

Entity references (general entities):
  <!ENTITY amp "&#38;#38;">
  <!ENTITY lt "&#38;#60;">
  <!ENTITY gt "&#62;">
  <!ENTITY diff "&lt;&gt;">

Entity references are referenced using &entityName;
Entity references are like constants that can be used in the XML document.

Entity parameters (parameter entities):
  <!ENTITY % url.att
    "href CDATA #REQUIRED
     visited CDATA #IMPLIED
     modified CDATA #IMPLIED
     %local.url.att;">

Entity parameters are referenced using %entityName;
Entity parameters are like macros that can be used only in the DTD itself.
They are not available to XML documents.

Other stuff I will deliberately and happily ignore:
• Processing instructions: <? … ?>
• Notations
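
Since entity parameters behave like macros, their effect can be sketched as plain text substitution. The Python below is only an illustration under stated assumptions, not a conforming XML parser: the function name `expand_parameter_entities` and both regular expressions are made up for this example, only double-quoted internal declarations are handled, and the first declaration of a name is the one that is kept.

```python
import re

# Hypothetical helper patterns: <!ENTITY % name "value"> and %name;
PE_DECL = re.compile(r'<!ENTITY\s+%\s+([\w.\-]+)\s+"([^"]*)"\s*>')
PE_REF = re.compile(r'%([\w.\-]+);')


def expand_parameter_entities(dtd, max_passes=10):
    """Textually expand %name; references, macro style."""
    table = {}
    for name, value in PE_DECL.findall(dtd):
        # Keep the first declaration of each name; later ones are ignored.
        table.setdefault(name, value)
    # Drop the declarations themselves, then substitute until stable.
    body = PE_DECL.sub("", dtd)
    for _ in range(max_passes):
        expanded = PE_REF.sub(lambda m: table.get(m.group(1), m.group(0)), body)
        if expanded == body:
            break
        body = expanded
    return body


dtd = '''<!ENTITY % local.url.att "">
<!ENTITY % url.att "href CDATA #REQUIRED %local.url.att;">
<!ATTLIST bookmark %url.att;>'''
print(expand_parameter_entities(dtd).strip())
```

The `max_passes` bound is a simple guard against self-referencing declarations; a real processor rejects those outright.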


XML DTDs (cont’d)
• Elements are defined by a content-model
  – sequence (“,”) or alternation (“|”) of sub-elements
  – “?” (optional), “*” (zero-or-more), “+” (one-or-more)
  – EMPTY (no sub-elements), ANY (any), #PCDATA
  – mixed-content (alternation of #PCDATA and something else)
• Attributes can be CDATA (text), NMTOKEN (tokens) or enumeration.
• Attributes can be optional (#IMPLIED) or mandatory (#REQUIRED).
• Attributes can be constant (#FIXED) and can be assigned a default value (a quoted literal in the declaration).
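
A content-model built from “,”, “|”, “?”, “*” and “+” is a regular expression over element names, so checking a list of children against it can be sketched with an ordinary regex engine. The helper names below (`content_model_to_regex`, `accepts`) are hypothetical, and the sketch deliberately ignores EMPTY, ANY, #PCDATA and mixed content.

```python
import re

NAME = re.compile(r"[A-Za-z_][\w.\-]*")


def content_model_to_regex(model):
    """Translate an element-only content-model into a Python regex.

    Each element name becomes a token followed by one space, so that
    'a' can never match the beginning of 'ab'.
    """
    compact = re.sub(r"\s+", "", model)
    pattern = NAME.sub(lambda m: "(?:" + re.escape(m.group(0)) + " )", compact)
    # A sequence is plain concatenation in a regex, so the commas just go away.
    return re.compile(pattern.replace(",", ""))


def accepts(model, children):
    """True if the sequence of child element names matches the model."""
    text = "".join(name + " " for name in children)
    return content_model_to_regex(model).fullmatch(text) is not None


print(accepts("(title?, info?, desc?)", ["title", "desc"]))
print(accepts("(title?, info?, desc?)", ["desc", "title"]))
```

The first call should report a match and the second should not, because a sequence fixes the order of its members.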


A DTD for bookmarks (XBel)

<!NOTATION jpg PUBLIC "-//JPG">
<!ENTITY folder SYSTEM "folder.jpg" NDATA jpg>
<!ENTITY bookmark SYSTEM "bookmark.jpg" NDATA jpg>

<!ENTITY % local.node.att "">
<!ENTITY % local.url.att "">
<!ENTITY % local.nodes.mix "">
<!ENTITY % node.att
  "id ID #IMPLIED
   added CDATA #IMPLIED
   %local.node.att;">
<!ENTITY % url.att
  "href CDATA #REQUIRED
   visited CDATA #IMPLIED
   modified CDATA #IMPLIED
   %local.url.att;">
<!ENTITY % nodes.mix
  "bookmark|folder|alias|separator
   %local.nodes.mix;">

<!ELEMENT xbel (title?, info?, desc?, (%nodes.mix;)*)>
<!ATTLIST xbel
  %node.att;
  version CDATA #FIXED "1.0">

<!ELEMENT title (#PCDATA)>
<!ELEMENT info (metadata+)>
<!ELEMENT metadata EMPTY>
<!ATTLIST metadata
  owner CDATA #REQUIRED>

<!ELEMENT folder (title?, info?, desc?, (%nodes.mix;)*)>
<!ATTLIST folder
  %node.att;
  folded (yes|no) #FIXED 'yes'>

<!ELEMENT bookmark (title?, info?, desc?)>
<!ATTLIST bookmark
  %node.att;
  %url.att;>

<!ELEMENT desc (#PCDATA)>
<!ELEMENT separator EMPTY>
<!ELEMENT alias EMPTY>
<!ATTLIST alias
  ref IDREF #REQUIRED>


Insightful Analogy

C                   XML
type-checking       validation
constants           entity reference
macros              entity parameter
void*               ANY
void*               IDREF
header file         DTD
#ifdef              conditional section
standard library    key entities
namespace           namespace

Note that features like inheritance, type inference, polymorphism and modules are missing.


Why bother about DTDs?
• XML exists without DTDs (well-formed/valid)
• DB people do not pay attention to them anyway

Projects (as far as I know)

STORED            Ignored. Data-mining instead.
Wisc. (VLDB99)    Used after simplification.
XML-QL            Not used for optimization.
XMill             Not used.
CWI’s paper       Not used.
Xyleme            for document clustering + query formulation.
Niagara           for document clustering + query formulation.
Excelon           “DTD agnostic” (off-the-record)
Lore              Dataguides.
W4F               small DTD subset supported


Information from DTD IS useful.
• Validation
  – e.g. runtime guarantees
• Query optimization
  – e.g. resolution of wildcards in regular path expressions (role similar to dataguides)
• Storage/Compression
  – e.g. enum values for attributes, #elements, etc.
• Meta-information
  – documentation, retrieval, clustering
• DTD as a type system
  – language binding
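
To make the query-optimization point concrete, here is a minimal sketch of how the parent/child relation declared in a DTD can turn a wildcard step such as xbel//metadata into an explicit list of paths. The dictionary `CHILDREN` is a hand-written, simplified stand-in for what the XBel content-models allow, and `resolve_descendant` is a hypothetical name; the depth bound is needed because folder can contain folder.

```python
# Simplified, hand-written child relation (an assumption for illustration).
CHILDREN = {
    "xbel": ["title", "info", "desc", "bookmark", "folder"],
    "folder": ["title", "info", "desc", "bookmark", "folder"],
    "bookmark": ["title", "info", "desc"],
    "info": ["metadata"],
}


def resolve_descendant(children, start, target, max_depth):
    """List every path start/.../target with at most max_depth steps."""
    results = []

    def walk(path):
        if len(path) - 1 >= max_depth:
            return
        for child in children.get(path[-1], []):
            new_path = path + [child]
            if child == target:
                results.append("/".join(new_path))
            walk(new_path)

    walk([start])
    return results


for path in resolve_descendant(CHILDREN, "xbel", "metadata", 3):
    print(path)
```

A query processor could then evaluate only the listed paths instead of scanning every descendant, which is the role the slide compares to dataguides.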


The survey: methodology

• Over 60 DTDs reviewed.
• + Penn Web site where you can submit your own.

Pipeline: Harvesting → Cleansing → Normalization → Mining → Visualization
• Harvesting: from xml.org
• Cleansing: missing elements, typos, etc. (BY HAND)
• Normalization: expansion of entities, translation into our internal data-model
• Tools: AT&T GraphViz, Java, Bash, sgrep

Patent Pending


What we have been looking at
• DTD well-formedness
• Features being used
• Structure (size): depth, #elements, #attributes
• Structure (redundancy)
• Content-model (type)
• Content-model (complexity)
• Recursivity
• Number of roots
• and many other features

Computing features is (relatively) easy.
Finding out what to compute and finding a meaning to them is the hard part.
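
As an illustration of why computing features is the easy part, two of the features listed above (number of roots and recursivity) can be sketched in a few lines of Python. This is not the survey's tooling: the regular expressions, the function names and the definition of a root as "an element that no other element refers to" are assumptions made for the example, and the input is expected to have its entity parameters already expanded.

```python
import re

ELEMENT_DECL = re.compile(r"<!ELEMENT\s+([\w.\-]+)\s+([^>]*)>")
NAME = re.compile(r"[A-Za-z_][\w.\-]*")
KEYWORDS = {"EMPTY", "ANY", "PCDATA"}  # assumes no element uses these names


def element_graph(dtd):
    """Map each declared element to the element names in its content-model."""
    graph = {}
    for name, model in ELEMENT_DECL.findall(dtd):
        graph[name] = {n for n in NAME.findall(model) if n not in KEYWORDS}
    return graph


def roots(graph):
    """Declared elements that no other element refers to."""
    referenced = set()
    for parent, kids in graph.items():
        referenced |= kids - {parent}
    return sorted(set(graph) - referenced)


def recursive_elements(graph):
    """Declared elements that can (transitively) contain themselves."""
    def reachable(start):
        seen, stack = set(), list(graph.get(start, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(graph.get(node, ()))
        return seen

    return sorted(n for n in graph if n in reachable(n))


dtd = """
<!ELEMENT xbel (title?, (bookmark|folder)*)>
<!ELEMENT folder (title?, (bookmark|folder)*)>
<!ELEMENT bookmark (title?)>
<!ELEMENT title (#PCDATA)>
"""
graph = element_graph(dtd)
print(roots(graph), recursive_elements(graph))
```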


DTD “well-formedness”
• Result
  – most DTDs are incorrect: typos, missing declarations, invalid declarations, duplicate declarations
• Possible explanations
  – DTDs seem to be used more for documentation than for validation or anything else.
  – No good tools available to help in DTD authoring. Tools are now appearing (XML-Spy, XML-Authority).
• Issue
  – DTDs supposedly used to validate documents

Quis custodiet ipsos custodes?
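
Two of the problems named above, missing declarations and duplicate declarations, are mechanical enough to detect automatically. The sketch below shows one possible check; it is not the procedure used in the survey (cleansing there was done by hand), the names `check_declarations`, `ELEMENT_DECL` and `NAME` are hypothetical, and entity parameters are assumed to be expanded beforehand.

```python
import re
from collections import Counter

ELEMENT_DECL = re.compile(r"<!ELEMENT\s+([\w.\-]+)\s+([^>]*)>")
NAME = re.compile(r"[A-Za-z_][\w.\-]*")
KEYWORDS = {"EMPTY", "ANY", "PCDATA"}  # assumes no element uses these names


def check_declarations(dtd):
    """Return (missing, duplicates) for the element declarations in dtd.

    missing:    names used in some content-model but never declared
    duplicates: names declared more than once
    """
    decls = ELEMENT_DECL.findall(dtd)
    counts = Counter(name for name, _ in decls)
    duplicates = sorted(n for n, c in counts.items() if c > 1)
    referenced = set()
    for _, model in decls:
        referenced |= {n for n in NAME.findall(model) if n not in KEYWORDS}
    missing = sorted(referenced - set(counts))
    return missing, duplicates


broken = """
<!ELEMENT a (b, c)>
<!ELEMENT b (#PCDATA)>
<!ELEMENT b EMPTY>
"""
print(check_declarations(broken))
```

For the deliberately broken input, c is used but never declared and b is declared twice.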


Well-formedness

[Figure: breakdown of well-formedness problems by cause: typos, syntax error, lack of knowledge on DTD, variant of DTD, external entities needed, cannot be corrected]


DTD unused features
• Results
  – notations and fancy attribute types are almost never used
  – ID and IDREF are rarely used (more later on)
  – “+” is not often used
• Possible explanations:
  – DTD specification is CRYPTIC
  – DTD features are used as needed
  – un-typed references are not useful to people
• Issue
  – why not get rid of useless features?

36 out of 60 are not using IDREFs


DTD sizes vary a lot
• Result
  – a huge spectrum
• Possible explanations
  – people are using DTDs to model almost everything
• Issues
  – the size of the DTD potentially influences the size of the XML document
  – a “big” DTD makes validation more expensive
  – a “big” DTD implies a big corresponding type

              min    max    avg    med
#elements       4    590     74     36
#attributes     0  5,700    390     82
#entities       0    190     15      2


DTDs are redundant
• What we mean by redundancy
  – CM redundant: same content-model
  – redundant: CM redundant + same attributes
• Result
  – many DTDs are highly redundant
  – different elements can have the exact same content-model
• Possible explanations
  – since DTDs do not offer any mechanism for inheritance, people are forced to re-declare things many times.

       CM Redundancy   Redundancy
Min          0%            0%
Max         95%           95%
Avg         55%           34%
Med         57%           37%
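
The slides do not spell out how the percentages are computed, so the following is only one plausible reading, stated as an assumption: CM redundancy is taken to be the share of declared elements whose content-model (after whitespace normalization) is also used by at least one other element. The function name `cm_redundancy` is hypothetical, and attributes are not considered.

```python
import re
from collections import defaultdict

ELEMENT_DECL = re.compile(r"<!ELEMENT\s+([\w.\-]+)\s+([^>]*)>")


def cm_redundancy(dtd):
    """Fraction of elements sharing their content-model with another one.

    Assumed metric, for illustration only.
    """
    groups = defaultdict(list)
    for name, model in ELEMENT_DECL.findall(dtd):
        # Normalize whitespace so "(a, b)" and "(a,b)" count as equal.
        groups["".join(model.split())].append(name)
    total = sum(len(names) for names in groups.values())
    shared = sum(len(names) for names in groups.values() if len(names) > 1)
    return shared / total if total else 0.0


dtd = """
<!ELEMENT title (#PCDATA)>
<!ELEMENT desc (#PCDATA)>
<!ELEMENT info (metadata+)>
<!ELEMENT metadata EMPTY>
"""
print(cm_redundancy(dtd))
```

Here title and desc share the same content-model while info and metadata do not, so two of the four elements count as CM redundant.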


Tuple encoding
• Result
  – people are misusing content-models (CM) to encode tuples
  – tuple(a, b, c)
    • in SGML: ( a & b & c )
    • in XML: (a,b,c) | (a,c,b) | (b,c,a) | (b,a,c) | (c,b,a) | (c,a,b)
    • in practice: (a|b|c)*
• Possible explanations
  – DTDs do not offer a tuple construct and people had to find a workaround.
• Issue
  – (a|b|c)* is a misleading representation of tuples: the type is much larger than needed.
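
The gap between the two encodings can be made explicit with a small sketch. The three helper names are hypothetical and the element names are assumed to be distinct: `permutation_model` writes out the exact XML encoding of a tuple, `tuple_accepts` implements the intended meaning (each member exactly once, in any order), and `loose_accepts` implements what (a|b|c)* really allows.

```python
from itertools import permutations


def permutation_model(names):
    """Exact XML content-model for an unordered tuple: n! alternatives."""
    return " | ".join("(" + ",".join(p) + ")" for p in permutations(names))


def tuple_accepts(names, children):
    """Intended meaning: every name exactly once, in any order."""
    return sorted(children) == sorted(names)


def loose_accepts(names, children):
    """Meaning of (a|b|c)*: any number of the names, in any order."""
    return all(child in names for child in children)


names = ["a", "b", "c"]
print(permutation_model(names))
for children in (["c", "a", "b"], ["a", "a"], []):
    print(children, tuple_accepts(names, children), loose_accepts(names, children))
```

The exact encoding grows factorially (24 alternatives for four members), which helps explain why authors fall back on the loose form even though it also admits repeated and missing members.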


Entities to model inheritance
• Result
  – inheritance is captured by entity parameters, but this is purely syntactic
• Possible explanations
  – this is the only way to avoid repeating the same CM
  – this is the only way to make CM extensible
• Issues
  – this notion of inheritance is misleading, especially when tuples are encoded using (a1|..|an)*


DTD Art Gallery


What’s wrong with DTDs?

– Too document-oriented (interface with text processing)
– Too simple and too complicated at the same time

– No notion of record
– IDREFs are not typed
– No notion of inheritance/sub-typing
– No notion of constraints
– No obvious way to offer versioning, extension, evolution

– Content-model ambiguous
– Too many ways to represent the same thing


Replacement Candidates
• Grammar-based approaches
  – XML-Data/BizTalk, SOX, DCD
  – XML-Schemas
  – DSD
• Constraint-based approaches
  – Schematron
  – Relax
• Type-based approaches
  – XDuce
  – Haskell
• Other
  – UML


XML-Schemas
• The name itself is misleading
• Official goal: to replace DTDs and XML-Data
• Supported by all the major players
• XML syntax
• More expressive than DTDs
  – atomic types
  – better semantics for the content-model (tuples are back!)
  – mechanisms to extend and restrict schemas
  – constraints (limited, based on XPath)
• XML-Schema = data types + structures + constraints*


Schematron
• Idea: encoding structure using tree constraints
• Not based on grammars but on tree patterns
• Semantics
  – find a context node in the document
  – check for constraints (i.e. XPath expressions)
• Features
  – in the spirit of XSL (patterns, rules)
  – based on XPath
• Benefits
  – a “schema” specification can be more or less refined
  – supports variations of the schema (versions, etc.)
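
The two-step semantics (find a context node, then check constraints on it) can be imitated in a few lines of Python. This is a sketch of the idea only, not Schematron itself: the constraints are written as Python predicates instead of XPath expressions, and `RULES` and `validate` are names invented for the example.

```python
import xml.etree.ElementTree as ET

# Each rule: (context element name, [(message, predicate on the node)]).
RULES = [
    ("schema", [
        ("only title and pattern children",
         lambda e: all(child.tag in ("title", "pattern") for child in e)),
        ("at least one pattern",
         lambda e: e.find("pattern") is not None),
    ]),
    ("pattern", [
        ("only rule children",
         lambda e: all(child.tag == "rule" for child in e)),
        ("at least one rule",
         lambda e: e.find("rule") is not None),
        ("has a name attribute",
         lambda e: "name" in e.attrib),
    ]),
]


def validate(root, rules=RULES):
    """Return the list of failed assertions, in rule order."""
    failures = []
    for context, assertions in rules:
        for node in root.iter(context):      # step 1: find context nodes
            for message, test in assertions:  # step 2: check constraints
                if not test(node):
                    failures.append(context + ": " + message)
    return failures


doc = ET.fromstring("<schema><pattern><rule/></pattern><phase/></schema>")
print(validate(doc))
```

Because each rule is independent, a schema written this way can be made more or less strict simply by adding or removing rules.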


Schematron (cont’d)

<schema>
  <pattern name="The Closed Schematron DTD 1.0a">
    <rule context="schema">
      <assert test="count(*) = count(pattern | title)"></assert>
      <assert test="pattern"></assert>
      <report test="phase"></report>
    </rule>
    <rule context="pattern">
      <assert test="count(*) = count(rule)"></assert>
      <assert test="rule"></assert>
      <assert test="@name"></assert>
    </rule>
    ...
  </pattern>
</schema>

<!-- +//IDN sinica.edu.tw//DTD Schematron 1.0a//EN -->
<!ELEMENT schema ( title?, pattern+ )>
<!ELEMENT assert ( #PCDATA )>
<!ELEMENT pattern ( rule+ )>
<!ELEMENT report ( #PCDATA )>
<!ELEMENT rule ( assert | report )+>
<!ELEMENT title ( #PCDATA )>
<!ATTLIST schema ns CDATA #IMPLIED >
<!ATTLIST assert test CDATA #REQUIRED >
<!ATTLIST pattern name CDATA #REQUIRED
                  see CDATA #IMPLIED >
<!ATTLIST report test CDATA #REQUIRED >
<!ATTLIST rule context CDATA #REQUIRED >

You can for instance “type” ID/IDREF relationships.
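
What “typing” an ID/IDREF relationship means can be sketched on the XBel example: a DTD only guarantees that alias/@ref points at some ID, whereas a constraint can additionally require the target to be a bookmark or a folder. The function name `bad_references`, its parameters and the sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def bad_references(root, ref_tag="alias", ref_attr="ref",
                   allowed=("bookmark", "folder")):
    """Return (ref value, target element name) for every reference whose
    target is missing (None) or is not one of the allowed element types."""
    by_id = {e.get("id"): e.tag for e in root.iter() if e.get("id") is not None}
    problems = []
    for element in root.iter(ref_tag):
        target = by_id.get(element.get(ref_attr))
        if target not in allowed:
            problems.append((element.get(ref_attr), target))
    return problems


doc = ET.fromstring(
    '<xbel id="root">'
    '<bookmark id="b1" href="http://example.org/"/>'
    '<alias ref="b1"/>'
    '<alias ref="root"/>'
    '<alias ref="zzz"/>'
    '</xbel>'
)
print(bad_references(doc))
```

The first alias is fine, the second points at an element of the wrong type (which DTD validation would accept), and the third dangles.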


XDuce

Session 7 on Query Processing:

“XDuce, A typed XML Processing Language”

Haruo Hosoya, Benjamin C. Pierce

Univ. of Pennsylvania


XML Bindings

• There is only a “gentlemen’s agreement” between the application and its XML environment.
• Why do we need to go beyond that?
  – performance
  – static guarantees
• How do we create a tight contract between the application and its XML environment?

[Figure: XML (input) → Application → XML (output)]


Conclusions
• XML is just syntax.
• For serious applications, XML documents need a specification (a “soul”).
• DTDs are simply not adequate for the job.
• Even worse: people have been hacking DTDs, which means that they often provide misleading information.
  – DTD inference is one way to solve the problem.


Conclusion (cont’d)
• By looking at these hacks, we can get a pretty clear idea of what is needed.
• The good news: most issues are addressed by replacement candidates.
• Bad news: important issues still not addressed:
  – versioning
  – constraints
  – bindings
• Future work:
  – relation between DTDs and documents they describe
  – DTD categorization using NLP
  – metrics for DTD (what is a good DTD)


Questions

Stay tuned:

http://xml.cis.upenn.edu/DTDi


Use of structural information (from XML Schema Requirements)

– Publishing and syndication
– Electronic commerce transaction processing
– Supervisory control and data acquisition
– Traditional document authoring/editing governed by schema constraints
– Use schema to help query formulation and optimization
– Open and uniform transfer of data between applications, including databases
– Metadata Interchange
– Language binding