The XML Trial: FINDING of FACTS Arnaud Sahuguet, Chief Inquisitor Penn Database Research Group

Dec 19, 2015

Page 1

The XML Trial: FINDING of FACTS

Arnaud Sahuguet, Chief Inquisitor
Penn Database Research Group

Page 2

Preliminary Remarks

Page 3

What is XML used for

• Messages (XML-RPC)
• Text content (HTML, WML)
• Data content (FinXML, BioML)
• Documents (DocBook)
• Component serialization (Java Beans)

Everything!

Page 4

XML applications have different properties/requirements

• Order vs. no order
• Notion of “equivalence” between documents
• Nested vs. flat
• Structure vs. semi-structure

Page 5

XML and DTDs are 2 distinct issues

• XML does not need DTDs (well-formedness)
• The structure of an XML document can be described using other representations
  – grammars
  – schemas
• Questioning DTDs does not mean questioning XML itself
• XML is just a mark-up after all
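
The first bullet can be checked directly: a parser that never looks at a DTD can still decide well-formedness. A minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and the two sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as XML; no DTD is consulted."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical sample documents (no DOCTYPE, no DTD anywhere).
good = "<folder><title>Penn</title><bookmark href='http://example.org/'/></folder>"
bad = "<folder><title>Penn</folder>"  # <title> is never closed

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```

Well-formedness only says that the tags nest properly; it says nothing about which elements are allowed where, which is the job of a DTD or another schema formalism.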

Page 6

DTDs

Page 7

What is a DTD [ISO 8879]

A document type definition specifies:
• the generic identifiers (GIs) of elements that are permissible in a document of this type
• for each GI, the possible attributes, their range of values and defaults
• for each GI, the structure of its contents, including:
  – which elements can occur and in what order
  – whether text characters can occur
  – whether non-character data can occur

• The purpose of a DTD is to make it possible to determine whether the mark-up of an individual document is correct, and also to supply mark-up that is missing because it can be inferred unambiguously from other mark-up present.

Page 8

What is a DTD (cont’d)

• A DTD contains
  – element declarations
  – attribute declarations
  – entity declarations
  – parameter entity declarations
  – notations
  – processing instructions (<? … ?>)

• Elements are defined according to a content model, built with the operators ? + * , |
• Attributes can be of type CDATA, NMTOKEN, etc.
• Attributes can be optional (#IMPLIED), mandatory (#REQUIRED) or carry a fixed value (#FIXED)
• Mixed content corresponds to (#PCDATA|xxx)*
• Notations are a way to describe the content of an unparsed entity (e.g. a jpg picture)

Page 9

What is a DTD (cont’d)

<!NOTATION jpg PUBLIC '-//JPG'>

<!ENTITY folder   SYSTEM "folder.jpg"   NDATA jpg>
<!ENTITY bookmark SYSTEM "bookmark.jpg" NDATA jpg>

<!ENTITY % local.node.att  "">
<!ENTITY % local.url.att   "">
<!ENTITY % local.nodes.mix "">

<!ENTITY % node.att
  "id    ID    #IMPLIED
   added CDATA #IMPLIED
   %local.node.att;">

<!ENTITY % url.att
  "href     CDATA #REQUIRED
   visited  CDATA #IMPLIED
   modified CDATA #IMPLIED
   %local.url.att;">

<!ENTITY % nodes.mix
  "bookmark|folder|alias|separator
   %local.nodes.mix;">

<!ELEMENT xbel (title?, info?, desc?, (%nodes.mix;)*)>
<!ATTLIST xbel
  %node.att;
  version CDATA #FIXED "1.0">

<!ELEMENT title (#PCDATA)>

<!ELEMENT info (metadata+)>

<!ELEMENT metadata EMPTY>
<!ATTLIST metadata owner CDATA #REQUIRED>

<!ELEMENT folder (title?, info?, desc?, (%nodes.mix;)*)>
<!ATTLIST folder
  %node.att;
  folded (yes|no) #FIXED 'yes'>

<!ELEMENT bookmark (title?, info?, desc?)>
<!ATTLIST bookmark
  %node.att;
  %url.att;>

<!ELEMENT desc (#PCDATA)>

<!ELEMENT separator EMPTY>

<!ELEMENT alias EMPTY>
<!ATTLIST alias ref IDREF #REQUIRED>
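
To make the content-model notation concrete, here is a small sketch that turns an element-only content model such as the `bookmark` model `(title?, info?, desc?)` into a regular expression over child names and checks an element's children against it. It is an illustration under assumptions, not a DTD validator: the function names are made up, parameter entities are assumed to be already expanded, and #PCDATA, EMPTY and ANY are not handled.

```python
import re
import xml.etree.ElementTree as ET

OPERATORS = {"(", ")", "|", "?", "*", "+"}

def model_to_regex(model):
    """Translate a content model over element names into a compiled regex.

    Every element name becomes the token 'name ' (name plus one space),
    so that 'title' cannot match a prefix of a longer name.
    """
    parts = []
    for tok in re.findall(r"[A-Za-z_][\w.-]*|[()|?*+,]", model):
        if tok == ",":
            continue                      # sequence = plain concatenation
        if tok == "(":
            parts.append("(?:")           # non-capturing group
        elif tok in OPERATORS:
            parts.append(tok)             # ) | ? * + mean the same in a regex
        else:
            parts.append("(?:" + re.escape(tok) + " )")
    return re.compile("".join(parts))

def children_match(elem, model):
    """True if the sequence of child element names satisfies the model."""
    seq = "".join(child.tag + " " for child in elem)
    return model_to_regex(model).fullmatch(seq) is not None

ok = ET.fromstring("<bookmark><title>t</title><desc>d</desc></bookmark>")
swapped = ET.fromstring("<bookmark><desc>d</desc><title>t</title></bookmark>")

print(children_match(ok, "(title?, info?, desc?)"))       # True
print(children_match(swapped, "(title?, info?, desc?)"))  # False
```

The translation works because content models are regular expressions over element names; only the surface syntax differs.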

Page 10

What is the role of a DTD

• Constrain structure: SCHEMA
• Declare entities: MODULARITY
• Provide some default values for attributes

Page 11

XML vs SGML

• No tag omission
• No exceptions
• Restriction for mixed content
• No AND (&) operator
• No distinction between CDATA and RCDATA
• In SGML, 39 types of attributes

Page 12

How are DTDs being used

Page 13

Methodology of the survey

• Harvesting
  – xml.org
• Cleansing
  – missing elements, typos, etc.
• Normalization
  – expansion of entities
  – translation into our internal data-model
• “Mining”
• Visualization
• Data Model
  – Node = list of Node, list of Attribute
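
The normalization step can be pictured roughly as follows. This is only a sketch under assumptions, not the survey's actual tooling: the names `Node` and `expand_parameter_entities`, the regular expressions and the flat reading of the content model are illustrative. The entity values are the ones from the XBEL fragment shown earlier.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Node:
    """Internal data model from the slide: a list of Nodes and a list of Attributes."""
    name: str
    children: list = field(default_factory=list)    # list of Node
    attributes: list = field(default_factory=list)  # list of attribute names

def expand_parameter_entities(text, entities):
    """Replace %name; references until none is left (entities may nest)."""
    pattern = re.compile(r"%([\w.-]+);")
    for _ in range(100):                  # guard against circular definitions
        expanded = pattern.sub(lambda m: entities[m.group(1)], text)
        if expanded == text:
            return expanded
        text = expanded
    raise ValueError("parameter entities seem to be circular")

entities = {
    "local.nodes.mix": "",
    "nodes.mix": "bookmark|folder|alias|separator%local.nodes.mix;",
}
model = expand_parameter_entities("(title?, info?, desc?, (%nodes.mix;)*)", entities)
print(model)  # (title?, info?, desc?, (bookmark|folder|alias|separator)*)

# Flat translation into the data model: operators and nesting are dropped,
# only the names of the possible children are kept.
folder = Node("folder",
              children=[Node(n) for n in re.findall(r"[A-Za-z_][\w.-]*", model)],
              attributes=["id", "added", "folded"])
print([child.name for child in folder.children])
```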

Page 14

Issues that have not been looked at

• Are DTDs being used
  – well-formed vs. valid documents
• How do documents actually use DTDs
  – what is the meaning of * or +

Page 15

Most DTDs are not correct

• missing elements
• wrong syntax
• incompatible attribute declarations

If the DTD itself is not correct, how can we hope to validate any document?
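
The "missing elements" case is easy to detect mechanically. A rough sketch, assuming parameter entities have already been expanded and using a deliberately naive regular expression instead of a real DTD parser; the function name and the three-line sample DTD are made up for illustration.

```python
import re

KEYWORDS = {"EMPTY", "ANY", "PCDATA"}

def declared_and_referenced(dtd):
    """Return (declared element names, names referenced in content models)."""
    declared, referenced = set(), set()
    for name, model in re.findall(r"<!ELEMENT\s+([\w.-]+)\s+([^>]*)>", dtd):
        declared.add(name)
        for ref in re.findall(r"[A-Za-z_][\w.-]*", model):
            if ref not in KEYWORDS:
                referenced.add(ref)
    return declared, referenced

sample = """
<!ELEMENT info (metadata+)>
<!ELEMENT bookmark (title?, info?, desc?)>
<!ELEMENT title (#PCDATA)>
"""

declared, referenced = declared_and_referenced(sample)
missing = sorted(referenced - declared)
print(missing)  # ['desc', 'metadata']
```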

Page 16

DTDs are not always a connected graph

• The root element is defined only at the level of the document (in its DOCTYPE declaration), not in the DTD itself.

• Use of ANY
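
Connectivity can be tested by viewing the DTD as a graph whose edges go from an element to the elements named in its content model. The sketch below is illustrative only: the function names and the sample DTD are assumptions, parameter entities are assumed to be expanded, and ANY is treated as having no outgoing edges although it really points to every declared element.

```python
import re

KEYWORDS = {"EMPTY", "ANY", "PCDATA"}

def element_graph(dtd):
    """Map each declared element to the set of element names in its content model."""
    graph = {}
    for name, model in re.findall(r"<!ELEMENT\s+([\w.-]+)\s+([^>]*)>", dtd):
        graph[name] = set(re.findall(r"[A-Za-z_][\w.-]*", model)) - KEYWORDS
    return graph

def reachable(graph, root):
    """All element names reachable from `root`, including `root` itself."""
    seen, stack = set(), [root]
    while stack:
        name = stack.pop()
        if name not in seen:
            seen.add(name)
            stack.extend(graph.get(name, ()))
    return seen

def possible_roots(graph):
    """Elements from which every declared element can be reached."""
    return sorted(r for r in graph if reachable(graph, r) >= set(graph))

sample = """
<!ELEMENT xbel (title?, folder*)>
<!ELEMENT folder (title?, folder*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT orphan EMPTY>
"""

graph = element_graph(sample)
print(sorted(set(graph) - reachable(graph, "xbel")))  # ['orphan']
print(possible_roots(graph))                          # []
```

An empty list of possible roots means that no single element covers the whole DTD, which is the situation the slide describes.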

Page 17

Encoding of tuples

• Given the absence of the AND content-model, most DTDs represent tuple <a,b,c> as (a|b|c)*.

• The correct syntax would be:
  – SGML: ( a & b & c )
  – XML: (a,b,c) | (a,c,b) | (b,c,a) | (b,a,c) | (c,b,a) | (c,a,b)
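
The XML form enumerates every ordering, so its size grows as n! with the number of fields, which helps explain why DTD authors fall back on (a|b|c)*. A small sketch that generates the enumeration; the function name is made up for illustration.

```python
from itertools import permutations
from math import factorial

def unordered_tuple_model(names):
    """XML 1.0 content model accepting each name exactly once, in any order."""
    return " | ".join("(" + ",".join(p) + ")" for p in permutations(names))

print(unordered_tuple_model(["a", "b", "c"]))
# (a,b,c) | (a,c,b) | (b,a,c) | (b,c,a) | (c,a,b) | (c,b,a)

# Number of alternatives needed for tuples of 3, 5 and 8 fields.
for n in (3, 5, 8):
    print(n, factorial(n))
```

Note that several alternatives start with the same element, so a processor that insists on deterministic content models may reject this form as written.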

Page 18

“|” is used and overused

Page 19

Encoding Inheritance

• Parameter entities are used to capture “syntactic” inheritance

Page 20

Some features are almost never used

• Attribute types like NMTOKEN
• IDREF
• Notations
• Processing instructions

Page 21

Global comments

• DTDs are full of mistakes
• DTDs, when expanded, are really messy
• People tend to pick a “type” much larger than they really need

Page 22

What’s wrong with DTDs

• too document-oriented
  – DTDs have been designed to interface with text processing tools
• too simple and too complicated at the same time
• too limited to represent complex structures
• IDREFs are not typed
• no notion of record
• no notion of inheritance/sub-typing
• content models can be ambiguous
• too many ways to represent the same thing
• names are global, not local
• no obvious way to offer versioning, extension, evolution

Page 23

Improvements

Page 24

Analogy XML/ProgLang

ProgLang             XML
=========================================
type-checking        validation
constants            entity reference
macros               parameter entity
void*                ANY
void*                IDREF
header file          DTD
#ifdef               conditional section
standard library     key entities
namespace            namespace

It is worth noting that features like inheritance, type inference, polymorphism or modules are missing.

Page 25

Analogy XML/ProgLang (cont’d)

XML      Functional             Object Oriented
================================================
|        variant, union type    abstract class
,        record with order      ordered inheritance
&        record                 inheritance
?        option                 null
+/*      list                   list

Have a look at Phil Wadler’s talk.

Page 26

Immediate Solutions

• Remove ANY
• DTD = single rooted connected graph
• Support for “&”
• Need for DTD validators
• Forget about:
  – notations
  – conditional sections
  – ID IDREF (as they are)

I thought I was too drastic, but on the xml-dev mailing list there is an even more brutal proposal (XML 2.0 alpha, by Rick Jelliffe).

Page 27

What is the future of DTDs

Page 28

Family Tree of Schema Languages for Markup Languages (Rick Jelliffe © 1999)

Page 29

What should DTDs be used for

• validation
• efficient XML storage (persistency extension, or database storage)
• optimization of path-expression queries
• documentation
• design of DTD extensions (to resolve shortcomings)
• efficient parsing
• design of supporting tools

Page 30

How DTDs could be used

• Like software components/libraries
  – import, export
  – inheritance
  – over-riding

• Is there a need for a DTD/schema repository?

Page 31

Major Challenges

• Combining text and data processing in a unified framework
  – Structured query vs. document query
  – Markup algebra
• Doing “versioning” the right way
  – subtyping
  – need for backward/forward compatibility

Page 32

My message

• XML needs some modern PL tools/features

• XML and Java
  – Java is far from perfect, but it established features like gc, threads and distributed computation as sine qua non requirements of a programming language
  – XML should try to do the same for text/data processing

• XML is not an abstract thing. People are using it and we should keep that in mind.

Page 33

On-going research

Page 34

XML algebra

• ECFG (Frank Neven)
• XML model (Phil Wadler)
• Semi-structured schema (Beeri/Milo)
• Deterministic Data Model (Penn)
• Union-types (Peter, Benjamin)
• Mark-up Algebra
  – WebL
  – Algebra for querying Text Regions (Consens/Milo)
  – Nested Text-Region Algebra (Jaakkola, Kilpeläinen) + sgrep

Page 35

XML semantics

• Path constraints (Jérôme Siméon, Wenfei Fan)
• XSLT and XPath (Phil Wadler)
• Extending DB constraints for Codi to XML (Penn)
• F-Logic

Page 36

From XML to XML query languages

• XPath and XSLT (Phil Wadler)
• XSLT (Frank Neven)
• XML-QL
• UnQL

Page 37

Misc.

Page 38

Mapping from OO to XML (POQL/INRIA)

• Class Person tuple(name: string, age: integer, spouse: Person)

• <!ELEMENT person (name, age, person?)>
  <!ATTLIST person
    id     ID    #IMPLIED
    spouse IDREF #REQUIRED>

<person id="p1" spouse="p2">
  <name>Vassilis</name>
  <age>32</age>
  <person>
    <name>Irène</name>
    <age>29</age>
  </person>
</person>

<person id="p1" spouse="p2">
  <name>Vassilis</name>
  <age>32</age>
</person>
<person>
  <name>Irène</name>
  <age>29</age>
</person>
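
As a rough illustration of the flat (ID/IDREF) encoding, the sketch below serializes a cyclic object graph by giving every object an id and writing the spouse as a reference. It is not the POQL/INRIA mapping itself: the `Person` dataclass, the `to_xml` helper, the generated ids and the `people` wrapper element are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

@dataclass
class Person:
    name: str
    age: int
    spouse: Optional["Person"] = None

def to_xml(people):
    """Serialize persons as siblings; spouses are linked by id/IDREF-style attributes."""
    root = ET.Element("people")                  # hypothetical wrapper element
    # Objects are keyed by identity because spouses form a cycle.
    ids = {id(p): f"p{i}" for i, p in enumerate(people, start=1)}
    for p in people:
        e = ET.SubElement(root, "person", id=ids[id(p)])
        if p.spouse is not None:
            e.set("spouse", ids[id(p.spouse)])   # spouse must be in `people`
        ET.SubElement(e, "name").text = p.name
        ET.SubElement(e, "age").text = str(p.age)
    return root

vassilis = Person("Vassilis", 32)
irene = Person("Irène", 29, spouse=vassilis)
vassilis.spouse = irene

root = to_xml([vassilis, irene])
print(ET.tostring(root, encoding="unicode"))
```

The nested encoding cannot represent this cycle without duplicating a person, which is why the reference-based form is needed; the price is that the IDREF carries no type information.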

Page 39

Schematron (Rick Jelliffe)

• Idea: encoding structure using tree constraints
• Not based on grammars but on tree patterns
• Semantics
  – find a context node in the document
  – check for constraints (i.e. XPath expressions)
• Features
  – in the spirit of XSL (patterns, rules)
  – based on XPath
• Benefits
  – a “schema” specification can be more or less refined
  – supports variations of the schema (versions, etc.)
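
The two-step semantics (find context nodes, then check constraints on each) can be sketched in a few lines. This is not the Schematron implementation, which is XSLT-based: the rule table, the function names and the sample document are made up, and tests are limited to "@attr" checks plus the small XPath subset that Python's `xml.etree.ElementTree` understands.

```python
import xml.etree.ElementTree as ET

# Each rule: (context element name, [(test, message), ...]).
RULES = [
    ("schema",  [("pattern", "A schema should contain at least one pattern.")]),
    ("pattern", [("rule", "A pattern should contain at least one rule."),
                 ("@name", "A pattern should have a name attribute.")]),
]

def holds(node, test):
    """Evaluate one assertion on a context node."""
    if test.startswith("@"):
        return test[1:] in node.attrib       # attribute must be present
    return node.find(test) is not None       # at least one matching child

def validate(root, rules):
    """Return the messages of all failed assertions, in document order per rule."""
    failures = []
    for context, asserts in rules:
        for node in root.iter(context):      # step 1: find context nodes
            for test, message in asserts:    # step 2: check constraints
                if not holds(node, test):
                    failures.append(message)
    return failures

doc = ET.fromstring(
    "<schema><pattern><rule context='x'/></pattern><pattern name='p'/></schema>"
)
for message in validate(doc, RULES):
    print(message)
```

Because each rule is independent, a looser or stricter "schema" is obtained simply by removing or adding entries, which is the refinement benefit listed above.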

Page 40

Example

<!-- +//IDN sinica.edu.tw//DTD Schematron 1.0a//EN -->
<!ELEMENT schema ( title?, pattern+ )>
<!ELEMENT assert ( #PCDATA )>
<!ELEMENT pattern ( rule+ )>
<!ELEMENT report ( #PCDATA )>
<!ELEMENT rule ( assert | report )+>
<!ELEMENT title ( #PCDATA )>
<!ATTLIST schema ns CDATA #IMPLIED >
<!ATTLIST assert test CDATA #REQUIRED >
<!ATTLIST pattern
  name CDATA #REQUIRED
  see  CDATA #IMPLIED >
<!ATTLIST report test CDATA #REQUIRED >
<!ATTLIST rule context CDATA #REQUIRED >

<schema>
  <title>Demonstration Patterns for the Schematron Itself</title>
  <pattern name="The Open Schematron DTD 1.0">
    <rule context="schema">
      <assert test="pattern">A schema element should contain at least one pattern elements.</assert>
    </rule>
    <rule context="pattern">
      <assert test="rule">A pattern element should contain at least one rule elements.</assert>
      <assert test="@name">A pattern element should have an attribute called name.</assert>
    </rule>
    <rule context="rule">
      <assert test="assert | report ">A rule element should contain at least one assert or report elements.</assert>
      <assert test="@context">A rule element should have an attribute called context. This should be an XPath for selecting nodes to make assertions and reports about.</assert>
    </rule>
    <rule context="assert">
      <assert test="@test">An assert element should have an attribute called test. This should be an XSLT expression.</assert>
    </rule>
    <rule context="report">
      <assert test="@test">A report element should have an attribute called test. This should be an XSLT expression.</assert>
    </rule>
  </pattern>
</schema>

Page 41

Example (cont’d)

<schema>
  <pattern name="The Closed Schematron DTD 1.0a">
    <rule context="schema">
      <assert test="count(*) = count(pattern | title)">Unexpected element(s) found: a schema element should contain only pattern elements.</assert>
      <assert test="pattern">A schema element should contain at least one pattern element.</assert>
      <report test="phase">The element phase is only used in the 1.2 DTD</report>
    </rule>
    <rule context="pattern">
      <assert test="count(*) = count(rule)">Unexpected element(s) found: A pattern element should contain only rule elements.</assert>
      <assert test="rule">A pattern element should contain at least one rule elements.</assert>
      <assert test="@name">A pattern element should have an attribute called name.</assert>
    </rule>
    <rule context="rule">
      <assert test="count(*) = count(assert | report ) ">Unexpected element(s) found: a rule element should contain only assert and report elements.</assert>
      <assert test="assert | report ">A rule element should contain at least one assert or report elements.</assert>
      <assert test="@context">A rule element should have an attribute called context. This should be an XPath for selecting nodes to make assertions and reports about.</assert>
      <report test="key">The element key is only used in the 1.2 DTD</report>
    </rule>
    <rule context="assert">
      <assert test="@test">An assert element should have an attribute called test. This should be an XSLT expression.</assert>
      <report test="name">The element name is only used in the 1.1 DTD</report>
    </rule>
    <rule context="report">
      <assert test="@test">A report element should have an attribute called test. This should be an XSLT expression.</assert>
      <report test="name">The element name is only used in the 1.1 DTD</report>
    </rule>
  </pattern>
</schema>

Page 42

Anonymous Content Types (Rick Jelliffe)

• EMPTY and ANY are built-in content types
• What about offering some new ones

• SINGLE (only 1 child, no data element)
• PAIR (two children of different types, no data element)
• PAIRS (multiple of PAIR)
• SAME (zero or more elements of the same type)
• LEAF (1 empty sub-element or PCDATA)
• LEAVES (multiple of LEAF)
• UNIQUE (any number of elements; one type appears once)
• NONRECURSIVE (element cannot contain itself)
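
One way to read these content types is as predicates over an element's children, independent of the element names involved. The sketch below covers four of them; the exact semantics are guessed from the one-line descriptions above (for example, "no data element" is taken to mean no character data, and NONRECURSIVE is taken to forbid the same tag at any depth), so the functions are an interpretation rather than the proposal's definition.

```python
import xml.etree.ElementTree as ET

def has_text(e):
    """True if e directly contains non-whitespace character data."""
    chunks = [e.text] + [child.tail for child in e]
    return any(t and t.strip() for t in chunks)

def is_single(e):
    """SINGLE: exactly one child element and no character data."""
    return len(e) == 1 and not has_text(e)

def is_pair(e):
    """PAIR: two children of different types and no character data."""
    return len(e) == 2 and e[0].tag != e[1].tag and not has_text(e)

def is_same(e):
    """SAME: zero or more children, all of the same type."""
    return len({child.tag for child in e}) <= 1

def is_nonrecursive(e):
    """NONRECURSIVE: no descendant carries the element's own tag."""
    return all(d.tag != e.tag for child in e for d in child.iter())

outer = ET.fromstring('<box id="b1"><box id="b2"/></box>')
entry = ET.fromstring("<entry><key/><value/></entry>")

print(is_single(outer), is_same(outer), is_nonrecursive(outer))  # True True False
print(is_pair(entry), is_nonrecursive(entry))                    # True True
```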

Looks like polymorphism to me :-)

Page 43

A new data-model for XML...

• Node, NodeList
• Only “&”, “+” and “?” are allowed

<?xml version="1.0" ?>
<!DOCTYPE box [
<!ELEMENT box ( box )* >
<!ATTLIST box
  id                   ID       #REQUIRED
  length-breadth-width NMTOKENS #REQUIRED
  units                NMTOKEN  #REQUIRED >
]>
<box id="b1" length-breadth-width="3 5 8" units="cm">
  <box id="b2" />
</box>

Page 44

…and what you can do with it

• Some attributes always go together
  – ( id & ( units & length-breadth-width )? )
• IDREFS can be modeled as ( terminal-node & Element )
  – household IDREF can be specified as (mother? & father? & child* & grandparents* & grandchild* & unmarried-sibling* & refugee* & pet* & ghost*)