ICS-FORTH, May 25, 2001
The Utility of XML
Martin Doerr
Foundation for Research and Technology - Hellas, Institute of Computer Science
Center for Cultural Informatics
Heraklion, May 25, 2001
XML is
XML is a compromise between databases and free text.
It takes the best of both sides without being perfect on either.
It is readable. It allows meaning to be disambiguated.
It is simple.
It is rich enough to open a new systems paradigm.
What is a Document?
A composite statement: a unit relating known facts, items and categories to new knowledge, expressed in language or in other media.
It has an inner logic: the knowledge itself, independent of language and form.
It has a meaningful structure: the sequence, arrangement or linking used to render the inner logic.
It has a presentation: structure and style that assist perception and impression.
A document
The statements…
Diego Velasquez is Spanish.
Diego Velasquez lived 1599-1660.
Diego Velasquez painted “Juan de Pareja”.
“Juan de Pareja” is a painting.
“Juan de Pareja” has dimensions 81.3 x 69.9 cm.
Juan de Pareja is Moorish.
Juan de Pareja is a painter.
Philip IV sent Velasquez to Italy.
…
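The statements above can be sketched as subject-predicate-object triples, one possible way to hold a document's inner logic independently of its wording. The predicate names (`nationality`, `painted`, `is_a`, ...) and the "(painting)" suffix used to tell the painting from the person are hypothetical labels chosen for illustration, not a vocabulary from the talk.

```python
from collections import defaultdict

# Each statement as a (subject, predicate, object) triple.
# Predicate names are illustrative assumptions, not a standard vocabulary.
statements = [
    ("Diego Velasquez", "nationality", "Spanish"),
    ("Diego Velasquez", "lived", "1599-1660"),
    ("Diego Velasquez", "painted", "Juan de Pareja (painting)"),
    ("Juan de Pareja (painting)", "is_a", "painting"),
    ("Juan de Pareja (painting)", "dimensions_cm", "81.3 x 69.9"),
    ("Juan de Pareja", "nationality", "Moorish"),
    ("Juan de Pareja", "is_a", "painter"),
]

def about(subject):
    """Collect every predicate/object pair stated about one subject."""
    facts = defaultdict(list)
    for s, p, o in statements:
        if s == subject:
            facts[p].append(o)
    return dict(facts)

print(about("Diego Velasquez"))
```

Keeping the painting and the sitter as separate subjects shows why free text alone is ambiguous: the same name refers to two different things.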
Another document
What’s Wrong with HTML
<B>MONET, Claude</B><BR>
Haystacks at Chailly at Sunrise<BR>
1865<BR>
Oil on canvas<BR>
30 x 60 cm (11 7/8 x 23 3/4 in.)<BR>
San Diego Museum of Art<BR>
<P><IMG SRC="http://192.41.13.240/artchive/m/monet/hayricks.jpg">
If written properly, normal HTML may reflect document presentation, but it cannot adequately represent the semantics and structure of the data.
Fields the markup does not identify: Artist Name, Artifact Title, Date, Material, Dimensions, Museum, Image Reference
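A minimal sketch of what this means for a program reading the HTML above: the only way to recover the fields is to guess them from their position between `<BR>` tags. The field list in `fields` is a hand-written assumption about this one page's layout; nothing in the markup supplies it.

```python
import re

html = ('<B>MONET, Claude</B><BR>Haystacks at Chailly at Sunrise<BR>1865<BR>'
        'Oil on canvas<BR>30 x 60 cm (11 7/8 x 23 3/4 in.)<BR>'
        'San Diego Museum of Art <BR>')

# Drop the bold tags, then split on <BR>; the markup only says "new line".
parts = [p.strip() for p in re.sub(r"</?B>", "", html).split("<BR>") if p.strip()]

# Meaning is attached purely by position (an assumption about this page).
fields = ["artist", "title", "date", "material", "dimensions", "museum"]
record = dict(zip(fields, parts))

print(record)
```

A page that lists the date before the title would produce a wrong record without any error, which is the weakness the slide points at.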
User Problems / Design Reasons
Preserving information units: who said it; self-contained
Entering data: what can I say, what should I say, how can I say it?
Rendering data: how to tell my child, the public…
Accessing data: querying, mediation
Reusing data: transmission to other environments, merging, evolution of local system, preservation for future use.
In Technical Terms
Transformation that preserves meaning
Correct adaptation of presentation without knowing the meaning
Packaging information for presentation: “one document”
Sequencing categories for data input
Interpretation of intended meaning: searching
Automatic relating of common meaning: merging of different statements
What’s wrong with
Free texts: Clear packaging, rendering for one target; not machine processable (poor querying, categories incomprehensible to machines); poorly reusable; no help to enter or transform data.
HTML: Solves platform independence of presentation, but has only a weak connection between meaning and presentation structure; not much better than free text.
Databases: Clear logical structure, categorization, machine processable, excellent querying; but presentation, transformation, merging and evolution are difficult, and there are no information units.
XML: Clear packaging, logical structure, machine processable if correctly used, clear separation and relation of meaningful structure and presentation.
Helps with data entry; easy to extend, transform and present. Can be queried, but the structure is not independent of the user's view.
XML and databases
Databases:
Schema first: a complete, inflexible analysis of all categories and their relations precedes the data.
Table structures: indexes prepared, excellent consistency enforcement.
XML:
Data first: structure is explanatory, can come second, need not be formalized, is extensible; DTDs can be combined.
Semi-structured: flexible, but with a weaker guarantee that a question can be answered and reduced consistency enforcement.
Embedded schema: each instance carries the schema it uses; querying works by parsing, without index structures; an ideal transport format.
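The trade-off of semi-structured data can be sketched with Python's standard `xml.etree.ElementTree`. The two records and the `<ARTWORKS>` wrapper are hypothetical; the second record simply omits `<DATE>`, which no schema forbids here.

```python
import xml.etree.ElementTree as ET

# Two hypothetical records; the second has no <DATE> element.
doc = """<ARTWORKS>
  <ARTIFACT><TITLE>Haystacks at Chailly at Sunrise</TITLE><DATE>1865</DATE></ARTIFACT>
  <ARTIFACT><TITLE>Juan de Pareja</TITLE></ARTIFACT>
</ARTWORKS>"""

root = ET.fromstring(doc)

# Querying is a walk over the parsed tree; there is no index to consult.
# findtext() returns None when the element is absent.
dates = {a.findtext("TITLE"): a.findtext("DATE") for a in root.iter("ARTIFACT")}

print(dates)
```

The query runs on both records, but for the second one the question "when was it made?" has no answer, which is the reduced guarantee mentioned above.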
Data First, Embedded Schema
This document carries the interpretation with it. It is readable without knowledge of the schema.
<ARTIST>
  <NAME><FIRST>Claude</FIRST><LAST>Monet</LAST></NAME>
  <ARTWORK>
    <ARTIFACT>
      <TITLE>Haystacks at Chailly at Sunrise</TITLE>
      <DATE>1865</DATE>
      <MATERIAL>Oil on canvas</MATERIAL>
      <DIM Metric='cm'><HEIGHT>30</HEIGHT><WIDTH>60</WIDTH></DIM>
      <DIM Metric='in'><HEIGHT>11 7/8</HEIGHT><WIDTH>23 3/4</WIDTH></DIM>
      <LOCATION>San Diego Museum of Art</LOCATION>
      <IMAGE File='http://192.41.13.240/artchive/m/monet/hayricks.jpg'/>
    </ARTIFACT>
  </ARTWORK>
</ARTIST>
What’s important
Data first: delayed analysis, preserves data.
Embedded schema: facilitates data transport, readable in the future.
Separation of semantics and presentation: enables information reuse.
Guides and controls data entry.
The same meaning can be encoded in multiple formats: DTD design depends on the purpose (transport, presentation, data entry…).
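The separation of semantics and presentation can be sketched as one record rendered two ways. The record is a reduced, hypothetical fragment, and the two renderers (`as_html`, `as_label`) are illustrative assumptions rather than anything prescribed in the talk.

```python
import xml.etree.ElementTree as ET
from html import escape

# One semantic record (a reduced, hypothetical fragment).
record = ET.fromstring(
    "<ARTIFACT><TITLE>Haystacks at Chailly at Sunrise</TITLE>"
    "<DATE>1865</DATE><MATERIAL>Oil on canvas</MATERIAL></ARTIFACT>"
)

def as_html(r):
    """Web presentation: title in bold, then one line per field."""
    return (f"<B>{escape(r.findtext('TITLE'))}</B><BR>"
            f"{escape(r.findtext('DATE'))}<BR>{escape(r.findtext('MATERIAL'))}")

def as_label(r):
    """Plain-text caption, e.g. for a printed label."""
    return f"{r.findtext('TITLE')}, {r.findtext('DATE')}. {r.findtext('MATERIAL')}."

print(as_html(record))
print(as_label(record))
```

Both outputs come from the same data, so a change of presentation never touches the stored meaning.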
Useful Applications
Prescription for documentation / input.
Data transfer between systems (“middleware”).
Document bases with full query access.
Combining databases with XML documents: mission-critical data in tables and DTD, rich extensible structures in DTD only.
Creating data for long-term use: machine readable even from paper!
Creating information sets for multiple presentations.
Final Remark
How to encode meaning without structural ambiguities?
=> use RDF/RDFS
How to standardize the meaning of element types (tags)?
=> use ontologies, e.g. formulated in RDFS!