
COMS E6125 Web-enHanced Information Management (WHIM)

Feb 01, 2016

Page 1: COMS E6125 Web-enHanced Information Management (WHIM)

5 February 2008 Kaiser: COMS E6125 1

COMS E6125 Web-enHanced Information Management (WHIM)

Prof. Gail Kaiser

Spring 2008

Page 2: COMS E6125 Web-enHanced Information Management (WHIM)

Today’s Topic: Markup Languages

• History of markup languages
• SGML = Standard Generalized Markup Language
• HTML = HyperText Markup Language
• XML = eXtensible Markup Language

Page 3: COMS E6125 Web-enHanced Information Management (WHIM)

What is Markup?

• Special text (“mark”) that is added to the regular text of a document in order to convey some information about it
• A markup language is a formalized way of providing markup, and specifies:
  – what markup is allowed (the lexicon)
  – what markup is required
  – how markup is distinguished from content text
  – what the markup “means”

Page 4: COMS E6125 Web-enHanced Information Management (WHIM)

Specific Coding

• Historically, electronic manuscripts contained procedural control codes (markup) that caused the text to be formatted in a particular way
  – tj6
  – troff
  – TeX

Page 5: COMS E6125 Web-enHanced Information Management (WHIM)

Procedural Markup

• Advantages:
  – Instructs agent how to process text
  – Generally concerned with formatting and presentation
  – Is “efficient” because it requires little further interpretation
• Disadvantages:
  – Often specific to one proprietary processing system
  – Usually ties a document to a single purpose
    • printing on paper
    • viewing on a screen
  – Provides no information on “meaning”

Page 6: COMS E6125 Web-enHanced Information Management (WHIM)

Markup Steps

1. Author first analyzes the information structure and other attributes of the document; that is, s/he identifies each meaningful separate element, and characterizes it as a paragraph, heading, ordered list, footnote, or some other element type
2. Author then determines, from memory or a style book, the processing instructions (“marks”) that will produce the format desired for that type of element
3. Finally, s/he inserts the chosen marks into the text

Page 7: COMS E6125 Web-enHanced Information Management (WHIM)

Example Specific Coding

.SK 1
Text processing and word processing systems typically require additional information to be interspersed among the natural text of the document being processed. This added information, called "markup", serves two purposes:
.TB 4
.OF 4
.SK 1
1.#Separating the logical elements of the document; and
.OF 4
.SK 1
2.#Specifying the processing functions to be performed on those elements.
.OF 0
.SK 1

Callouts on the slide: .TB = TaB stop, .OF = OFfset, .SK = SKipping vertical space

Page 8: COMS E6125 Web-enHanced Information Management (WHIM)

Generic Coding

• In contrast, generic (or generalized, or descriptive) coding uses descriptive tags (e.g., “heading”)
  – Scribe
  – LaTeX
  – HTML

Page 9: COMS E6125 Web-enHanced Information Management (WHIM)

Descriptive Markup

• Advantages:
  – Identifies the logical components of a document
  – Generally concerned with what text is
  – Does not specify what procedures are to be applied to text
  – Therefore requires that other process(es) supply formatting and presentation

Page 10: COMS E6125 Web-enHanced Information Management (WHIM)

Descriptive Markup

• Advantages (continued):
  – Is (usually) human and machine readable
  – Identifies information content
  – Is not directed towards a particular purpose or rendition of the document
  – Therefore can be non-proprietary

Page 11: COMS E6125 Web-enHanced Information Management (WHIM)

Markup Steps

1. Author first analyzes the information structure and other attributes of the document; that is, s/he identifies each meaningful separate element, and characterizes it as a paragraph, heading, ordered list, footnote, or some other element type (same as above)
2. Author then associates each significant element with the mnemonic tag (“mark”) that s/he feels best characterizes it

Page 12: COMS E6125 Web-enHanced Information Management (WHIM)

Example Generic Coding

<p>
Text processing and word processing systems typically require additional information to be interspersed among the natural text of the document being processed. This added information, called <em>markup</em>, serves two purposes:
<ol>
<li>Separating the logical elements of the document; and
<li>Specifying the processing functions to be performed on those elements.
</ol>

Page 13: COMS E6125 Web-enHanced Information Management (WHIM)

The Case for Generalized Markup

• Markup should describe a document's structure and other attributes rather than specify processing to be performed on it, so markup need be done only once and will suffice for all future processing

• Markup should be rigorous so that the techniques available for rigorously-defined objects like programs and data bases can be used for processing documents as well
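
The "mark up once, process many ways" argument can be sketched in a few lines of Python. This is only an illustration: the element names (doc, heading, p, ol, li) and both functions are invented for the example and do not come from the lecture.

```python
import xml.etree.ElementTree as ET

# Descriptive markup: it says what each piece of text *is*, not how to format it.
# The vocabulary below is hypothetical.
SOURCE = """<doc>
  <heading>Markup</heading>
  <p>Markup serves two purposes:</p>
  <ol><li>Separating logical elements</li><li>Specifying processing</li></ol>
</doc>"""


def render_plain(root):
    """One possible processing: a plain-text rendition."""
    lines = []
    for el in root:
        if el.tag == "heading":
            lines.append(el.text.upper())
        elif el.tag == "p":
            lines.append(el.text)
        elif el.tag == "ol":
            for n, item in enumerate(el, start=1):
                lines.append(f"  {n}. {item.text}")
    return "\n".join(lines)


def list_headings(root):
    """A second, unrelated processing of the same source: an outline."""
    return [h.text for h in root.iter("heading")]


root = ET.fromstring(SOURCE)
print(render_plain(root))
print(list_headings(root))
```

Neither function required any change to the marked-up source, which is the point the slide makes about descriptive markup.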

Page 14: COMS E6125 Web-enHanced Information Management (WHIM)

Who Invented Markup?

• Specialized markup: ???
• Generalized markup:
  – Many credit William Tunnicliffe, chairman of the Graphic Communications Association Composition Committee, who presented a talk on the separation of information content of documents from their format during a meeting at the Canadian Government Printing Office, September 1967
  – Others credit Stanley Rice, a New York book designer, who proposed the idea of a universal catalog of parameterized editorial structure macros in several articles, e.g., "Editorial Text Structures," Memorandum to Standards Planning and Requirements Committee, ANSI, March 17, 1970

Page 15: COMS E6125 Web-enHanced Information Management (WHIM)

An Early Implementation

• At IBM in 1969, Charles Goldfarb, Ed Mosher and Ray Lorie invented Generalized Markup Language (GML) as part of a law office project integrating text editing with information retrieval and page composition

• Instead of a simple tagging scheme, GML introduced the concept of a formally-defined document type (DTD = Document Type Definition) with an explicit nested element structure

• By 1971 developed first DTD, for the manuals for IBM's “Telecommunications Access Method”, which enabled all the headings of a given head-level to be automatically formatted identically

• Productized in 1973 in IBM’s Document Composition Facility (DCF)

Page 16: COMS E6125 Web-enHanced Information Management (WHIM)

Example GML

:h1.Chapter 1: Introduction
:p.GML supported hierarchical containers, such as
:ol
:li.Ordered lists (like this one),
:li.Unordered lists, and
:li.Definition lists
:eol.
as well as simple structures.
:p.Markup minimization (later generalized and formalized in SGML), allowed the end-tags to be omitted for the "h1" and "p" elements.

Page 17: COMS E6125 Web-enHanced Information Management (WHIM)

SGML = Standard GML

• Standardization effort started in 1978, when ANSI (American National Standards Institute) created the Computer Languages for the Processing of Text Committee
• Series of draft standards 1980-1986 (1983 version adopted by IRS and DoD); ISO (International Organization for Standardization) joined the ANSI effort in 1984
• Final international standard in 1986, based in part on an SGML system developed by Anders Berglund, then of the European Particle Physics Laboratory (CERN)
• Hmm… isn’t CERN where Tim Berners-Lee invented the “World Wide Web” in 1989?

Page 18: COMS E6125 Web-enHanced Information Management (WHIM)

SGML

• A metalanguage (grammar)
• How to write tags, how to define the document structure
• Structural paradigm is that of
  – an inverted tree structure, a root component branching out into leaves
  – or a series of nested containers
• Defines three kinds of objects
  – Elements are the basic structural components
  – Attributes are qualities of elements
  – Entities are a short representation of special characters

Page 19: COMS E6125 Web-enHanced Information Management (WHIM)

SGML Pro and Con

• Advantages:
  – Documents held in a standards-based, non-proprietary, platform-independent storage format
  – Scope for document re-use and re-presentation, enhancement of retrieval possibilities
  – Easy to process
  – Can (optionally) validate against DTDs
• Disadvantages:
  – Remained a niche market in the 1980s, unknown to the masses
  – Not well supported by the major document processing vendors, tools expensive

Page 20: COMS E6125 Web-enHanced Information Management (WHIM)

Then Came the Web…

• HyperText Markup Language (HTML) is derived from SGML
• As an SGML-compliant language, it has a DTD with a fixed set of tags
• Initially, the number of tags was very limited (~10), and they were very easy to remember and to use

Page 21: COMS E6125 Web-enHanced Information Management (WHIM)

HTML Example

<html>
<head>
  <title> My title </title>
</head>
<body>
  <h1> A huge heading </h1>
  <h2> A smaller one </h2>
  <ul>
    <li> a list item in <b>bold</b> </li>
    <li> a list item in <i>italics</i> </li>
  </ul>
  <p> A paragraph </p>
</body>
</html>

Page 22: COMS E6125 Web-enHanced Information Management (WHIM)

Another HTML Example

• From original IETF Internet Draft for HTML

See <A HREF="http://info.cern.ch/">CERN</A>'s information for more details.

A <A NAME=serious>serious</A> crime is one which is associated with imprisonment.

The Organization may refuse employment to anyone convicted of a <a href="#serious">serious</A> crime.

Warning: <IMG SRC="triangle.gif" ALT="Warning:"> This must be done by a qualified technician.

<A HREF="Go"><IMG SRC="Button"> Press to start</A>

Page 23: COMS E6125 Web-enHanced Information Management (WHIM)

HTML Pro and Con

• Advantages
  – Simple to learn and to use
  – Easy to create from scratch or by converting legacy text files
  – Easy to parse and render
• Drawbacks
  – Syntaxless
  – Much more a presentation language than a structural language
  – Too limited, not a good substitute for a word processor

Page 24: COMS E6125 Web-enHanced Information Management (WHIM)

HTML History

• 1990: First implementation by Tim Berners-Lee (TBL) on a NeXT computer at CERN
  – Used SGML tools to create original HTML language (DTD, parser)
  – Scalability and simplicity of HTML (and HTTP), compared to OHS or Gopher, were part of the basis for WWW success
• 1991-1992: Various text-only and graphical browsers developed, latter usually platform-specific

Page 25: COMS E6125 Web-enHanced Information Management (WHIM)

HTML History

• 1993: NCSA Mosaic
  – First widely available graphical WWW browser (Unix X-Windows and Mac)
  – Developed primarily by UIUC undergraduate Marc Andreessen
  – The killer application of the Internet is born and the number of Web servers explodes
• 1994: Competition
  – Mosaic team leaves NCSA to found Netscape
  – Microsoft adopts the Web (Internet Explorer bundled with Windows 95)
  – Divergence of supported HTML tags between Internet Explorer and Netscape –> browser wars
  – HTTP traffic becomes more common than telnet and ftp

Page 26: COMS E6125 Web-enHanced Information Management (WHIM)

HTML History

• 1994-1995: HTML 2.0 adds image maps, forms
• 1995 and beyond: Commercial websites
  – Java development started (as “Oak”) for programming settop boxes in 1991, BIG FAILURE - but launched on Web in March 1995 (in HotJava) and May 1995 (in Netscape), BIG SUCCESS
  – Amazon.com opens in July 1995
  – “dot com” era begins (and soon ends)

Page 27: COMS E6125 Web-enHanced Information Management (WHIM)

HTML History

• Jan 1997: HTML 3.2 adds tables, applets, text flow around images, superscripts and subscripts
• Dec 1997: HTML 4.0 adds frames, cascading style sheets, more multimedia options, scripting languages, web accessibility conventions, internationalization

Page 28: COMS E6125 Web-enHanced Information Management (WHIM)

XHTML = eXtensible HyperText Markup Language

• XHTML 1.0 W3C Recommendation January 2000, revised August 2002 (XHTML 1.1 became a Recommendation in May 2001; its second edition was still a working draft)
• Made element and attribute names case-sensitive (in particular, use lowercase)
• Include end tags, e.g., <p> … </p>
• Add a “/” to empty elements, e.g., <br/> and <hr/>
• Quote all attribute values, e.g., <img src="duck.jpg" alt="A Duck"/>
• Most browsers still work fine with older HTML

Page 29: COMS E6125 Web-enHanced Information Management (WHIM)

Where did the “X” come from?

• XML = eXtensible Markup Language
• XHTML is a reformulation of HTML 4.x in XML
• XHTML can be used in conjunction with other XML vocabularies
  – SMIL (Synchronized Multimedia Integration Language)
  – SVG (Scalable Vector Graphics)
  – MathML (Mathematical Markup Language)
  – Plus hundreds dedicated to specific applications (the extensible part)

Page 30: COMS E6125 Web-enHanced Information Management (WHIM)

What is XML for?

• The universal markup format for structured documents and data on the Web
• For data exchange (messages) and persistent data
• Syntax
• Data Modeling
• Data Processing

Page 31: COMS E6125 Web-enHanced Information Management (WHIM)

XML History

• XML 1.0 became a W3C Recommendation in February 1998, revised several times - most recently September 2006
• XML 1.1 draft released Nov 2003, recommendation last revised September 2006 (addresses various issues with respect to Unicode and mainframe compatibility)
• Conceptually an SGML descendant
• Unlike SGML, it quickly became widespread

Page 32: COMS E6125 Web-enHanced Information Management (WHIM)

SGML->XML

• Like SGML, XML is a grammar (or a metalanguage), NOT a specific language
• Specification simplified
  – SGML spec ~600 pages
  – XML spec 36 pages (initial 1.0) -> 54 pages (1.1 2nd edition)
• Parsing made simpler through two-level mechanism
  – Well-formed
  – Valid

Page 33: COMS E6125 Web-enHanced Information Management (WHIM)

Well-Formed

• (Optionally) starts with XML declaration
  <?xml version="1.0"?>
• Rest of document inside the root element
  <myroot>…</myroot>
• All text contained in some element
  <someelement>text text text</someelement>
• Explicit empty elements
  <anotherelement></anotherelement>
  <anotherelement/>

Page 34: COMS E6125 Web-enHanced Information Management (WHIM)

Well-Formed

• Element tags must be properly nested (no crossing tags)
  NO <i><b>blah blah blah</i></b>
• Start and end tags must match exactly (same case)
• Quotes placed around all attribute values
  <a href="stuff.html">stuff</a>
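
The well-formedness rules on these two slides can be tried out with any non-validating parser. The sketch below uses Python's standard xml.etree.ElementTree; the helper name is_well_formed and the sample strings are made up for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if a non-validating XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


samples = {
    "ok": '<myroot><a href="stuff.html">stuff</a><br/></myroot>',
    "crossing tags": "<i><b>blah blah blah</i></b>",
    "case mismatch": "<P>text</p>",
    "unquoted attribute": "<a href=stuff.html>stuff</a>",
    "text outside root": "<myroot/>stray text",
}

for label, text in samples.items():
    print(f"{label}: {is_well_formed(text)}")
```

Only the first sample passes; each of the others breaks one of the rules listed above.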

Page 35: COMS E6125 Web-enHanced Information Management (WHIM)

Valid

• Well-formed, plus
• Conforms to a DTD or Schema
  – tags and attributes are all declared
  – tags and attributes are used correctly
• XML browsers and editors usually require validity
• Other tools might not (e.g., search engines)
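
To make the two conditions ("all declared", "used correctly") concrete, here is a toy check in Python. It is not DTD or XML Schema validation: the DECLARED table is a hand-written, hypothetical stand-in for a schema, and the standard library parser used here is itself non-validating.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a DTD: for each declared element, the child
# elements and attributes it may carry.
DECLARED = {
    "bibliography": {"children": {"paper"}, "attributes": set()},
    "paper": {"children": {"title", "year"}, "attributes": {"ID"}},
    "title": {"children": set(), "attributes": set()},
    "year": {"children": set(), "attributes": set()},
}


def validity_errors(xml_text):
    """Return a list of violations; an empty list means 'valid' for this toy model."""
    root = ET.fromstring(xml_text)  # raises ParseError if not even well-formed
    errors = []
    for parent in root.iter():
        rule = DECLARED.get(parent.tag)
        if rule is None:
            errors.append(f"undeclared element <{parent.tag}>")
            continue
        for name in parent.attrib:
            if name not in rule["attributes"]:
                errors.append(f"undeclared attribute {name} on <{parent.tag}>")
        for child in parent:
            if child.tag not in rule["children"]:
                errors.append(f"<{child.tag}> not allowed inside <{parent.tag}>")
    return errors


good = '<bibliography><paper ID="goto"><title>T</title><year>1968</year></paper></bibliography>'
bad = '<bibliography><paper ID="goto"><author>X</author></paper></bibliography>'
print(validity_errors(good))
print(validity_errors(bad))
```

Both documents are well-formed; only the first also conforms to the declared vocabulary. A real DTD or Schema additionally constrains order, cardinality, and data types, which this sketch ignores.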

Page 36: COMS E6125 Web-enHanced Information Management (WHIM)

XML Goes Beyond Document Processing

• XML more oriented to distributed computing than to document markup

• Thus complements rather than replaces HTML (or XHTML)

• DOM = Document Object Model

• SAX = Simple API for XML

• SOAP = Simple Object Access Protocol

• Web Services

Page 37: COMS E6125 Web-enHanced Information Management (WHIM)

Let’s Reinvent XML

• Someone in the far future sends a message in a virtual bottle, containing parts of the universal library of human and post-human literature, back into the 1970s when …
• … the Web, XML, P2P, Java were unheard of
• … computer manufacturers talked about mips and kilobytes
• … music was played by rotating vinyl discs under a diamond-tip stylus or on cassette tapes

Page 38: COMS E6125 Web-enHanced Information Management (WHIM)

… and Microsoft looked like

Page 39: COMS E6125 Web-enHanced Information Management (WHIM)

The Message in the Bottle, 1st try

ÐÏ^Qࡱ^Zá^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@>^@^C^@þÿ^@

^F^@^@^@^@^@^@^@^@^@^@^@^A^@^@^@#^@^@^@^@^@^@^@^@^P^@^@%^@^@^@^A^@^@^@þÿÿÿ^@^@^@^@"^@^@^@ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿì¥Á^@q^@^D^@^@^@^R¿^@^@^@^@^@^@^P^@^@^@^@^@^D^@^@Ç^G^@^@^N^@bjbjt+t+^@^@^@

^@Some Quotations from the Universal Library^M1 Famous Quotes^M1.1 By William I^M[2, Sonnet XVIII]^MShall I compare thee to a summer's day?^MThou art more lovely and more temperate.^MRough winds do shake the darling buds of May,^MAnd summer's lease hath all too short a date.^MSometime too hot the eye of heaven shines,^MAnd often is his gold complexion dimmed.^MAnd every fair from fair some declines,^MBy chance or nature's changing course untrimmed.^MBut thy eternal summer shall not fade,^MNor lose possession of that fair thou owest,^MNor shall Death brag thou wander'st in his shade^MWhile in eternal lines to time thou growest.^MSo long as men can breathe, or eyes can see,^MSo long live this, and this gives life to thee.^M1.2 ^M[2] W. Shakespeare. The Sonnets of Shakespeare.609.^M^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@

Page 40: COMS E6125 Web-enHanced Information Management (WHIM)

The Message in the Bottle, 2nd try

\documentclass{article}
\begin{document}
\title{Some Quotations from the Universal Library}
...
\section{Famous Quotes}
\subsection{By William I}
\textbf{\cite[Sonnet XVIII]{shakespeare-sonnets-1609}}
\begin{verse}
Shall I compare thee to a summer's day?\\
Thou art more lovely and more temperate. \\
Rough winds do shake the darling buds of May, \\
…
\end{verse}
\bibliographystyle{abbrv}
\bibliography{msg}
\end{document}

Page 41: COMS E6125 Web-enHanced Information Management (WHIM)

The Message in the Bottle, finally

<?xml version="1234.56"?>
<universal_library>
  <books>
    <book>
      <title>Some Quotations from the Universal Library</title>
      <section>
        <title>Famous Quotes</title>
        <subsection>
          <title>By William I</title>
          <quote bibref="shakespeare-sonnets-1609">
            <title>Sonnet XVIII</title>
            <verse>
              <line>Shall I compare thee to a summer's day?</line>
              <line>Thou art more lovely and more temperate.</line>
              <line>Rough winds do shake the darling buds of May,</line>
              …
            </verse>
          </quote>
        </subsection>
      </section>
    </book>
    …
  </books>
</universal_library>

Page 42: COMS E6125 Web-enHanced Information Management (WHIM)

XML as a Self-Describing Data Exchange Format

• Someone from the 1970s receives the message in the virtual bottle, and it …
• … can be easily “understood” (even using CP/M & edlin)
• … can be parsed easily
• … allows the application programmer to rediscover schema and semantics (sort of…)
• … may include an explicit schema description
• … allows separation of marked-up content from presentation

Page 43: COMS E6125 Web-enHanced Information Management (WHIM)

XML Anatomy

<bibliography>
  <paper ID="goto">
    <authors>
      <author>Edsger W. Dijkstra</author>
    </authors>
    <title>Go To Statement Considered Harmful</title>
    <booktitle>Communications of the ACM</booktitle>
    <year>1968</year>
    <fullPaper source="harmful"/>
  </paper>
</bibliography>

Callouts on the slide:
• element name
• element
• element content
• attribute name (e.g., ID)
• attribute value (e.g., "goto"; attributes cannot contain elements)
• empty element (e.g., <fullPaper source="harmful"/>)
• character content
• number content (e.g., 1968)
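
The parts labeled on this slide map directly onto what a parser hands back. A minimal sketch with Python's standard xml.etree.ElementTree, using the slide's bibliography example (straight quotes substituted for the slide's typographic ones):

```python
import xml.etree.ElementTree as ET

BIB = """<bibliography>
  <paper ID="goto">
    <authors><author>Edsger W. Dijkstra</author></authors>
    <title>Go To Statement Considered Harmful</title>
    <booktitle>Communications of the ACM</booktitle>
    <year>1968</year>
    <fullPaper source="harmful"/>
  </paper>
</bibliography>"""

root = ET.fromstring(BIB)
paper = root.find("paper")

print(paper.tag)                      # element name
print(paper.attrib)                   # attribute names and values
print(paper.findtext("title"))        # character content
year = int(paper.findtext("year"))    # "number content" arrives as text
full = paper.find("fullPaper")        # empty element: no text, no children
print(year, full.text, len(full), full.get("source"))
```

Note that XML itself carries no type information: the year is just characters until the application (or a schema) says otherwise.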

Page 44: COMS E6125 Web-enHanced Information Management (WHIM)

Perspectives on XML

• Document (SGML) Community
  – data = linear text documents
  – markup (annotate) text to describe context, structure, semantics
• Database Community
  – XML as a prominent example of the semi-structured data model
  – captures the whole spectrum from highly structured, regular data to unstructured data

“XML is the cure for your data exchange, information integration, e-commerce, … problems” (also cures baldness, lose 28 pounds in 14 days, get rich quick, …)

Page 45: COMS E6125 Web-enHanced Information Management (WHIM)

Pure XML - Instance Model

<A> <B>foo</B> <C>bar</C> <C>psl</C> </A>

[Figure: the same instance drawn two ways, as a labeled tree (root A with children B, C, C holding "foo", "bar", "psl") and as nested boxes (A: containing B: "foo", C: "bar", C: "psl"); children are ordered]

• XML 1.0 implicit data model (infoset):
  – nested containers ("boxes within boxes")
  – labeled ordered trees (= semistructured data model)
  – relational, object-oriented easy to encode
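
The "labeled ordered tree" reading of the slide's instance can be made explicit with a short Python sketch. The function name to_tree and the (label, content) tuple encoding are choices made for this example, not part of any XML specification.

```python
import xml.etree.ElementTree as ET


def to_tree(el):
    """Return (label, content): text for a leaf, else the ordered list of children."""
    children = [to_tree(child) for child in el]
    return (el.tag, children if children else el.text)


tree = to_tree(ET.fromstring("<A><B>foo</B><C>bar</C><C>psl</C></A>"))
print(tree)
```

The list preserves document order, so the two C children stay distinguishable by position even though they share a label.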

Page 46: COMS E6125 Web-enHanced Information Management (WHIM)

Identifying Vocabularies

• My element may not be your element:
  – geometry context: <element>line</element>
  – chemistry context: <element>oxygen</element>

Page 47: COMS E6125 Web-enHanced Information Management (WHIM)

Identifying Vocabularies

• An XML Schema (with XML 1.1) defines a vocabulary of names of type definitions, element and attribute declarations [Schema ~= new improved DTD]
• Use XML Namespaces (with XML 1.1) to identify which vocabulary
  – Simple method for qualifying element and attribute names used in XML documents
  – Useful when a single XML document contains elements and attributes that are defined for and used by multiple software modules

Page 48: COMS E6125 Web-enHanced Information Management (WHIM)

Namespace Scoping

• XML namespaces are declared with an xmlns attribute, which can associate a prefix with the namespace
• The declaration is in scope for the element containing the attribute and all its descendants

<html:html xmlns:html='http://www.w3.org/1999/xhtml'>
  <html:head>
    <html:title>Frobnostication</html:title>
  </html:head>
  <html:body>
    <html:p>Moved to
      <html:a href='http://frob.example.com'>here.</html:a>
    </html:p>
  </html:body>
</html:html>

Page 49: COMS E6125 Web-enHanced Information Management (WHIM)

Namespace Defaulting

<?xml version="1.1"?>
<!-- elements are in the HTML namespace, in this case by default -->
<html xmlns='http://www.w3.org/1999/xhtml'>
  <head>
    <title>Frobnostication</title>
  </head>
  <body>
    <p>Moved to <a href='http://frob.example.com'>here</a>.</p>
  </body>
</html>
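
A namespace-aware parser treats the prefixed form and the defaulted form as the same names. The Python sketch below shows this with xml.etree.ElementTree, which reports each element as {namespace-URI}local-name. The documents are shortened versions of the slides' examples, and the XML declaration is left out.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

prefixed = ET.fromstring(
    "<html:html xmlns:html='http://www.w3.org/1999/xhtml'>"
    "<html:body><html:p>Moved</html:p></html:body></html:html>"
)
defaulted = ET.fromstring(
    "<html xmlns='http://www.w3.org/1999/xhtml'>"
    "<body><p>Moved</p></body></html>"
)


def names(root):
    """Expanded names of all elements, in document order."""
    return [el.tag for el in root.iter()]


print(names(prefixed))
print(names(prefixed) == names(defaulted))
```

The prefix is only a local abbreviation; what identifies the vocabulary is the URI.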

Page 50: COMS E6125 Web-enHanced Information Management (WHIM)

Multiple Namespaces

<bk:book xmlns:bk='urn:loc.gov:books'
         xmlns:isbn='urn:ISBN:0-395-36341-6'
         xmlns:money='urn:Finance:AllAboutMoney'>
  <bk:title>Cheaper by the Dozen</bk:title>
  <isbn:number>1568491379</isbn:number>
  <bk:price money:currencySymbol="$">99.99</bk:price>
</bk:book>

All element types are prefixed

Page 51: COMS E6125 Web-enHanced Information Management (WHIM)

Namespace Defaulting with Multiple Namespaces

<book xmlns='urn:loc.gov:books'
      xmlns:isbn='urn:ISBN:0-395-36341-6'>
  <title>Cheaper by the Dozen</title>
  <isbn:number>1568491379</isbn:number>
</book>

Unprefixed element types are from books

Page 52: COMS E6125 Web-enHanced Information Management (WHIM)

Nested Scoping

<?xml version="1.1"?>
<!-- initially, the default namespace is "books" -->
<book xmlns='urn:loc.gov:books'
      xmlns:isbn='urn:ISBN:0-395-36341-6'>
  <title>Cheaper by the Dozen</title>
  <isbn:number>1568491379</isbn:number>
  <notes>
    <!-- make HTML the default namespace for some commentary -->
    <p xmlns='urn:w3-org-ns:HTML'>
      This is a <i>funny</i> book!
    </p>
  </notes>
</book>
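
How the inner default namespace overrides the outer one can be checked by asking a parser which namespace each element ended up in. A minimal Python sketch follows; the helper split_name is invented for the example, and the XML declaration is omitted.

```python
import xml.etree.ElementTree as ET

DOC = """<book xmlns='urn:loc.gov:books'
      xmlns:isbn='urn:ISBN:0-395-36341-6'>
  <title>Cheaper by the Dozen</title>
  <isbn:number>1568491379</isbn:number>
  <notes>
    <p xmlns='urn:w3-org-ns:HTML'>This is a <i>funny</i> book!</p>
  </notes>
</book>"""


def split_name(tag):
    """Split ElementTree's '{uri}local' form into (uri, local)."""
    uri, _, local = tag[1:].partition("}")
    return uri, local


# Map each local element name to the namespace it resolved to.
resolved = {}
for el in ET.fromstring(DOC).iter():
    uri, local = split_name(el.tag)
    resolved[local] = uri

for local, uri in resolved.items():
    print(f"{local}: {uri}")
```

The i element carries no declaration of its own; it inherits the HTML default from its parent p, while notes, outside that scope, stays in the books namespace. None of these URIs is ever fetched by the parser.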

Page 53: COMS E6125 Web-enHanced Information Management (WHIM)

How to Define the Actual Namespace

• W3C namespace specification doesn’t say (!)
• A namespace doesn’t actually have to exist as a physical or conceptual entity
• All that is needed is a qualifier, the XML namespace URI, that, in combination with an element type or attribute name, creates a universal (and universally unique) name
• In other words, there doesn’t actually have to be a definition or anything else at that URI

Page 54: COMS E6125 Web-enHanced Information Management (WHIM)

XML Namespaces

• Allows mixing of different tag vocabularies

• Only identifies the vocabulary (lexicon)

• Additional mechanisms required for structure and meaning of tags

Page 55: COMS E6125 Web-enHanced Information Management (WHIM)

Processing XML

• Non-validating parser:
  – checks that XML doc is syntactically well-formed
• Validating parser:
  – checks that XML doc is also valid with respect to a given XML Schema (or, historically, DTD)

Page 56: COMS E6125 Web-enHanced Information Management (WHIM)

Processing XML

• Tree representation:
  – Document Object Model (DOM) API
  – Cursor APIs, e.g., .NET’s XPathNavigator, Java StAX
• Stream of events representation:
  – Push Model, e.g., Simple API for XML (SAX)
  – Pull Model, e.g., Common API for XML Pull Parsing (XmlPull)
• Others

Page 57: COMS E6125 Web-enHanced Information Management (WHIM)

Document Object Model

• Object-oriented approach to traversing the XML document as a tree

• Typically loads the entire XML document into memory (random access but memory intensive)

• Provides mechanisms for loading, saving, accessing, querying, modifying, and deleting nodes from an XML document

Page 58: COMS E6125 Web-enHanced Information Management (WHIM)

DOM API

• Hierarchy of Node objects mapping to XML concepts: document, element, attribute, processing instruction, comment, …
• Language-independent API:
  – get first/last child, previous/next sibling, set of nodes
  – insert before/after, replace
  – getElementsByTagName
• W3C DOM offers fairly limited functionality, so implementations often add helper method extensions
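
The operations listed above look like this in one DOM binding, Python's standard xml.dom.minidom. The sample document and the note element are made up for the example.

```python
from xml.dom import minidom

# The whole document is parsed into an in-memory tree of Node objects.
doc = minidom.parseString(
    "<books><book><title>Cheaper by the Dozen</title></book></books>"
)

book = doc.getElementsByTagName("book")[0]
title = book.firstChild                       # navigate: first child of <book>
print(title.tagName, title.firstChild.data)   # element name and its text node

# Modify the tree: insert a new element before <title>.
note = doc.createElement("note")
note.appendChild(doc.createTextNode("added via DOM"))
book.insertBefore(note, title)
after_insert = book.toxml()
print(after_insert)

# ...and delete it again.
book.removeChild(note)
after_remove = book.toxml()
print(after_remove)
```

Random access and in-place updates are straightforward because every node is an object in memory, which is also why this approach is memory intensive for large documents.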

Page 59: COMS E6125 Web-enHanced Information Management (WHIM)

Push Model

• XML producer (typically an XML parser) controls the pace of the application and informs the XML consumer when certain events occur (e.g., reports events when encountering begin/end tags)
• XML consumer registers callbacks with the producer, which invokes the callbacks as various parts of the XML document are seen (as events are reported)
• Does not necessarily build a parse tree
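
A minimal push-model sketch using Python's standard xml.sax binding of SAX. The handler class TitleCollector and the sample document are invented for the example. The parser drives the program and calls back into the handler, which has to keep its own state (here, the current depth and whether it is inside a title).

```python
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Consumer: collects <title> text and tracks nesting depth by hand."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0
        self.in_title = False
        self.titles = []

    def startElement(self, name, attrs):
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)
        if name == "title":
            self.in_title = True
            self.titles.append("")

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self.in_title:
            self.titles[-1] += content

    def endElement(self, name):
        if name == "title":
            self.in_title = False
        self.depth -= 1


handler = TitleCollector()
xml.sax.parseString(
    b"<books><book><title>Cheaper by the Dozen</title></book></books>", handler
)
print(handler.titles, handler.max_depth)
```

No tree is built; once an event has been reported, the parser keeps nothing about it.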

Page 60: COMS E6125 Web-enHanced Information Management (WHIM)

Push Model Pro

• The entire XML document does not need to be stored in memory, only the information about the node currently being processed is needed
• This makes it possible to process large XML documents without incurring massive memory costs
• Can also process XML streams whose contents arrive over time
• Allows consumer to ignore less interesting data

Page 61: COMS E6125 Web-enHanced Information Management (WHIM)

Push Model Con

• Certain context and state information such as the parents of the current node or its depth in the XML tree must be tracked by the programmer
• Limited expressive power (query/update) when working on streams
• To register callbacks one needs to create a class devoted to handling events from the producer
• Many developers find callbacks to be an unintuitive way to control program flow

Page 62: COMS E6125 Web-enHanced Information Management (WHIM)

Pull Model

• XML Consumer controls the program flow by requesting events from the XML producer as needed
• Operates in a forward-only, streaming fashion while only showing information about a single node at any given time
• Programmer creates a loop that continually reads from the XML document until the end of the document is reached, but acts solely on items of interest as they are seen
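
The loop described in the last bullet can be sketched with Python's standard xml.dom.pulldom, used here as a stand-in for pull APIs such as XmlPull. The sample document and its second title are made up for the example.

```python
from xml.dom import pulldom

DOC = (
    "<books>"
    "<book><title>Cheaper by the Dozen</title><price>99.99</price></book>"
    "<book><title>Another Title</title><price>5.00</price></book>"
    "</books>"
)

titles = []
events = pulldom.parseString(DOC)
for event, node in events:          # the consumer asks for the next event
    if event == pulldom.START_ELEMENT and node.tagName == "title":
        events.expandNode(node)     # pull in just this element's subtree
        titles.append("".join(child.data for child in node.childNodes))
    # every other event (prices, end tags, ...) is simply skipped

print(titles)
```

Control flow is an ordinary loop in the consumer, with no handler class or callbacks.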

Page 63: COMS E6125 Web-enHanced Information Management (WHIM)

Pull Model Comparison

• As memory efficient as push model processing but with a more familiar programming model
• Does not require a specialized class for handling XML processing to implement specific interfaces or subclass certain classes to register callbacks
• The need to explicitly track application states using boolean flags and similar variables is significantly reduced

Page 64: COMS E6125 Web-enHanced Information Management (WHIM)

XML Cursors

• Cursor acts like a lens that focuses on one XML node at a time, but, unlike pull-based or push-based APIs, the cursor can be positioned anywhere along the XML document at any given time
• Allows one to navigate, query, and manipulate an XML document loaded in memory
• Does not require the heavyweight interface of a traditional tree model API, where every significant token in the underlying XML must map to an object
• Can create XML views of non-XML data

Page 65: COMS E6125 Web-enHanced Information Management (WHIM)

Other Alternatives

• Object to XML Mapping APIs
  – Represent nodes and text as classes and programming language primitives
  – Cannot represent all XML information with full fidelity, e.g., lose processing instructions and comments, element ordering
  – Impedance mismatches between XML Schema and object-oriented concepts
• XML-specific languages
  – XPath, XQuery, XSLT, …
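
As a small taste of the XML-specific languages, Python's xml.etree.ElementTree accepts a limited subset of XPath (not full XPath, and no XQuery or XSLT). The second paper entry in the sample data is invented for the example.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""<bibliography>
  <paper ID="goto">
    <title>Go To Statement Considered Harmful</title><year>1968</year>
  </paper>
  <paper ID="sample">
    <title>A Made-Up Entry</title><year>1970</year>
  </paper>
</bibliography>""")

# Declarative selection: papers whose <year> child has the text 1968.
titles_1968 = [p.findtext("title") for p in root.findall("./paper[year='1968']")]

# Selection by attribute value, anywhere below the root.
goto = root.find(".//paper[@ID='goto']")

print(titles_1968, goto.findtext("year"))
```

The path expression states what to select and leaves the traversal to the library, in contrast to the hand-written loops of the DOM, push, and pull styles.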

Page 66: COMS E6125 Web-enHanced Information Management (WHIM)

Summary

• Webpages intended for human audience usually written in HTML, where descriptive markup is interpreted by browser
• Webpages intended for machine processing (other than browser) usually written in some XML vocabulary understood by both the producer and the consumer

Page 67: COMS E6125 Web-enHanced Information Management (WHIM)

Second Assignment: Revised Paper Proposal

• Due Monday February 18th at 5pm

• Maximum three pages (not including figures, if any), plus references (required)

• Plan and outline your paper (which will be ~15 pages)

• See http://york.cs.columbia.edu/classes/cs6125/revised_paper_proposal.htm

Page 68: COMS E6125 Web-enHanced Information Management (WHIM)

Revised Paper Proposal

• Each full paper should have title, author, abstract (~200 words), introduction, body sections, conclusions, bibliography (cited references)

• The point of this assignment is to determine what will be in those sections

• Assume a reader who is taking the class but may not know anything at all about your specific topic

Page 69: COMS E6125 Web-enHanced Information Management (WHIM)

Revised Paper Proposal: Introduction and Conclusion

• What is your topic?

• What is the problem being addressed?

• What is the solution, or design space of solutions, proposed or actualized?

• What is your argument?

• What is your point of view?

• What is the opposing point of view?

Page 70: COMS E6125 Web-enHanced Information Management (WHIM)

Revised Paper Proposal: Body Sections

• What sections? (usually 3-5)

• What subsections? (perhaps down to subsubsections)

• Motivate your literature reading to fill those sections

• Full paper will be due March 14th

Page 71: COMS E6125 Web-enHanced Information Management (WHIM)

A Note about Citations and Bibliographic References

• References should be cited in the text like this “Kaiser said blah blah [1]” or this “[Kai07] describes mumble”

• Bibliography entry should appear something like this:

[Kai07] Gail Kaiser, COMS E6125 Web-enHanced Information Management, Columbia University Department of Computer Science, 2007, http://york.cs.columbia.edu/classes/cs6125/.

Page 72: COMS E6125 Web-enHanced Information Management (WHIM)

Second Assignment: Logistics

• Due Monday February 18th by 5pm

• Maximum three pages when printed (not including optional figures and required reference list)

• Submit by posting in Revised Paper Proposal folder on CourseWorks

• Must be in a format I can read, which means PDF, Word, PowerPoint, HTML, or plain ASCII text (with all figures embedded or viewable in an ordinary browser)

Page 73: COMS E6125 Web-enHanced Information Management (WHIM)

Heads Up on Project

• Preliminary Proposal due Monday March 10th (note this is before the full paper)

• Optionally work in teams (see http://york.cs.columbia.edu/classes/cs6125/team_advice)

• Build a new system or extend an existing system – submit code, demo system

• OR evaluate/compare one or more existing system(s) – submit procedures and findings, show system(s)

• You may "continue" your paper topic towards the project, or do something entirely different

Page 74: COMS E6125 Web-enHanced Information Management (WHIM)

Heads Up on Presentation

• Individual ~10 minute talk in class during one of last few class sessions

• No proposal, just do it

• May be based on paper, project, or some other topic (in the case of team members all presenting on the same project, please coordinate to avoid redundancy and discuss your plans with the instructor in advance)

Page 75: COMS E6125 Web-enHanced Information Management (WHIM)

Reminders

• Class participation is important! (10% corresponds to a whole letter grade)

• Revised paper proposal due February 18th

• Preliminary project proposal due March 10th

• Paper must be individual, projects may optionally be done in teams

Page 76: COMS E6125 Web-enHanced Information Management (WHIM)

COMS E6125 Web-enHanced Information Management (WHIM)

Prof. Gail Kaiser

Spring 2008