SGML, HTML, XML: Do We Really Need All That? ISMT Multimedia Fall 2002 Dr Vojislav B Mišić.

Post on 19-Dec-2015


Lecture Overview What is a markup language? HTML markup: what’s good, what’s wrong Extensions to HTML (dHTML and style sheets, XML and

XSL, …) XML

Basic elements Well-formed vs. valid XML Writing a DTD Examples of XML

Markup languages What is markup?

Text (actual contents of the document) is interspersed with markings

Markup is related to the text notes on the content notes on text presentation but virtually anything can be marked (remember Fermat’s

last theorem?) Markup language allows separation of concerns: content

vs. presentation

Standards for markup
SGML (ISO 8879, derived from IBM's GML) – a standardized way to write other markup languages (actually, a meta-language)
An SGML-based language is specified using a DTD (Document Type Definition)
SGML is not really a user-friendly language, hence its use was rather limited, even though software support for it does exist

Other markup languages

TeX (Knuth) is another widely used markup language Performs extremely well for complex texts with

mathematical formulas and symbols cross-references different typefaces foreign language

A TeX example

\begin{equation}\label{coh1}
\Psi (S) = \displaystyle \frac{\displaystyle \sum_{x \in R (S)} \left( \# S_w (x) - 1 \right)}
  {\displaystyle \sum_{x \in R (S)} \left( \# S - 1 \right)}
\end{equation}

HTML
HTML (HyperText Markup Language) is the language of the World Wide Web
Allows platform-independent browsing
Text-only at first, media later
Hyperlinks, limited visual formatting
However, it is far from perfect, and is gradually being replaced (current version: 4.01)

HTML markup
First you write the text, then add appropriate markup tags
Tags can describe logical entities
  Headings of different levels: H1, H2, …
  Lists and list elements (UL, OL, LI)
But tags can also describe visual effects (display rendering)
  Bold and italic text (B, I)
  Font and typeface changes

If you make an error… Anything not recognized as correct HTML is essentially

ignored HTML browser just treats it as plain text and displays it

directly In this manner, users are still able to see most of the

source, albeit without proper formatting Your opinion: is this good or bad?
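The tolerant behaviour described above can be imitated with a lenient parser. The following is only a sketch using Python's standard html.parser module, not a model of any real browser; the class name TextCollector, the made-up tag BLINKY, and the sample input are assumptions introduced for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text and tag names while tolerating malformed markup."""

    def __init__(self):
        super().__init__()
        self.tags = []   # every start tag seen, whether HTML defines it or not
        self.text = []   # every non-empty run of character data

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


# Hypothetical broken input: BLINKY is not an HTML tag and <B> is never closed.
broken_html = "<H1>Title</H1><BLINKY>Still shown</BLINKY><B>unclosed bold"

collector = TextCollector()
collector.feed(broken_html)
collector.close()

# No exception is raised; the text survives even though the markup is wrong.
visible_text = " ".join(collector.text)
print(visible_text)
```

The parser never complains, so the reader still gets the words, which is the trade-off the slide asks you to judge.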

HTML editing HTML source is ASCII and essentially layout independent

Plain text editors can be used You can put extra white space to your heart’s content, with no

effect on what is displayed by the browser Most browsers allow you to view and save the HTML

source of the document displayed – the quickest way to learn HTML

HTML is interpreted – editing changes are displayed (almost) instantly

HTML on the Internet HTML browsers can display graphics and other media

objects Although HTML by itself provides only the most primitive

support for multimedia Tags can specify target URLs (hyperlinks) Error tolerance ensures that anyone with a browser (any

browser) can access HTML documents … all of which made HTML the language of choice for

hypertext on the Internet

More HTML features Visual formatting is allowed but not forced

you can specify a typeface, but the browser will substitute another one of its own choice if the one specified is not available

User can easily change the presentation just resize window and select different fonts/sizes

Browser differences (IE vs. Navigator) – actually, not very important any more

HTML Interactivity
Interactivity at first limited to hyperlinks
Forms introduced later (Navigator 3)
Form support still limited; most often client- or server-side scripting is required
Proliferation of scripting languages
  CGI scripts
  JavaScript and JScript (more details later)
  VBScript, ASP
  Perl

Is HTML a Good Markup Language? Logical and visual formatting capabilities together

Some people argue for cleaner separation of logical from visual formatting

Others want more author control Many extensions (some proprietary) Changes generally lean towards greater author control

over document rendering – more direct formatting instructions included

Dynamic HTML Commercial term – there is no such thing as a dHTML

standard Combination of HTML with new technologies

Stylesheets add greater author control Scripting allows improved interactivity, including user input Even simple animations are possible

As always, not quite compatible extensions by Microsoft and Netscape

HTML styles In standard HTML, logical markup tags (such as <H1>)

have predefined properties for Typeface Font size Mode Line spacing

Properties cannot be changed, and we cannot define our own tags

The only way is to use a (possibly way too long) sequence of appropriate primitive tags every time – not a very convenient solution

Stylesheets to the rescue
Cascading Style Sheets (CSS): cleaner separation of presentation from actual content
Style: a named set of properties that define presentation of a chunk of text (character, paragraph, …)
Styles are present in text processing software (WinWord) but in some markup languages as well (TeX)
CSS is used with HTML, but it’s not HTML, although browsers know how to handle them together

CSS Syntax
A CSS-compatible stylesheet contains a set of rules, each with a selector (name), a number of properties and their values
Rules can be
  Inline (within an HTML tag, in document body)
  Embedded (in the head of an HTML document)
  External, in a separate file which is then linked or imported into an HTML document
Position of the rule defines the scope of its effect on the document

CSS Selectors HTML selectors – text portions of HTML tags Class selectors – can be applied to any HTML tag ID selectors – usually applied only once per page to a

particular HTML tag Type of HTML tag defines the scope of CSS properties

Block level (DIV, LI, H1) Inline (B, FONT, TT) Replaced tags (IMG)

CSS Properties Always of the form property:value; Categories of properties control

Typefaces (fonts, size, mode) Text (kerning, leading, alignment) Lists (bullets, indentation) Colors (borders, text, rules, background) Margins Positioning of individual elements

CSS Rule with a HTML selector Effective redefinition of HTML tags, e.g.:

B { font: bold 18pt times,serif; text-decoration: underline; }

Redefines the <B> (boldface) tag throughout the rest of the document

Don’t forget to close the brace!
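To make the selector / property:value; structure concrete, here is a rough Python sketch that splits one simple rule into its parts. It is not a CSS parser: the function name parse_rule is invented for this example, it handles a single flat rule only, and the sample rule uses the standard `font` shorthand property.

```python
def parse_rule(rule):
    """Split one simple CSS rule into (selector, {property: value})."""
    selector, _, body = rule.partition("{")
    body = body.rsplit("}", 1)[0]           # drop the closing brace
    properties = {}
    for declaration in body.split(";"):
        if ":" not in declaration:
            continue                         # skip the empty piece after the last ';'
        name, _, value = declaration.partition(":")
        properties[name.strip()] = value.strip()
    return selector.strip(), properties


selector, properties = parse_rule(
    "B { font: bold 18pt times,serif; text-decoration: underline; }"
)
print(selector, properties)
```

Each declaration ends up as one dictionary entry, which mirrors how a rule is a named set of properties and values.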

CSS Rule with a class selector Independent style, applicable to any HTML tag:

.extra { font-size: 28pt; }

.huge { font-size: 48pt; }

Class selector must be referred to within the HTML tag:

<B class="extra">Extra</B><B class="huge">HUGE</B>

CSS Rule with a class selector May be linked to a specific HTML tag:

p.mini { font-size: 8pt; }
p.big { font-size: 14pt; }

Class selector may be applied to this HTML tag only:

<P class="mini">mini</P><P class="big">BIG</P>

CSS Rule with an ID selector Another independent style, applicable to any HTML tag:

#area1 { position: relative; margin-left: 9em; color: red; }

ID is specified within the HTML tag:

<SPAN ID="area1"> ... </SPAN>

More on CSS selectors Several CSS selectors may share the same definition, and

individual selectors may get additional properties separately

CSS rules can refer to tags nested within other tags, e.g.,

P B { background: pink; }

redefines the <B> tag only when encountered within the <P> tag

Adding CSS to your document Within a style container in the document head:

<HEAD>
<STYLE TYPE="text/css">
<!--
  CSS rules go here
-->
</STYLE>
</HEAD>

HTML comment tags hide the CSS rules from non-CSS browsers

Importing CSS into your document Create a separate file, stylefile.css, then write

<HEAD><LINK REL="stylesheet" TYPE="text/css" HREF="stylefile.css"></HEAD>

Several files may be added in this manner

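More on CSS
CSS has no single-line // comments: every comment, short or multiline, goes between a matched pair of /* and */
A stylesheet file may import another stylesheet file with the statement

@import url(stylefile.css);

Rules from linked, imported, and embedded stylesheets are combined (the "cascading" in the name CSS)
But: when rules of equal specificity conflict, the last rule listed wins!
Also: beware of browser differences!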

More CSS capabilities
Font selection
Text control
List properties
Background properties
Absolute and relative positioning (but this is very dangerous!)
Visibility (which probably has little use by itself, but it can be quite useful when changed through appropriate scripts)
Stacking (vertical) order

Document Object Model

DOM describes the structure of an HTML document as a hierarchy

This allows a script written in a suitable language to access and manipulate only selected elements within that document

document.images.b1.src="button_on.gif" describes a path from root or top (which is the document itself) to a particular element – an image file

Then, a script can manipulate this element (e.g., hide, show, replace, move, …) in response to certain events
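The same idea, following a path from the document root to one element and then changing it, can be tried outside a browser. This is a sketch with Python's standard xml.dom.minidom rather than a browser DOM; the page fragment, the second image, and the helper find_image are assumptions made for the example.

```python
from xml.dom import minidom

# Hypothetical page fragment, written as well-formed XML so minidom can parse it.
page = """<html>
  <body>
    <img name="b1" src="button_off.gif"/>
    <img name="b2" src="logo.gif"/>
  </body>
</html>"""

document = minidom.parseString(page)


def find_image(doc, name):
    """Search the tree below the document root for the <img> with this name."""
    for image in doc.getElementsByTagName("img"):
        if image.getAttribute("name") == name:
            return image
    return None


# Roughly what document.images.b1.src="button_on.gif" does in a browser script.
button = find_image(document, "b1")
button.setAttribute("src", "button_on.gif")
print(button.toxml())
```

Only the selected element changes; the rest of the tree, including the second image, is left alone.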

XML
eXtensible Markup Language: a simplified (easier, more consistent) version of SGML
XML-compliant languages are defined with appropriate DTDs
XML parsers signal syntax errors (unlike HTML browsers), so the use of authoring tools is implied
Current uses (with more to follow)
  SMIL for synchronized multimedia
  RDF (Resource Description Framework) for describing resources

What is XML? A method for putting structured data in a text file Data stored on disk can be in binary or text format

Binary formats are often more concise Text format allows human inspection

XML is a set of rules/guidelines/conventions for designing text formats for such data, to produce files that are Easy to generate and read (by a computer) Unambiguous and platform-independent Extensible, easy to localize/internationalize

XML looks like HTML but isn't HTML XML makes use of

tags (words bracketed by '<' and '>') and attributes (of the form name="value")

HTML specifies what each tag & attribute means (and often how the text between them will look in a browser)

XML uses the tags only to delimit pieces of data – and leaves the interpretation to the application

XML is text, but isn't meant to be read XML files are text files, but they are not made for human

readers Text format allows experts (such as programmers) to more

easily debug applications Text format allows the use of a simple text editor to fix a

broken XML file Rules for XML files much stricter than for HTML Applications are not allowed to try to second-guess the

creator of a broken XML file – if the file is broken, just stop and issue an error message
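The stop-and-report rule is easy to observe with any conforming parser. A minimal sketch using Python's standard xml.etree.ElementTree follows; the helper name parse_or_report and the NOTE/TO sample elements are made up for the example.

```python
import xml.etree.ElementTree as ET


def parse_or_report(text):
    """Return the root element, or None after reporting the parser's error."""
    try:
        return ET.fromstring(text)
    except ET.ParseError as error:
        # The parser does not guess what the author meant; it just stops.
        print(f"Rejected: {error}")
        return None


good = parse_or_report("<NOTE><TO>Ann</TO></NOTE>")
bad = parse_or_report("<NOTE><TO>Ann</NOTE>")   # <TO> is never closed
```

Compare this with a browser displaying broken HTML anyway: here the application receives nothing at all from the broken file.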

XML is verbose, but that is not a problem XML is a text format and uses tags to delimit the data Therefore, XML files are nearly always larger than

comparable binary formats But disk space isn't as expensive anymore as it used to be,

and compression/decompression can be fast and reliable Communication protocols can compress data on the fly,

thus saving bandwidth as effectively as a binary format
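How much the repeated tags shrink under compression can be checked directly. The sketch below uses Python's standard zlib module on an invented data set (200 ITEM records); the exact sizes depend on the data, so no figures are claimed here.

```python
import zlib

# Hypothetical data set: 200 records wrapped in repetitive tags.
records = "".join(
    f"<ITEM><ID>{i}</ID><PRICE>57.80</PRICE></ITEM>" for i in range(200)
)
xml_bytes = f"<ITEMS>{records}</ITEMS>".encode("utf-8")

compressed = zlib.compress(xml_bytes)
restored = zlib.decompress(compressed)     # lossless round trip

ratio = len(compressed) / len(xml_bytes)
print(f"{len(xml_bytes)} bytes of XML, {len(compressed)} bytes compressed ({ratio:.0%})")
```

Because the tag names repeat in every record, they are exactly the kind of redundancy a general-purpose compressor removes.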

XML is … good XML is license-free XML is platform-independent XML is well-supported Choosing XML is a lot like choosing SQL

you still have to build your own database and your own programs/procedures that manipulate it

but there are many tools available and many people that can help you

XML isn't always the best solution, but it is always worth considering …

XML is a family of technologies
XML: the specification that defines what "tags" and "attributes" are
XLink describes a standard way to add hyperlinks to an XML file
CSS is applicable to XML as it is to HTML
XSL: an advanced language for style sheets (presentation and manipulation)
XSLT: a transformation language
SMIL: Synchronized Multimedia Integration Language
… and others

Well-formed vs. valid XML
Well-formed documents comply with XML well-formedness constraints, which require that
  Elements properly nest within each other
  Elements use other markup syntax correctly
XML allows you to use elements of your own naming: ESSAY, SECTION, PARAGRAPH, NOTE, IMPORTANT
… unlike HTML, which forces all documents into a fixed document type

Writing XML One, Two
XML Declaration: declares the nature of XML documents to document readers
  <?xml version="1.0" standalone="yes"?>
  <?xml version="1.0" standalone="no"?>
  <?xml version="1.0" encoding="UTF-8" standalone="no"?>
Root element: contains all other elements (i.e., the rest of the document)
Root element is synonymous with your document type
Root element cannot be repeated

An XML example

<?xml version="1.0" standalone="yes"?>
<TRIVIA>
  <MATH>
    <QUESTION>What is the square root of 25</QUESTION>
    <ANSWER>5</ANSWER>
  </MATH>
  <GENERAL>
    <QUESTION>What is the season after Summer</QUESTION>
    <ANSWER>Fall</ANSWER>
    <ANSWER>Autumn</ANSWER>
  </GENERAL>
</TRIVIA>
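Since XML only delimits the data, it is up to an application to decide what TRIVIA, QUESTION and ANSWER mean. One possible reading of the example, sketched with Python's standard xml.etree.ElementTree; the function name answers_by_category and the choice to treat each child of the root as a category are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET

trivia_xml = """<?xml version="1.0" standalone="yes"?>
<TRIVIA>
  <MATH>
    <QUESTION>What is the square root of 25</QUESTION>
    <ANSWER>5</ANSWER>
  </MATH>
  <GENERAL>
    <QUESTION>What is the season after Summer</QUESTION>
    <ANSWER>Fall</ANSWER>
    <ANSWER>Autumn</ANSWER>
  </GENERAL>
</TRIVIA>"""


def answers_by_category(root):
    """Map each category element name to (question, [accepted answers])."""
    result = {}
    for category in root:                       # MATH, GENERAL, ...
        question = category.findtext("QUESTION")
        answers = [a.text.strip() for a in category.findall("ANSWER")]
        result[category.tag] = (question, answers)
    return result


quiz = answers_by_category(ET.fromstring(trivia_xml))
for name, (question, answers) in quiz.items():
    print(name, question, answers)
```

A different application could read the same file and treat the two ANSWER elements as alternatives to rank rather than as equally acceptable answers.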

Rules for XML elements
All elements must have opening and closing (start and end) tags
  <MATH> ... </MATH>
The exception is an empty element, which may be written as a single self-closing tag
  <QUESTION ... />
Case matters: XML is case-sensitive
Proper tag nesting must be observed
You can add whitespace between tags to your heart’s content; applications usually ignore it

XML Writing Describe content with elements of your own naming Invent a new element each time you introduce content

that significantly differs from any previous More elements = greater control you will have later, when

you use it Add attributes to elements Attributes describe the content or behavior of elements

Another Example

<?xml version="1.0" standalone="yes"?>
<HELP>
  <TITLE>XML Help</TITLE>
  <QUERY area="XML">
    <QUESTION>Where do I start?</QUESTION>
    <ANSWER>Start with your root element. Break your document down into parts, fill them in, repeat.</ANSWER>
  </QUERY>
  <QUERY area="XML">
    <QUESTION>Are my element names well chosen?</QUESTION>
  </QUERY>
</HELP>
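Attributes such as area become useful once an application filters on them. A small sketch with Python's standard xml.etree.ElementTree; the second area value ("DTD"), its question, and the helper questions_in_area are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical variant of the HELP document with two different areas.
help_xml = """<HELP>
  <TITLE>XML Help</TITLE>
  <QUERY area="XML"><QUESTION>Where do I start?</QUESTION></QUERY>
  <QUERY area="DTD"><QUESTION>How do I declare an element?</QUESTION></QUERY>
</HELP>"""

root = ET.fromstring(help_xml)


def questions_in_area(root, area):
    """Return the question text of every QUERY whose area attribute matches."""
    return [
        query.findtext("QUESTION")
        for query in root.findall("QUERY")
        if query.get("area") == area
    ]


print(questions_in_area(root, "XML"))
```

The attribute describes the element without being part of its text content, so the filter never has to look inside QUESTION.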

XML Writing 4
Parsing: checking well-formedness

Well-formed:
  <PRICE>$57.80</PRICE>
  <PET><CAT type="Cornish Rex">Cat nests properly within PET.</CAT></PET>

Not well-formed:
  <WEATHER>Foggy                      no closing tag
  <LEVEL>Intermediate<LEVEL>          improper tag
  <PASSWORD>planetB612</PASSWD>       wrong spelling
  <DISTANCE TYPE=KM 120</DISTANCE>    missing closing bracket
  <CAR><engine>engine does not nest properly within CAR</CAR></engine>    improper nesting
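These fragments can be fed to a parser to see which ones it accepts. The sketch below uses Python's standard xml.etree.ElementTree, which checks well-formedness only (it does not validate against a DTD); the helper is_well_formed and the extra case-mismatch sample are additions for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """True if the parser accepts the fragment as a complete element."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


samples = [
    '<PRICE>$57.80</PRICE>',
    '<PET><CAT type="Cornish Rex">Cat nests properly within PET.</CAT></PET>',
    '<WEATHER>Foggy',                          # no closing tag
    '<LEVEL>Intermediate<LEVEL>',              # second tag lacks the slash
    '<PASSWORD>planetB612</PASSWD>',           # end tag spelled differently
    '<DISTANCE TYPE=KM 120</DISTANCE>',        # start tag never closed
    '<CAR><engine>text</CAR></engine>',        # improper nesting
    '<math>5</MATH>',                          # added sample: case mismatch
]

results = {fragment: is_well_formed(fragment) for fragment in samples}
for fragment, ok in results.items():
    print("ok " if ok else "bad", fragment)
```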

Valid XML Valid XML—unlike well-formed one—requires a Document

Type Definition DTD: a set of rules that a particular document type must

follow The rules state the name and contents of each element,

and the contexts in which a particular element can and must exist

DTD enables communication with databases Valid XML documents may be accompanied by style sheets

for proper presentation

What’s in a DTD
Two essential structures: the element and the attribute
Root element: contains all other elements
Contents of other elements defined recursively starting from the root, until you reach text-level elements, e.g.,
  <!ELEMENT NAME CONTENT>
Elements may have attributes, which are defined within the element definition, or separately, e.g.,
  <!ATTLIST ELEMENT-NAME NAME CDATA #IMPLIED>

Writing a DTD

<!ELEMENT novel (preface,chapter+,biography?,criticalessay*)>

<!ELEMENT preface (paragraph+)>

<!ELEMENT chapter (title,paragraph+,section+)>

<!ELEMENT section (title,paragraph+)>

<!ELEMENT biography (title,paragraph+)>

<!ELEMENT criticalessay (title,section+)>

<!ELEMENT paragraph (#PCDATA|keyword)*>

<!ELEMENT title (#PCDATA|keyword)*>

<!ELEMENT keyword (#PCDATA)>

DTD Declarations (1): Element type declaration
Each element type includes a name, content, and possibly a set of attributes
A document can contain many conforming elements of that type
Sequence: ordered list of components (,)
Choice: alternative components (|)
Components may be optional (?)
Components may be required and repeatable (+)
Components may be optional and repeatable (*)
Mixed-content declarations must include #PCDATA, parsed character data (i.e., text), as their first member
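The occurrence indicators behave much like regular-expression operators, which suggests a toy check. The sketch below is not a validating parser: it handles only a flat sequence model (no choice groups, no nested parentheses, no #PCDATA), and the names model_to_regex and matches_model are invented for this example. It tests the direct children of a novel element against the model from the DTD shown earlier.

```python
import re
import xml.etree.ElementTree as ET


def model_to_regex(model):
    """Turn a flat sequence model such as (a,b+,c?) into a regex over child names."""
    parts = []
    for item in model.strip("() ").split(","):
        item = item.strip()
        suffix = item[-1] if item[-1] in "+?*" else ""   # occurrence indicator
        name = item.rstrip("+?*")
        parts.append(f"(?:{re.escape(name)} ){suffix}")
    return re.compile("".join(parts))


def matches_model(element, model):
    """Check the element's direct children, in order, against the model."""
    children = "".join(child.tag + " " for child in element)
    return model_to_regex(model).fullmatch(children) is not None


NOVEL_MODEL = "(preface,chapter+,biography?,criticalessay*)"

ok = ET.fromstring("<novel><preface/><chapter/><chapter/><biography/></novel>")
missing_chapter = ET.fromstring("<novel><preface/><biography/></novel>")
wrong_order = ET.fromstring("<novel><chapter/><preface/></novel>")

for doc in (ok, missing_chapter, wrong_order):
    print(matches_model(doc, NOVEL_MODEL))
```

A real validating parser does this for every element, recursively, and also handles choices and mixed content.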

DTD Declarations (2): Attribute List Declarations
Much more variation here
String type attributes (CDATA): virtually unconstrained text strings
Enumeration attributes: require a list of options to pick from
Attribute defaults:
  #REQUIRED, required
  #IMPLIED, optional
  #FIXED "value", a fixed value
  "value", a default but overridable value
Usage:
  <ELEMENT-NAME NAME="value">

An Attribute List Example

<!ELEMENT MEMO (TO,FROM,SUBJECT,BODY,SIGN)>
<!ATTLIST MEMO importance (HIGH|MEDIUM|LOW) "LOW">
<!ELEMENT TO (#PCDATA)>
<!ELEMENT FROM (#PCDATA)>
<!ELEMENT SUBJECT (#PCDATA)>
<!ELEMENT BODY (P+)>
<!ELEMENT P (#PCDATA)>
<!ELEMENT SIGN (#PCDATA)>
<!ATTLIST SIGN signatureFile CDATA #IMPLIED
               email CDATA #REQUIRED>
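What the three kinds of attribute default mean for a document can be spelled out by hand. This sketch hard-codes the MEMO attribute rules instead of reading the DTD (a validating parser would read them from the declarations), checks attributes only, and uses shortened sample memos; the function check_memo and the e-mail address are assumptions of the example.

```python
import xml.etree.ElementTree as ET

# Copied by hand from the ATTLIST declarations for MEMO and SIGN.
MEMO_IMPORTANCE = ("HIGH", "MEDIUM", "LOW")
MEMO_IMPORTANCE_DEFAULT = "LOW"


def check_memo(memo):
    """Return (effective importance, list of attribute problems)."""
    problems = []
    # Enumerated attribute with a default: a missing value falls back to "LOW".
    importance = memo.get("importance", MEMO_IMPORTANCE_DEFAULT)
    if importance not in MEMO_IMPORTANCE:
        problems.append(f"importance={importance!r} is not an allowed value")
    # #REQUIRED attribute: email must be present on SIGN.
    # signatureFile is #IMPLIED, so its absence is not an error.
    sign = memo.find("SIGN")
    if sign is None or sign.get("email") is None:
        problems.append("SIGN is missing the required email attribute")
    return importance, problems


plain = ET.fromstring('<MEMO><SIGN email="ann@example.org">Ann</SIGN></MEMO>')
urgent = ET.fromstring('<MEMO importance="URGENT"><SIGN>Bob</SIGN></MEMO>')

print(check_memo(plain))
print(check_memo(urgent))
```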

XML Writing

Add an XML declaration
Valid XML documents must include the appropriate DTD
  either as a set of internal definitions,
    <!DOCTYPE NAME [ definitions ]>
  or as a reference to an external DTD file,
    <!DOCTYPE NAME SYSTEM "file">
  or both simultaneously
    <!DOCTYPE NAME SYSTEM "file" [ definitions ]>
DTD enables the parser to check validity of the document (errors are NOT permitted!)

Writing and Parsing Valid XML
First suggestion: use a specialized editor
  Lots of choices, some of which are free
Second suggestion: use a validating parser
  Again, lots of choices are available, mostly in Java, some in C++, Perl, JavaScript
  IE5 includes an XML parser (not quite up to the standard, yet)
XML interfaces to be included in standard DBMSs: Oracle, DB2, MS SQL Server

SMIL Synchronized Multimedia Integration Language based on XML specification, endorsed by W3C

http://www.w3.org/TR/PR-smil integration of a set of independent media objects into a

synchronized presentation enables authors to describe

temporal behavior of a presentation spatial layout of the presentation hyperlinks between media objects

Basic elements of a SMIL specification smil element can have an id attribute, and it can contain

body and head children elements head contains information not related to temporal behavior head can contain the following children: layout, switch

(but not both), and meta (zero or more) layout determines how the elements in the body are

positioned on an abstract rendering surface (audio or visual) if no layout is specified, the rendering is implementation

dependent Alternative layouts specified with a switch element

Basic elements (III) each element has an id and a type element type specifies the layout language used in the

layout element (default: text/smil-basic-layout) the default type information contains region and root-layout

elements non-default type information is simply character data

SMIL basic layout is a subset of the visual rendering model only positionable media object elements are controlled by

the SMIL basic layout

A region example

A text element is set to a 5 pixel distance from the top border of the rendering window:

<smil>
  <head>
    <layout>
      <region id="a" top="5" />
    </layout>
  </head>
  <body>
    <text region="a" .../>
  </body>
</smil>
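The link between a media object and its region is just an attribute lookup, which a player has to resolve before rendering. A sketch of that lookup with Python's standard xml.etree.ElementTree, not a SMIL player; the src value caption.txt stands in for the attributes elided on the slide, and the helper top_offset is invented for the example.

```python
import xml.etree.ElementTree as ET

smil_source = """<smil>
  <head>
    <layout>
      <region id="a" top="5"/>
    </layout>
  </head>
  <body>
    <text region="a" src="caption.txt"/>
  </body>
</smil>"""

root = ET.fromstring(smil_source)

# Index every region declared in the layout by its id.
regions = {region.get("id"): region for region in root.iter("region")}


def top_offset(media):
    """Pixels from the top border for a media object, or None if it has no region."""
    region = regions.get(media.get("region"))
    if region is None:
        return None
    return int(region.get("top", "0"))


text = root.find("body/text")
print(top_offset(text))
```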

Meta attributes define properties of a document each meta element specifies a single property/value pair

the list of properties is open-ended authoring tools should ensure that all meta elements have

a title with meaningful description information related to temporal and linking behavior of the

document Parallel/sequential playback of the children Complex synchronization possible Synchronization alternatives possible

Hyperlinking elements navigational links between elements links are unidirectional and single-headed SMIL supports name fragment identifiers and the '#'

connector (just like HTML – http://foo.com/some/path#anchor1)

the a element used as in HTML – associates a link with a complete media object only New link (presentation) can replace the old one New link (presentation) can be added to the old one New link (presentation) can pause the old one

Summary XML is “HTML done right” Widespread use in many areas: web publishing, document

processing, multimedia, B2B electronic commerce … Tools added daily Database connection: crucial for success

XML links
www.w3c.org
http://www.software.ibm.com/xml/
http://msdn.microsoft.com/xml/
www.xml.org
www.xml.com
…
