Structured Data

Feb 24, 2016

Gibson Gibson

Page 1: Structured Data

Structured Data

1. HTML
2. XML
3. XHTML
4. JSON
5. XML Schema

Page 2: Structured Data

Structured Data
• Machine-processable data needs to be structured
• There are many examples
• Properties files:

    host=example.com
    port=8080
    protocol=https

• Comma Separated Values:

    host,port,protocol
    example.com,8080,https

• These are examples of ‘flat files’
• It is hard to model composite structures with them
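A minimal sketch of reading the two flat-file examples above in Python. The helper names `parse_properties` and `parse_csv` are made up for illustration; the sample text is copied from the slide.

```python
import csv
import io

PROPERTIES_TEXT = """host=example.com
port=8080
protocol=https
"""

CSV_TEXT = """host,port,protocol
example.com,8080,https
"""


def parse_properties(text):
    """Turn key=value lines into a dict, skipping blank lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def parse_csv(text):
    """Turn CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


properties = parse_properties(PROPERTIES_TEXT)
rows = parse_csv(CSV_TEXT)
print(properties)
print(rows)
```

Both formats give one flat level of keys and string values. Neither has an obvious place for, say, a publisher that has its own name and URL, which is the composite-structure problem the slide mentions.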

Page 3: Structured Data

HTML and XML
• Derivatives of Standard Generalized Markup Language (SGML).
• Offer a machine-readable, yet machine-independent, means of conveying information
• Use the angle bracket syntax (<>) to structure the document.
• Based on a tree structure:

    <html>
      <head></head>
      <body>
        <p> hello world </p>
      </body>
    </html>

  (Diagram labels: root, siblings, child. <html> is the root, <head> and <body> are siblings, and <p> is a child of <body>.)
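The tree structure can be made visible by parsing the slide's markup and walking it. This is a sketch using Python's standard `xml.etree.ElementTree`; the `walk` helper is a hypothetical name introduced here.

```python
import xml.etree.ElementTree as ET

DOCUMENT = "<html><head></head><body><p> hello world </p></body></html>"


def walk(element, depth=0):
    """Yield (depth, tag) for the element and all of its descendants."""
    yield depth, element.tag
    for child in element:
        yield from walk(child, depth + 1)


root = ET.fromstring(DOCUMENT)
outline = list(walk(root))
for depth, tag in outline:
    # Indent each tag by its depth so the nesting is visible.
    print("  " * depth + tag)
```

The indentation of the printed outline mirrors the nesting: `html` at depth 0, `head` and `body` side by side at depth 1, and `p` beneath `body`.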

Page 4: Structured Data

Elements and Attributes
• Elements are structural
• Attributes qualify elements

    <html>
      <head></head>
      <body bgcolor="red">
        <p> hello world </p>
      </body>
    </html>

  (Diagram labels: attribute, element. bgcolor="red" is an attribute; <p> is an element.)
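A short sketch of how a parser exposes the distinction: the element is the node, and its attributes arrive as a name-to-value mapping attached to that node. It uses Python's standard `xml.etree.ElementTree` on the slide's markup.

```python
import xml.etree.ElementTree as ET

DOCUMENT = (
    '<html><head></head>'
    '<body bgcolor="red"><p> hello world </p></body></html>'
)

root = ET.fromstring(DOCUMENT)

body = root.find("body")
paragraph = body.find("p")

# The element carries the structure; attributes qualify it.
print(body.tag, body.attrib)
print(paragraph.tag, paragraph.attrib, paragraph.text.strip())
```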

Page 5: Structured Data

Hypertext Markup Language (HTML)

• Its primary purpose is to convey information to a browser for human consumption:
  – <p>, <b>, <i>, <pre> etc.
• It does contain other tags that are not presentational.
• Like one for metadata:
  – <meta>
• And ones that are structural:
  – e.g. <head>, <body>, <div>, <span>
• And some that are sort of in between:
  – e.g. <ol>, <ul>, <h1>, <title>
• HTML can embed information:
  – e.g. <img>, <object>
• It can also contain style and script content in the header:
  – <style>, <script>
• Most importantly, it can link to other resources via the anchor tag and href attribute:
  – e.g. <a href="http://example.com/otherpage.html">
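Because links are plain attributes on anchor tags, a program can follow them as easily as a person can. A minimal sketch with Python's standard `html.parser`; `LinkCollector`, `extract_links` and the sample sentence are illustrative names and data, not part of the slides.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag seen."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def extract_links(html_text):
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


SAMPLE = '<p>See <a href="http://example.com/otherpage.html">the other page</a>.</p>'
links = extract_links(SAMPLE)
print(links)
```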

Page 6: Structured Data

HTML
• HTML example describing a book:

    <h1>The Cat in the Hat</h1>
    <br>
    <p>by Dr Seuss</p>
    <ul>
      <li>Publisher: HarperCollins</li>
      <li>Genre: Children’s Fiction</li>
      <li>Year: 2003</li>
      <li>ISBN: 0-00-715853</li>
    </ul>
    <br>
    visit the website <a href="http://harp.co.uk">here</a>

Page 7: Structured Data

HTML
• The main limitations of HTML are:
  – Fixed set of tags
  – Focus on presentation
    • Like the Web, it is primarily for human consumption
  – Not all HTML is ‘well-formed’, i.e. it breaks the tree structure
    • The classic case is orphan <br> tags. Strictly speaking, every opening tag must either have a matching closing tag or be written as an empty tag (<br/>).
    • During the browser wars, mostly between Microsoft and Netscape, browsers became very forgiving of invalid markup in order to recruit users.
    • This is just about OK when dealing with a fixed set of presentational tags, free market economics permitting
    • But it is not sustainable and not good for machine parsing
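The difference between a forgiving and a strict parser can be seen with Python's standard library. In this sketch, `html.parser` stands in for a lenient browser and `xml.etree.ElementTree` for a strict XML parser; `is_well_formed` and `TagRecorder` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def is_well_formed(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


class TagRecorder(HTMLParser):
    """Lenient parser that just records the start tags it sees."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


ORPHAN_BR = "<p>line one<br>line two</p>"
CLOSED_BR = "<p>line one<br/>line two</p>"

lenient = TagRecorder()
lenient.feed(ORPHAN_BR)
lenient.close()

print(lenient.start_tags)         # the lenient parser carries on regardless
print(is_well_formed(ORPHAN_BR))  # the strict parser rejects the orphan <br>
print(is_well_formed(CLOSED_BR))  # and accepts the empty-tag form
```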

Page 8: Structured Data

Extensible Markup Language (XML)

• XML is (e)xtensible.
  – You can create your own tags, which means
  – Tags can be understood in semantic terms:
    • e.g. <book> contains <author>
  – XML MUST be well-formed (no structural inconsistencies like <br>)
  – Validation against a Document Type Definition (DTD), XML Schema or RELAX NG document is easier because it is well-formed.
    • These define what a particular document can contain, e.g. a book element MUST contain >= 1 author elements

Page 9: Structured Data

XML
• XML example of a book:

    <?xml version="1.0"?>
    <book>
      <title>The Cat in the Hat</title>
      <author>Dr Seuss</author>
      <isbn>0-00-715853</isbn>
      <genre>Children’s Fiction</genre>
      <published>2003</published>
      <publisher>
        <name>HarperCollins</name>
        <url>http://harp.co.uk</url>
      </publisher>
    </book>
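Since the tags name what the data means, pulling the book's properties out is a matter of asking for them by name. A sketch with Python's standard `xml.etree.ElementTree`, assuming the `<isbn>` element is closed properly with `</isbn>`; the `record` dictionary is an illustrative shape, not something the slides define.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<?xml version="1.0"?>
<book>
  <title>The Cat in the Hat</title>
  <author>Dr Seuss</author>
  <isbn>0-00-715853</isbn>
  <genre>Children's Fiction</genre>
  <published>2003</published>
  <publisher>
    <name>HarperCollins</name>
    <url>http://harp.co.uk</url>
  </publisher>
</book>"""

book = ET.fromstring(BOOK_XML)

# findtext takes a path, so nested (composite) data is reachable by name.
record = {
    "title": book.findtext("title"),
    "author": book.findtext("author"),
    "isbn": book.findtext("isbn"),
    "published": book.findtext("published"),
    "publisher": book.findtext("publisher/name"),
    "url": book.findtext("publisher/url"),
}
print(record)
```

Unlike the flat-file formats, the nested `<publisher>` element holds its own `name` and `url`.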

Page 10: Structured Data

XML Pros
• Plain text
  – Human readable
  – Create/edit in standard text editor (if you really want to)
• Self-describing, structured data
  – Extensible tag language
  – Machine readable
  – Can be validated against DTDs and Schema
• Presentation independent
  – Unlike HTML
  – Format to other languages using transformations (e.g. XSLT)
• Programming language independent
  – Java, C, C++, Visual Basic, Perl…
• Simple to parse
• Widely used in many domains and for many purposes

Page 11: Structured Data

XML Cons

• The main limitations of XML are:
  – Verbose way of describing data
  – How do you include binary data (e.g. images)?
    • (work in progress and not ubiquitously supported)
  – A proliferation of DTD and Schema types because anyone can create their own tags
    • Lots of processing time for each new XML doc and DTD/Schema you come across
    • New software components to understand the new XML docs (their semantics not structure)
    • How do I know if your <author> tag means the same as my <author> tag?

Page 12: Structured Data

XML Namespaces
• This last issue is addressed through namespaces
  – Allows a tag to be qualified by a URI:

    <a:author xmlns:a="http://andrew/namespace">
    <s:author xmlns:s="http://sue/namespace">

  (Diagram labels: prefix, namespace, binding. In xmlns:a="http://andrew/namespace", a is the prefix, the URI is the namespace, and the xmlns:a attribute binds one to the other.)

• Now I can tell the difference between the two author tags :-)
• But the XML is more complicated :-(
• And what happens if I change the definition of my author tag?
• I suppose I better change the namespace:

    <a:author xmlns:a="http://andrew/namespace/v1">

• That’s better :-)
• But now every client that understood the previous namespace is broken :-(
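A sketch of how a parser sees namespaced tags, using Python's standard `xml.etree.ElementTree`. The `<books>` wrapper element and the text inside the author tags are invented for the example; the two namespace URIs come from the slide.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<books xmlns:a="http://andrew/namespace"
                     xmlns:s="http://sue/namespace">
  <a:author>Andrew's kind of author</a:author>
  <s:author>Sue's kind of author</s:author>
</books>"""

root = ET.fromstring(DOCUMENT)

# The parser replaces each prefix with its URI, so the two tags differ.
tags = [child.tag for child in root]
print(tags)

# A client looks elements up by namespace URI; the prefix it uses is its own choice.
client_namespaces = {"andrew": "http://andrew/namespace"}
andrews_authors = root.findall("andrew:author", client_namespaces)

# A client written against a different version of the URI finds nothing.
v1_authors = root.findall("{http://andrew/namespace/v1}author")
print(len(andrews_authors), len(v1_authors))
```

The last lookup illustrates the versioning problem from the slide: to a parser, `http://andrew/namespace` and `http://andrew/namespace/v1` are simply two unrelated names.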

Page 13: Structured Data

RDF XML example

    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:foaf="http://xmlns.com/foaf/0.1/">
      <foaf:Person rdf:about="#AL">
        <foaf:name>Archibald Leach</foaf:name>
        <foaf:mbox_sha1sum>cf2342293...</foaf:mbox_sha1sum>
        <foaf:knows>
          <foaf:Person>
            <foaf:name>Katharine Hepburn</foaf:name>
          </foaf:Person>
        </foaf:knows>
      </foaf:Person>
    </rdf:RDF>

Page 14: Structured Data

XHTML
• In between HTML and XML
  – It is valid HTML and valid XML
• MUST be well-formed.
• Fixed set of tags
  – Makes use of HTML non-presentational tags.
  – Defers presentational concerns completely to Cascading Style Sheets (CSS)
    • Instead uses element attributes to inject presentational hints to the CSS:

    <div class="my-important-type">I’m important</div>

  (Diagram label: class attribute.)

Page 15: Structured Data

Cascading Style Sheets (CSS)
• A rendering language that goes in the header of an HTML page
  – Property based
    • element-type {presentation-key: value}
• CSS allows for extensibility!
  – I can define a class, and define rendering hints to the browser for that class:

    <style type="text/css">
      .my-important-type {color: red}
    </style>

    And in the document:

    <div class="my-important-type">Hey wait!</div>

• Hey, wait!
• At the same time as defining rendering hints to the browser, I’m also classifying an element in the document.
• Perhaps I can use this to support semantic information, not just rendering information
• So I could call my class .book and have elements inside it like .title and .author. Hmm…

Page 16: Structured Data

XHTML example

    <html>
    <head>
      <title>My Book</title>
    </head>
    <body>
      <div class="book">
        <h1 class="title">The Cat in the Hat</h1>
        <p>by <span class="author">Dr Seuss</span></p>
        <ul>
          <li>Publisher: <span class="pub">HarperCollins</span></li>
          <li>Genre: <span class="genre">Children’s Fiction</span></li>
          <li>Year: <span class="year">2003</span></li>
          <li>ISBN: <span class="isbn">0-00-715853</span></li>
        </ul>
      </div>
      <p>visit the website at <a href="http://harp.co.uk" class="url" title="http://harp.co.uk">here</a></p>
    </body>
    </html>

Page 17: Structured Data

XHTML with some CSS
• Here’s what it looks like in a browser with a bit of CSS in the head of the HTML page:

  (screenshot of the rendered page)

The important thing to take away here is that the data has not been lost through rendering.
It looks nice for a human, but a machine can still extract the book properties.
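To make "a machine can still extract the book properties" concrete, here is a sketch that reads the class attributes back out of the XHTML. It assumes the page is well-formed (every tag closed, a single `<html>` root) so that Python's standard XML parser can read it. `FIELD_CLASSES` and `extract_book` are hypothetical names.

```python
import xml.etree.ElementTree as ET

XHTML = """<html>
<head><title>My Book</title></head>
<body>
  <div class="book">
    <h1 class="title">The Cat in the Hat</h1>
    <p>by <span class="author">Dr Seuss</span></p>
    <ul>
      <li>Publisher: <span class="pub">HarperCollins</span></li>
      <li>Genre: <span class="genre">Children's Fiction</span></li>
      <li>Year: <span class="year">2003</span></li>
      <li>ISBN: <span class="isbn">0-00-715853</span></li>
    </ul>
  </div>
</body>
</html>"""

# Class names treated as book properties (an assumption of this sketch).
FIELD_CLASSES = {"title", "author", "pub", "genre", "year", "isbn"}


def extract_book(xhtml_text):
    """Collect the text of every element inside div.book whose class is a known field."""
    root = ET.fromstring(xhtml_text)
    container = root.find(".//div[@class='book']")
    book = {}
    for element in container.iter():
        css_class = element.get("class")
        if css_class in FIELD_CLASSES:
            book[css_class] = "".join(element.itertext()).strip()
    return book


book = extract_book(XHTML)
print(book)
```

The same class names that a stylesheet would target for rendering double as field names for the extractor.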

Page 18: Structured Data

HTML 5
• Builds on HTML 4
• A set of features, rather than a monolithic spec.
• Not all browsers support all features yet.
• HTML 5 can be written as well-formed XML (the XHTML syntax); the plain HTML syntax does not require it
• Some core features:
  – Canvas: drawing area
  – Video: embed directly, no need for plugins
  – Local storage
  – Multi-threaded JavaScript
  – Geolocation
  – Semantic tags: section, header, footer etc.
  – Microdata: embedded semantic metadata, e.g. licensing, vCards and your own vocabs.

Page 19: Structured Data

HTML 5
• Microdata: embedded semantic metadata, e.g. licensing, vCards and your own vocabs.
• You can create scopes on a tag:

    <section itemscope itemtype="http://data-vocabulary.org/Person">

  – Then mark up elements within the scope:

    <img itemprop="photo" src="…"/>
    <p itemprop="name">Andrew</p>

• Then publish your vocabulary so people can use it.
• Publish in human-readable form, and RDF for machine processing.

See http://html5demos.com/
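A rough sketch of reading those microdata attributes back with Python's standard `html.parser`. It is a deliberately simplified reading (one scope, no nested items), not the full microdata extraction algorithm. `MicrodataCollector` is a made-up class name, and `photo.jpg` is a placeholder, since the slide elides the image source.

```python
from html.parser import HTMLParser


class MicrodataCollector(HTMLParser):
    """Collect the itemtype and itemprop values from a single, un-nested scope."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.properties = {}
        self._open_prop = None  # itemprop whose text is currently being read

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "itemscope" in attributes:
            self.item_type = attributes.get("itemtype")
        prop = attributes.get("itemprop")
        if prop is None:
            return
        if tag == "img":
            # For images the property value is the src URL, not text content.
            self.properties[prop] = attributes.get("src")
        else:
            self._open_prop = prop
            self.properties[prop] = ""

    def handle_data(self, data):
        if self._open_prop is not None:
            self.properties[self._open_prop] += data

    def handle_endtag(self, tag):
        self._open_prop = None


SAMPLE = """<section itemscope itemtype="http://data-vocabulary.org/Person">
  <img itemprop="photo" src="photo.jpg"/>
  <p itemprop="name">Andrew</p>
</section>"""

collector = MicrodataCollector()
collector.feed(SAMPLE)
collector.close()
print(collector.item_type, collector.properties)
```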

Page 20: Structured Data

JavaScript Object Notation (JSON)

• Another structured document type, not based on XML.
• Instead uses properties, and nested curly braces to describe data:

    {"location":
      {"id": "WashingtonDC",
       "city": "Washington DC",
       "venue": "Hilton Hotel, Tysons Corner",
       "address": "7920 Jones Branch Drive"
      }
    }

• Essentially a dictionary
• Supports number, string, boolean, null, array (list) and Object (map)
• JSON can be parsed into a JavaScript object using JSON.parse(string). eval(string) also works, but it executes whatever code the string contains, so it is unsafe for untrusted input.
• Popular because it is simpler than XML and natively understood by browsers.
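The "essentially a dictionary" point carries over to other languages. A sketch in Python, whose standard `json` module parses the slide's example straight into nested dictionaries without evaluating it as code:

```python
import json

TEXT = """{"location":
  {"id": "WashingtonDC",
   "city": "Washington DC",
   "venue": "Hilton Hotel, Tysons Corner",
   "address": "7920 Jones Branch Drive"
  }
}"""

# json.loads only parses data; it never runs the text as a program.
data = json.loads(TEXT)
location = data["location"]
print(location["city"])

# Serialising back to text and re-parsing gives the same structure.
round_trip = json.dumps(data, indent=2)
print(round_trip)
```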

Page 21: Structured Data

XML Schema

• XML syntax for describing how XML documents should be structured.
  – Has some built-in data types
• Allows for validation of an XML document
• Allows for code generation
  – Create objects in your favorite programming language to manipulate XML documents

Page 22: Structured Data

    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                targetNamespace="urn:book"
                xmlns:bk="urn:book">

      <xsd:element name="book" type="bk:Book"/>

      <xsd:complexType name="Book">
        <xsd:sequence>
          <xsd:element name="title" type="xsd:string"/>
          <xsd:element name="author" type="xsd:string"/>
          <xsd:element name="isbn" type="xsd:string"/>
          <xsd:element name="genre" type="xsd:string"/>
          <xsd:element name="published" type="xsd:gYear"/>
          <xsd:element name="publisher" type="bk:Publisher"/>
        </xsd:sequence>
      </xsd:complexType>

      <xsd:complexType name="Publisher">
        <xsd:sequence>
          <xsd:element name="name" type="xsd:string"/>
          <xsd:element name="url" type="xsd:anyURI"/>
        </xsd:sequence>
      </xsd:complexType>
    </xsd:schema>
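Because a schema is itself XML, it can be read with an ordinary XML parser. The sketch below lists what each complex type declares and then applies a deliberately naive order check to a book document. It is not a schema validator: it ignores data types, occurrence rules and the target namespace, and Python's standard library has no XSD validator. `declared_sequences` and `children_match` are hypothetical helpers. The schema text uses `xsd:gYear` for `published`, an assumption made here so that a bare year such as 2003 would be acceptable to a real validator.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"

SCHEMA = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="urn:book" xmlns:bk="urn:book">
  <xsd:element name="book" type="bk:Book"/>
  <xsd:complexType name="Book">
    <xsd:sequence>
      <xsd:element name="title" type="xsd:string"/>
      <xsd:element name="author" type="xsd:string"/>
      <xsd:element name="isbn" type="xsd:string"/>
      <xsd:element name="genre" type="xsd:string"/>
      <xsd:element name="published" type="xsd:gYear"/>
      <xsd:element name="publisher" type="bk:Publisher"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="Publisher">
    <xsd:sequence>
      <xsd:element name="name" type="xsd:string"/>
      <xsd:element name="url" type="xsd:anyURI"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>"""

BOOK = """<book>
  <title>The Cat in the Hat</title>
  <author>Dr Seuss</author>
  <isbn>0-00-715853</isbn>
  <genre>Children's Fiction</genre>
  <published>2003</published>
  <publisher><name>HarperCollins</name><url>http://harp.co.uk</url></publisher>
</book>"""


def declared_sequences(schema_text):
    """Map each complexType name to the ordered (element name, type) pairs it declares."""
    schema = ET.fromstring(schema_text)
    result = {}
    for complex_type in schema.findall(XSD + "complexType"):
        elements = complex_type.findall(XSD + "sequence/" + XSD + "element")
        result[complex_type.get("name")] = [
            (element.get("name"), element.get("type")) for element in elements
        ]
    return result


def children_match(element, fields):
    """Naive check: the child tags appear exactly in the declared order."""
    return [child.tag for child in element] == [name for name, _ in fields]


sequences = declared_sequences(SCHEMA)
book = ET.fromstring(BOOK)
book_ok = children_match(book, sequences["Book"])
publisher_ok = children_match(book.find("publisher"), sequences["Publisher"])
print(sequences["Book"])
print(book_ok, publisher_ok)
```

The list of (name, type) pairs recovered from the schema is also the raw material a code generator would use to emit a `Book` class with typed fields.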

Page 23: Structured Data

Structured Data

• Why use structured data?
• Understand how structured data encapsulates information
• What are the strengths/weaknesses of different types of structured data?