http://blog.inf.ed.ac.uk/da16

Informatics 1: Data & Analysis
Lecture 9: Trees and XML

Ian Stark

School of Informatics
The University of Edinburgh

Tuesday 9 February 2016
Semester 2 Week 5


Student Survey Season !

ESES: The Edinburgh Student Experience Survey

http://www.ed.ac.uk/students/surveys

Please log on to MyEd to complete the survey.

Help guide what we do at the University of Edinburgh, improving your future experience here and that of the students to follow.

Ian Stark Inf1-DA / Lecture 9 2016-02-09


Teaching Weeks !

This is Inf1-DA Lecture 9, in Week 5.

Next week is Innovative Learning Week (ILW). All lectures, tutorials, labs and coursework are suspended for the week, and replaced by a series of alternative events across the University.

http://www.innovativelearning.ed.ac.uk

Check the ILW calendar and sign up now: some activities run all week, some are one-off events.

The week after that, from Monday 22 February, is Teaching Week 6.

Not all courses have noticed that this is the week numbering scheme.


Lecture Plan

XML: The Extensible Markup Language
We start with technologies for modelling and querying semistructured data.

Semistructured Data: Trees and XML
Schemas for structuring XML
Navigating and querying XML with XPath

Corpora
One particular kind of semistructured data is large bodies of written or spoken text: each one a corpus, plural corpora.

Corpora: What they are and how to build them
Applications: corpus analysis and data extraction


XML: Reading Around the Subject

For a very brief summary and sales pitch, read this short introduction:

World Wide Web Consortium (W3C).
XML Essentials.
http://www.w3.org/standards/xml/core, W3C 2010.

For a more comprehensive introduction, see Chapter 2 of:

A. Møller and M. I. Schwartzbach.
An Introduction to XML and Web Technologies.
Addison-Wesley, 2006.

There are multiple copies of this book in the Main Library HUB section. The Library also offers this other book entirely online:

E. T. Ray.
Learning XML.
Second edition, O’Reilly 2003. http://edin.ac/1bOkwui


There’s More to Life than Structured Data

Relational databases record data in tables conforming to fixed schemas, satisfying various constraints about uniqueness and cross-referencing.

That can usefully capture real-world constraints in a way which supports automatic validation and efficient querying.

However it can also be helpful in some situations to structure data in a less rigid way. For example:

When the data has no strong inherent structure; or there is structure, but it varies from item to item;
When we wish to mark up (annotate) existing unstructured data (say, English text) with additional information (such as linguistic structure, or meaning);
When the structure of the data changes over time, perhaps as more data accumulates.


Trees

Often this kind of semistructured data is modelled using trees.

These are mathematical trees, not vegetation. You can recognise them by the fact that they grow branches downwards from a root at the top.

Nature notes
A tree is a set of linked nodes, with a single root node.
Each node is linked to a set of zero or more children, also nodes in the tree.
Every node has exactly one parent, except for the root node which has none.
A node with no children is a leaf; other nodes are internal.
Two nodes with a common parent are sibling nodes.

Trees contain no loops, and from each node there is always exactly one route back up to the root.
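The tree vocabulary above can be sketched in a few lines of Python. This is only an illustration: the class name `Node` and its methods are invented for this example and are not part of the lecture material.

```python
class Node:
    """A tree node: at most one parent and zero or more children."""

    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent          # None only for the root node
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def is_leaf(self):
        """A leaf is a node with no children."""
        return not self.children

    def siblings(self):
        """Other nodes that share this node's parent."""
        if self.parent is None:
            return []
        return [n for n in self.parent.children if n is not self]

    def path_to_root(self):
        """Labels along the unique route from this node up to the root."""
        node, route = self, []
        while node is not None:
            route.append(node.label)
            node = node.parent
        return route


root = Node("root")
a = Node("A", root)
b = Node("B", root)
c = Node("C", a)

print(c.path_to_root())                 # ['C', 'A', 'root']
print([n.label for n in a.siblings()])  # ['B']
```

Because each node stores exactly one parent, following `parent` links can never loop and always ends at the root.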


Know Your Trees

[Figure: four tree diagrams, highlighting in turn the root node, the leaves and internal nodes, the parent of A, and the children of A.]


Semistructured Data Models

There are several tree-like data models used with semistructured data.

We shall work with the XPath data model, developed for semistructured data represented using XML (of which more shortly).

The next slide shows an example of data structured according to the XPath data model, with a fragment of a geographical factbook or gazetteer.

Wikipedia: NuclearVacuum


Sample Semistructured Data

/
  Factbook
    Country  @code="SI"
      Name
        Slovenia
      Population
        2,020,000
      Capital
        Ljubljana
      Region
        Name
          Gorenjska
        Feature  @type="Lake"
          Bohinj
        Feature  @type="Mountain"
          Triglav
        Feature  @type="Mountain"
          Spik
    (Data for other countries)


XPath Node Types

Root Node: This is the root of the tree, labelled / .

Element Nodes: These are labelled with element names, categorising the data below them. In this example the element names are: Factbook, Country, Name, Population, Capital, Region, and Feature.
In the XPath data model, internal nodes other than the root are always element nodes.
The root node must have exactly one element node as child, called the root element. Here the root element is Factbook.

Text Nodes: Leaves of the tree storing textual information. In this example there are text nodes with strings "Slovenia", "2,020,000", "Ljubljana", "Gorenjska", "Triglav", "Bohinj" and "Spik".

Attribute Nodes: . . .


More XPath Node Types

Attribute Nodes: Leaves of the tree assigning a value to some attribute of an element node.

In the example, we use the @ symbol to identify attributes. In this case we see an attribute code for Country, and another attribute type associated with the Feature element, which is assigned the text values "Lake" and "Mountain".

In the XPath data model, attribute nodes are treated differently from other node types. For example, although the parent of an attribute node is an element node, when we talk about the children of this parent node we generally don’t include the attribute nodes.
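The separate treatment of attributes can be seen in Python's standard `xml.etree.ElementTree` module. This is a sketch under assumptions: ElementTree is not an exact implementation of the XPath data model (for instance, it stores text in a `.text` property rather than as a child node), and the tree is written here in XML notation, which the lecture introduces later.

```python
import xml.etree.ElementTree as ET

# A small part of the factbook tree, written in XML notation.
country = ET.fromstring(
    '<Country code="SI">'
    "<Name>Slovenia</Name><Capital>Ljubljana</Capital>"
    "</Country>"
)

# Iterating over an element visits its child elements only.
child_tags = [child.tag for child in country]
print(child_tags)       # ['Name', 'Capital']

# Attributes are held apart from the children, as a name-to-value mapping.
print(country.attrib)   # {'code': 'SI'}
```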

One aim of the XPath model, and the XML language, is that data should be self-describing: lots of the data in these trees is there to give information about the meaning of other data. People argue about this


Understanding an XPath Data Tree

In a tree like this, the meaning of data at a text node depends on all the element nodes that appear along the path from the root of the tree to the text node, and on the values of their associated attributes.

We usually write these paths with a / separator, beginning at the root. For example, the path to the text node containing "Bohinj" is

/Factbook/Country/Region/Feature/

and the value of the type attribute of the associated Feature element is "Lake". This tells us that Bohinj is a feature in a region in a country in the factbook, and that the type of feature is a lake.

Note that to get further information (such as the name of the country, Slovenia), we would need to follow another path from the relevant ancestor element (in this case, the Country element).
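As a rough illustration of following these paths, the sketch below uses Python's standard `xml.etree.ElementTree`, which supports a limited subset of XPath with paths taken relative to the root element. The data is the factbook fragment in XML notation (covered later in the lecture); the variable names are chosen for this example only.

```python
import xml.etree.ElementTree as ET

FACTBOOK = """
<Factbook>
  <Country code="SI">
    <Name>Slovenia</Name>
    <Population>2,020,000</Population>
    <Capital>Ljubljana</Capital>
    <Region>
      <Name>Gorenjska</Name>
      <Feature type="Lake">Bohinj</Feature>
      <Feature type="Mountain">Triglav</Feature>
      <Feature type="Mountain">Spik</Feature>
    </Region>
  </Country>
</Factbook>
"""

factbook = ET.fromstring(FACTBOOK)   # the root element, Factbook

# Walk down Country, then Region/Feature, looking for "Bohinj".
for country in factbook.findall("Country"):
    for feature in country.findall("Region/Feature"):
        if feature.text == "Bohinj":
            feature_type = feature.get("type")
            # A second path, from the Country ancestor, gives its name.
            country_name = country.findtext("Name")
            print(feature.text, feature_type, country_name)
```

The loop keeps hold of the Country element so that the second path can start from that ancestor rather than from the Feature itself.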


Understanding an XPath Data Tree

In a similar way the meaning of an element node depends on the path to that node from the root of the tree.

For example, in the factbook a Name element node is used in two different ways:

A path /Factbook/Country/Name/ leads to a text node containing the name of a country.

A path /Factbook/Country/Region/Name/ leads to a text node containing the name of a region.
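A small sketch of the same point, again using Python's `xml.etree.ElementTree` on a cut-down factbook written in XML notation: the two paths select different Name elements, while a search by element name alone cannot tell them apart.

```python
import xml.etree.ElementTree as ET

factbook = ET.fromstring(
    "<Factbook><Country code='SI'>"
    "<Name>Slovenia</Name>"
    "<Region><Name>Gorenjska</Name></Region>"
    "</Country></Factbook>"
)

# Paths are relative to the root element Factbook.
country_names = [n.text for n in factbook.findall("Country/Name")]
region_names = [n.text for n in factbook.findall("Country/Region/Name")]

# Searching on the element name alone mixes the two meanings.
all_names = [n.text for n in factbook.iter("Name")]

print(country_names)   # ['Slovenia']
print(region_names)    # ['Gorenjska']
print(all_names)       # ['Slovenia', 'Gorenjska']
```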

All of this structure in an XPath data tree can be written out in plain text using the Extensible Markup Language XML.


XML: Extensible Markup Language

XML is a formal language for presenting the kind of semistructured data we have just seen. It is a markup language in that it provides a way to mark up ordinary text with additional information.

XML was developed in the 1990s, building on the Standard Generalized Markup Language SGML and the Hypertext Markup Language HTML. It aimed to be simpler than SGML, but more general than HTML.

Like SQL, XML has a textual format which is suitable for robust machine-to-machine communication, by automatically generating and parsing data files, as well as being moderately human-readable.

(Compare, for example, the binary encoding in Abstract Syntax Notation One)

XML has become the standard mechanism for publishing data on the web.

As a human-readable data serialization language it does, however, have competitors such as JSON or YAML.


There’s a lot of it about

The “extensible” part of XML means it can be applied to all kinds of semistructured data, with customised versions for any number of application domains. For example, all of the following are based on XML:

XHTML for web pages
SVG for scalable vector graphics
OOXML for Microsoft Office documents .docx, .pptx, .xlsx
MathML for writing mathematics
GLADE for GTK+ user interface descriptions
GML, the Geography Markup Language
MusicXML for musical scores
FpML, the Financial Products Markup Language


Sample Semistructured Data in XML

<Factbook>
  <Country code="SI">
    <Name>Slovenia</Name>
    <Population>2,020,000</Population>
    <Capital>Ljubljana</Capital>
    <Region>
      <Name>Gorenjska</Name>
      <Feature type="Lake">Bohinj</Feature>
      <Feature type="Mountain">Triglav</Feature>
      <Feature type="Mountain">Spik</Feature>
    </Region>
  </Country>
  <!-- data for other countries here -->
</Factbook>
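Going the other way, a tree built in memory can be written out as XML text. The sketch below builds part of the factbook with Python's standard `xml.etree.ElementTree` and serialises it; it is one possible illustration and not how the slide's XML was produced.

```python
import xml.etree.ElementTree as ET

# Build part of the factbook tree node by node.
factbook = ET.Element("Factbook")
country = ET.SubElement(factbook, "Country", code="SI")
ET.SubElement(country, "Name").text = "Slovenia"
region = ET.SubElement(country, "Region")
ET.SubElement(region, "Name").text = "Gorenjska"
ET.SubElement(region, "Feature", type="Lake").text = "Bohinj"

# Serialise the tree as a plain text string.
xml_text = ET.tostring(factbook, encoding="unicode")
print(xml_text)
```

The output is a single line with no indentation, since the tree contains no whitespace text between elements.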


XML Elements

The building blocks of XML documents are elements, marked out by tags.

The content of a thing element is marked with a start tag <thing> at the beginning and an end tag </thing> at the end.

Elements must be properly nested. For example:

<Country><Region> ... </Region></Country>

is acceptable XML, whereas

<Country><Region> ... </Country></Region>

is not.

Elements in XML are case sensitive, so <REGION> is different from <Region> and <region>.
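Both rules can be checked with an XML parser. A minimal sketch using Python's standard `xml.etree.ElementTree`, where the helper name `is_well_formed` is made up for this example:

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Properly nested elements are accepted.
print(is_well_formed("<Country><Region>...</Region></Country>"))   # True

# Overlapping elements are rejected.
print(is_well_formed("<Country><Region>...</Country></Region>"))   # False

# Names are case sensitive, so this end tag does not match the start tag.
print(is_well_formed("<Region>...</REGION>"))                      # False
```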


Content of XML Elements

Each element in an XML document has content:

The content of the Capital element

<Capital>Ljubljana</Capital>

is the text string "Ljubljana".

The Region element given earlier has as content one Name element together with three Feature elements.

The root element Factbook has the whole document as content.

An element may possibly have empty content: <thing></thing>.
This can be abbreviated by the single hybrid tag: <thing/>.
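A quick check, sketched with Python's standard `xml.etree.ElementTree`, that the two ways of writing an empty element give the same result once parsed:

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring("<thing></thing>")
short_form = ET.fromstring("<thing/>")

for element in (long_form, short_form):
    # Same name, no child elements and no text in either case.
    print(element.tag, len(element), element.text)
```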


Attributes for XML Elements

Any element can have descriptive attributes which provide additional information about the element. For example:

<Feature type="Mountain"> ... </Feature>

declares that the attribute type of the given Feature element has value Mountain.

Attribute values are always enclosed in either single or double quotation marks.

A single element may have multiple different attributes, each with its own value declared in the element start tag:

<thing attr1="value1" attr2="value2" ... > ... </thing>

In designing an XML representation for semistructured data, there is sometimes a tension between putting information in the content of an element, or as one of its attributes.
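To make the design tension concrete, here is a sketch in Python comparing the lecture's attribute-based Feature with a hypothetical alternative that moves the same information into child elements. The `Type` and `Name` children in the second design are invented for this comparison.

```python
import xml.etree.ElementTree as ET

# Design 1: the feature type is an attribute (as in the lecture example).
as_attribute = ET.fromstring('<Feature type="Lake">Bohinj</Feature>')

# Design 2 (hypothetical alternative): the type is element content.
as_content = ET.fromstring(
    "<Feature><Type>Lake</Type><Name>Bohinj</Name></Feature>"
)

print(as_attribute.get("type"), as_attribute.text)               # Lake Bohinj
print(as_content.findtext("Type"), as_content.findtext("Name"))  # Lake Bohinj

# Single and double quotation marks around values are interchangeable.
thing = ET.fromstring("<thing attr1='value1' attr2=\"value2\"/>")
print(thing.attrib)   # {'attr1': 'value1', 'attr2': 'value2'}
```

Both designs carry the same facts; they differ in how the data is reached and in whether the value could later be given structure of its own.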


Matching XML with the Tree Model

Every XML document naturally represents a tree structure in the XPath data model:

Each XML element corresponds to an element node of the tree.

The XML root element corresponds to the root element of the tree (the one below the root node).

The text content of an individual XML element corresponds to a child text node of the corresponding element node in the tree.

An attribute definition in an element’s start tag corresponds to a child attribute node of the corresponding element node in the tree.

Of course, this correspondence is there because the data model was designed that way.
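The correspondence can be sketched as a short recursive walk that lists the element, attribute and text nodes of a parsed document. The function name `describe` is made up for this example, and the sketch is simplified: it ignores comments, processing instructions and any text that follows a child element.

```python
import xml.etree.ElementTree as ET

def describe(element, depth=0):
    """Return lines listing element, attribute and text nodes."""
    pad = "  " * depth
    lines = [f"{pad}element {element.tag}"]
    for name, value in element.attrib.items():
        lines.append(f"{pad}  attribute @{name}={value!r}")
    if element.text and element.text.strip():
        lines.append(f"{pad}  text {element.text.strip()!r}")
    for child in element:
        lines.extend(describe(child, depth + 1))
    return lines

region = ET.fromstring(
    "<Region><Name>Gorenjska</Name>"
    '<Feature type="Lake">Bohinj</Feature></Region>'
)
print("\n".join(describe(region)))
```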


Other XMLicities

XML files can contain comments <!-- almost anything here -->.
The full XPath data model also has comment nodes.

Well-formed XML documents should all begin with a declaration something like

<?xml version="1.0" encoding="UTF-8"?>

Both XML and the data model allow for all kinds of processing instruction nodes also written <?...?>.

Because XML documents are plain text files, there are some unexpected consequences in the tree structure:

Order of children matters (unless they are attributes)
Whitespace sometimes matters but in ways too horrible to describe.
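A few of these consequences can be observed directly. The sketch below uses Python's standard `xml.etree.ElementTree`; it shows only how this one parser behaves and does not attempt to cover the full whitespace rules.

```python
import xml.etree.ElementTree as ET

compact = ET.fromstring("<Region><Name>Gorenjska</Name></Region>")
spaced = ET.fromstring("<Region>\n  <Name>Gorenjska</Name>\n</Region>")

# Indentation in the file turns up as whitespace text inside Region.
print(repr(compact.text))   # None
print(repr(spaced.text))    # '\n  '

# The order of child elements is kept exactly as written.
ab = ET.fromstring("<r><a/><b/></r>")
ba = ET.fromstring("<r><b/><a/></r>")
print([c.tag for c in ab], [c.tag for c in ba])   # ['a', 'b'] ['b', 'a']

# Attribute order makes no difference to the name-to-value mapping.
x = ET.fromstring('<t p="1" q="2"/>')
y = ET.fromstring('<t q="2" p="1"/>')
print(x.attrib == y.attrib)   # True
```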


Two Examples of XML

MusicXML

http://www.musicxml.com/tutorial/hello-world

Financial products Markup Language FpML

http://www.fpml.org http://is.gd/fpml59example


Homework !

Between this lecture and the next, on Friday, do the following.

Read XML Essentials from the W3C
http://www.w3.org/standards/xml/core

Read Sections 2.1–2.5 of Møller and Schwartzbach; distributed by email and available outside the ITO in Forrest Hill.

Find an SVG file and open it in a text editor to study its XML content.

Find a .docx file, and look at its XML content.

This format is in fact a zipped archive of XML files, so you will need to unzip it first. Depending on your platform, this may require renaming the .docx extension as .zip.
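As an alternative to renaming the file, the archive can be opened from a program. This sketch uses Python's standard `zipfile` module; the function name `xml_parts` is made up for this example, and the archive built in memory is a stand-in so the code runs without a real document (it is not a valid Word file).

```python
import io
import zipfile

def xml_parts(docx_file):
    """List the names of the XML files stored inside a .docx archive."""
    with zipfile.ZipFile(docx_file) as archive:
        return [name for name in archive.namelist() if name.endswith(".xml")]

# Stand-in archive, built in memory for demonstration only.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<document/>")
    archive.writestr("docProps/app.xml", "<Properties/>")

print(xml_parts(buffer))   # ['word/document.xml', 'docProps/app.xml']
```

With a real document, pass its filename instead of the in-memory buffer; the main text of a Word document is normally held in `word/document.xml`.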
