Software Engineering Lecture 3 Annatala Wolf TREES AND XML.


Mathematical Modeling

When we consider types abstractly (from the client perspective), we can use math to describe them in an unambiguous manner.

int: integers (0, -14, 9000)
double: real numbers (0.0, 3.14159, -0.00008)
boolean: Boolean value (true, false)
char: “character” Unicode labels (‘a’, ‘4’, ‘\t’)
String: mathematical string of character
text streams: mathematical string of character

Basic Types: Sets and Strings

Sets: denoted with { }
A set is an unordered collection of items. Sets don’t retain information about duplicates. The number of elements in a set is its size (or cardinality).
{ } is the empty set (the only set with no elements).
{ 1, 2 } = { 2, 1 } = { 1, 1, 2 }

Strings: denoted with < >
A string is a totally ordered collection of some type, which can grow or shrink. Strings don’t have a size; they have a length.
< 1, 2, 3 > has the element 1 at the front of the list.
We use “Crackle” to abbreviate < ‘C’, ‘r’, ‘a’, ‘c’, ‘k’, ‘l’, ‘e’ >.
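The equalities above can be checked with a small Java sketch. It assumes java.util.Set is a reasonable stand-in for a mathematical set and java.util.List for a mathematical string; the class and method names (SetsAndStrings, setOf, stringOf) are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SetsAndStrings {
    // Builds a set from the given items; duplicates and ordering are discarded.
    static Set<Integer> setOf(Integer... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    // Builds a string (sequence) from the given items; order and duplicates are kept.
    static List<Integer> stringOf(Integer... items) {
        return Arrays.asList(items);
    }

    public static void main(String[] args) {
        // { 1, 2 } = { 2, 1 } = { 1, 1, 2 }
        System.out.println(setOf(1, 2).equals(setOf(2, 1)));       // true
        System.out.println(setOf(1, 2).equals(setOf(1, 1, 2)));    // true
        System.out.println(setOf(1, 1, 2).size());                 // 2 (size)

        // < 1, 2 > differs from < 2, 1 >, and < 1, 1, 2 > has length 3.
        System.out.println(stringOf(1, 2).equals(stringOf(2, 1))); // false
        System.out.println(stringOf(1, 1, 2).size());              // 3 (length)
    }
}
```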

Basic Types: Tuples

Tuples: denoted with ( )
A tuple is also a totally ordered collection, but the length of a tuple is typically fixed. It also doesn’t matter which item is “first”. All that matters is that you can identify which element is which, semantically speaking (when you interpret the meaning).

For example, an ordered pair (x, y) is simply a two-tuple (a tuple of length two). It doesn’t really matter whether you put x or y first, as long as you know which one is x and which one is y!

A database row (name, SSN, birthdate) is another example. Each datum is a separate field in the tuple.
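One way to picture the database-row tuple in Java is a small class with one named field per position. This is only a sketch: the class name PersonRow and the sample values are invented for illustration.

```java
final class PersonRow {
    // Each field is one component of the tuple (name, SSN, birthdate).
    final String name;
    final String ssn;
    final String birthdate;

    PersonRow(String name, String ssn, String birthdate) {
        this.name = name;
        this.ssn = ssn;
        this.birthdate = birthdate;
    }

    public static void main(String[] args) {
        // Made-up data, for illustration only.
        PersonRow row = new PersonRow("Ada", "000-00-0000", "1815-12-10");
        // We pick out a component by its meaning (the field name), not by its position.
        System.out.println(row.name + " was born on " + row.birthdate);
    }
}
```

The fields could be declared in any order without changing what the tuple means, which is the point made above about (x, y).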

Warning: English, Math, and Java

TODO (see Piazza note for now)

Seven Bridges of Königsberg*

Classic problem from the 1700s: could you cross each bridge in Königsberg exactly once, in a single path?

This seems like a question math could answer… but at the time, math had not yet been applied to this sort of physical “relationship”.

Euler (the same fellow who named the constant e) proved in 1735 that it was impossible to do this.

Graph Theory*

Euler proved this by inventing a new form of math called graph theory, which studies connections. Later, this work helped give rise to a related field, topology.

Each region is a vertex…
…each bridge is an edge.

Euler noticed that all 4 vertices have an odd degree (number of edges). But unless a vertex is at the start or end of the path, it must have an even degree (one edge in = one edge out). Only two vertices can be the start and end, so at most two may have an odd degree. A contradiction!
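Euler's degree argument can be replayed in a few lines of Java. This is a sketch under stated assumptions: the numbering of the four regions and the names Koenigsberg, BRIDGES, degrees, oddDegreeCount, and hasEulerPath are invented here. The check in hasEulerPath also leans on the standard result (beyond what the argument above proves) that a connected graph has such a path exactly when 0 or 2 vertices have odd degree.

```java
import java.util.Arrays;

class Koenigsberg {
    // Regions: 0 = central island, 1 = north bank, 2 = south bank, 3 = eastern land.
    // Each pair is one bridge (edge) joining two regions (vertices).
    static final int[][] BRIDGES = {
        {0, 1}, {0, 1}, {0, 2}, {0, 2}, {0, 3}, {1, 3}, {2, 3}
    };

    // Counts how many edge endpoints touch each vertex.
    static int[] degrees(int vertexCount, int[][] edges) {
        int[] degree = new int[vertexCount];
        for (int[] edge : edges) {
            degree[edge[0]]++;
            degree[edge[1]]++;
        }
        return degree;
    }

    static int oddDegreeCount(int[] degree) {
        int odd = 0;
        for (int d : degree) {
            if (d % 2 != 0) {
                odd++;
            }
        }
        return odd;
    }

    // Assumes the graph is connected (Königsberg is).
    static boolean hasEulerPath(int vertexCount, int[][] edges) {
        int odd = oddDegreeCount(degrees(vertexCount, edges));
        return odd == 0 || odd == 2;
    }

    public static void main(String[] args) {
        int[] degree = degrees(4, BRIDGES);
        System.out.println(Arrays.toString(degree));   // [5, 3, 3, 3]
        System.out.println(oddDegreeCount(degree));    // 4
        System.out.println(hasEulerPath(4, BRIDGES));  // false
    }
}
```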

Graph Terminology*

Graph: a set of vertices (also called nodes) connected by edges.
Each edge links two vertices, or one vertex with itself (called a self-loop or self-edge).
Some graphs may have edges that only go in one direction, or hold other data (color, order, weight).

A path is a sequence of connected vertices. (A simple path is a path with no repeated vertices.)

A cycle is a path that ends on the same vertex from which it started.

Trees*

A tree is a very common and useful type of graph! Three definitions are equivalent:

A tree is any undirected graph where:
1. Every two vertices are connected by exactly one simple path.
2. It is connected (every vertex can reach every other) and acyclic (it contains no simple cycles).
3. It is connected, and has one fewer edge than vertices.

If permitted, it could also be the order-zero graph, which has no nodes or edges: an empty tree, in other words.
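Definition 3 is the easiest of the three to turn into code. Here is a rough Java sketch, assuming vertices are numbered from 0 and edges are given as pairs; the names TreeCheck and isTree are hypothetical, and the empty graph is accepted as a tree (the "if permitted" case).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class TreeCheck {
    // True if the undirected graph is connected and has one fewer edge than vertices.
    static boolean isTree(int vertexCount, int[][] edges) {
        if (vertexCount == 0) {
            return edges.length == 0;          // the empty tree
        }
        if (edges.length != vertexCount - 1) {
            return false;                      // wrong edge count
        }
        // Build adjacency lists so we can walk the graph.
        List<List<Integer>> neighbors = new ArrayList<>();
        for (int v = 0; v < vertexCount; v++) {
            neighbors.add(new ArrayList<Integer>());
        }
        for (int[] edge : edges) {
            neighbors.get(edge[0]).add(edge[1]);
            neighbors.get(edge[1]).add(edge[0]);
        }
        // Depth-first walk from vertex 0, counting every vertex we can reach.
        boolean[] seen = new boolean[vertexCount];
        Deque<Integer> pending = new ArrayDeque<>();
        pending.push(0);
        seen[0] = true;
        int reached = 1;
        while (!pending.isEmpty()) {
            int v = pending.pop();
            for (int w : neighbors.get(v)) {
                if (!seen[w]) {
                    seen[w] = true;
                    reached++;
                    pending.push(w);
                }
            }
        }
        return reached == vertexCount;         // connected?
    }

    public static void main(String[] args) {
        System.out.println(isTree(3, new int[][] {{0, 1}, {1, 2}}));          // true
        System.out.println(isTree(3, new int[][] {{0, 1}, {1, 2}, {0, 2}}));  // false (cycle)
        // Triangle plus an isolated vertex: right edge count, but not connected.
        System.out.println(isTree(4, new int[][] {{0, 1}, {1, 2}, {0, 2}}));  // false
    }
}
```

The last case shows why the edge count alone is not enough: the definition needs both halves.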

Rooted Tree Terminology

A rooted tree is a tree with one node specially designated as the root.

Leaves are nodes without children.

Nodes with children are called internal nodes.

Size: total number of nodes in the tree.

Height: total number of levels. (We count the root as a level, though other sources often skip it.)

Recursive Structure

Trees have a recursive structure to them. That means you can make a bigger tree by sticking together one or more smaller trees (subtrees).

If our tree is ordered, then we can tell outgoing edges apart. We usually index them starting with zero on the left side.
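To make the recursive structure concrete, here is a minimal Java sketch of an ordered rooted tree whose size and height are computed from its subtrees. The class OrderedTree and its members are hypothetical names, and height counts the root as a level, as in the terminology above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OrderedTree {
    final String label;
    final List<OrderedTree> children;  // ordered: index 0 is the leftmost child

    // A bigger tree is a root label plus zero or more smaller trees.
    OrderedTree(String label, OrderedTree... subtrees) {
        this.label = label;
        this.children = new ArrayList<>(Arrays.asList(subtrees));
    }

    boolean isLeaf() {
        return children.isEmpty();
    }

    // Size: this node plus the sizes of all subtrees.
    int size() {
        int total = 1;
        for (OrderedTree child : children) {
            total += child.size();
        }
        return total;
    }

    // Height: this level plus the height of the tallest subtree.
    int height() {
        int tallest = 0;
        for (OrderedTree child : children) {
            tallest = Math.max(tallest, child.height());
        }
        return 1 + tallest;
    }

    public static void main(String[] args) {
        OrderedTree tree = new OrderedTree("root",
                new OrderedTree("a", new OrderedTree("c"), new OrderedTree("d")),
                new OrderedTree("b"));
        System.out.println(tree.size());                 // 5
        System.out.println(tree.height());               // 3
        System.out.println(tree.children.get(0).label);  // a
    }
}
```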

XML (eXtensible Markup Language)

XML is a structured text format for storing data.
HTML is (in a sense) a specialized form of XML.

<?xml version="1.0" encoding="UTF-8"?>
<book printISBN="978-1-118-06331-6" webISBN="1-118063-31-7" pubDate="Dec 20 2011">
    <author>Cay Horstmann</author>
    <title>
        Java for Everyone: Late Objects
    </title>
    ...
</book>

XML Tags

Any text in XML that starts with < and ends with the first > to follow it is called a tag.
There are several kinds of special tags (such as comment tags), which begin with <? or <! .

All other tags may contain at most one / character (outside of attribute values).
Start-tags contain no / characters.
    <example foo="bar">
End-tags begin with </ .
    </example>
Empty-element tags end with /> .
    <example foo="bar" />

Markup vs. Character Data

Every start-tag must be followed by a matching end-tag.
The end-tag for a start-tag must appear before the end-tag for an earlier start-tag (no element overlap).
    Example: <a><b></a></b> is not allowed.

Tags constitute XML “code”. Collectively, tags are called markup.
One exception: text within the special CDATA tag is not considered markup. This is because the CDATA tag acts as an escape code for raw data. We won’t be using these fields, however.

Anything that isn’t markup is character data.
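The no-overlap rule is the same discipline as matching parentheses, so a stack can check it. Below is a rough Java sketch under loud assumptions: the class NestingCheck and method isProperlyNested are hypothetical, the regular expression only recognizes simple start-tags, end-tags, and empty-element tags, and it ignores comments, CDATA, <? ... ?> tags, and attribute values that contain > characters.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestingCheck {
    // Group 1: "/" for an end-tag. Group 2: tag name. Group 3: "/" for an empty-element tag.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*?(/?)>");

    // True if every start-tag is closed by a matching end-tag with no overlap.
    static boolean isProperlyNested(String text) {
        Deque<String> open = new ArrayDeque<>();  // names of start-tags not yet closed
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            String name = m.group(2);
            boolean isEndTag = !m.group(1).isEmpty();
            boolean isEmptyElementTag = !m.group(3).isEmpty();
            if (isEndTag) {
                // An end-tag must close the most recently opened element.
                if (open.isEmpty() || !open.pop().equals(name)) {
                    return false;
                }
            } else if (!isEmptyElementTag) {
                open.push(name);
            }
            // An empty-element tag opens and closes itself, so nothing to track.
        }
        return open.isEmpty();  // anything left over was never closed
    }

    public static void main(String[] args) {
        System.out.println(isProperlyNested("<a><b>text</b></a>"));   // true
        System.out.println(isProperlyNested("<a><b></a></b>"));       // false (overlap)
        System.out.println(isProperlyNested("<a foo=\"bar\"><c /></a>")); // true
    }
}
```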

Elements

An element consists of a start-tag, its matching end-tag, and everything between the two tags (which may include nested elements).

An empty element is an element that holds nothing between its tags.

An empty-element tag is the preferred way to represent a start-tag immediately followed by its end-tag (but both forms are acceptable XML). In this case, the tag by itself is an entire element.
    <a foo="bar"></a> = <a foo="bar" />

XML Documents

Every XML document consists of two things:

First, the prolog: an XML declaration followed by at most one document type declaration.
    <?xml version="1.0"?>
    <!DOCTYPE greeting SYSTEM "hello.dtd">

Last, exactly one element (called the root element).
    <tagname maybeAttribute="value" etc.>
        ... (everything else goes here)
    </tagname>

The only things that can appear outside of these areas are processing instructions (special tags used with some applications), comments, and whitespace.

Attributes

Start-tags and empty-element tags may have attributes. Attributes are (name, value) pairs which appear after the tag name, in the format attribute="value":
    <tagname attr1="val1" attr2="val2" />

Each attribute within the same tag must use a unique name, but tag names need not be unique.

    <hi attr1="foo" attr2="foo">
        <hi>Same tag name.</hi>
    </hi>

Tags and Character Data Indices

<tag>                        (index 0; the root)
    Character data.          (index 0, then index 0)
    <tag>Char-data!</tag>    (index 0, then index 1; its character data is at index 0, then index 1, then index 0)
    <tag />                  (index 0, then index 2)
</tag>

Tags and their content may look identical, so how can you tell them apart?

XML is ordered, so tags and content may be identified by the order in which they appear.

Full XML Example

<?xml version="1.0" encoding="UTF-8"?>
<roottag attrib1="val1" attrib2="val2">
    <child0>
        <gchild0>Char-data</gchild0>
        <gchild1>Char-data</gchild1>
    </child0>
    More character data at index 1.
    <selfclosingchild2 />
    More data at index 3.
    <child4 attrib1="val1">
        Even more char-data.
    </child4>
</roottag>

<child0> is the first element (index 0) beneath the root (also index 0). It contains two elements with sub-indices 0 and 1, each of which contains character data held at index 0.
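The indexing in this example can be reproduced with the XML parser that ships with Java (javax.xml.parsers and org.w3c.dom). This is a sketch, not the course's own tooling: the class ChildIndices and method describeRootChildren are hypothetical, and it assumes that text nodes made only of whitespace (the indentation between tags) should be skipped so that the numbering matches the one used above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ChildIndices {
    // Lists the children of the root element in document order,
    // skipping text nodes that hold only whitespace.
    static List<String> describeRootChildren(String xml) throws Exception {
        Node root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        List<String> result = new ArrayList<>();
        NodeList kids = root.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node kid = kids.item(i);
            if (kid.getNodeType() == Node.ELEMENT_NODE) {
                result.add("<" + kid.getNodeName() + ">");
            } else if (kid.getNodeType() == Node.TEXT_NODE
                    && !kid.getTextContent().trim().isEmpty()) {
                result.add(kid.getTextContent().trim());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<roottag attrib1=\"val1\" attrib2=\"val2\">\n"
                + "  <child0><gchild0>Char-data</gchild0><gchild1>Char-data</gchild1></child0>\n"
                + "  More character data at index 1.\n"
                + "  <selfclosingchild2 />\n"
                + "  More data at index 3.\n"
                + "  <child4 attrib1=\"val1\">Even more char-data.</child4>\n"
                + "</roottag>";
        List<String> kids = describeRootChildren(xml);
        for (int i = 0; i < kids.size(); i++) {
            System.out.println(i + ": " + kids.get(i));
        }
    }
}
```

Without the whitespace filter, the parser would report extra text nodes between the elements and the indices would no longer line up with the ones in the example.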

Is HTML Just a Kind of XML?*

Mostly! Well-formed HTML is often a specific form of XML, but some standards (like HTML 5.0) allow a few things XML forbids.

Also, most HTML documents are not well-formed.
To make HTML well-formed XML, you’d need to replace ill-formed HTML such as:

    <p>Paragraph text.
    <br>More text on next line.

With something like:

    <p>Paragraph text.
    <br />More text on next line.</p>



Even though most HTML out there is not well-formed, it’s still a good idea to stick with well-formed HTML. Using it helps to ensure that a page should look the same on any browser.*

*Except, of course, for IE. Derp.

Tree Structure

Notice that the restrictions on how we can nest tags force a hierarchical structure. This means XML can be represented easily as a tree!

[Figure: a tree diagram. The root node <root> (attr1 = "value1", attr2 = "value2") sits at the top; <tag> nodes, some with their own attributes, appear beneath it; character data appears at the leaves.]

Representing Tags and Character Data

One easy way to represent XML as a tree structure is to treat each element as a separate node.

Nodes can be arranged hierarchically starting from the root. Each node will be labeled with the tag name for that element, and also contain information about all the attribute (name, value) pairs.

To handle character data, we can treat it as a separate node under the element it appears within.

XML as a Tree

<root attr="val">
    <mid attr1="val1" attr2="val2">
        <bottom>Hey there!</bottom>
        <fluttershy />
    </mid>
    <stuff is="example" />
</root>

The corresponding tree (each level of indentation is one level down):

<root>  attr = "val"
    <mid>  attr1 = "val1", attr2 = "val2"
        <bottom>
            Hey there!
        <fluttershy>
    <stuff>  is = "example"

Character data will always appear as a leaf of the tree, since it holds no elements inside.

Labels and Attributes

This design simplifies how we can describe (and use) such a tree.
An “XML tree object” consists of:

1. A single node that holds character data, or…

2. A single node that holds an element (identified by its tag name), plus zero or more XML tree objects indexed (starting from 0) in the order in which each appeared in the XML document.

Notice how case 1 is not a valid XML document! It’s still valuable to include it, however, because it makes it much easier to define our model.

XMLTree

Formally, an XMLTree object is modeled by an ordered tree of nodes, where a node is an ordered triple: ( isTag, label, set of (name, value) ).
In this triple, isTag is a boolean value, while label, name, and value are each strings of character.
There are two further constraints:

A node is not permitted to have the same name appear with different values in its set of (name, value) pairs. In other words, attribute names are unique within each tag.

If isTag is false, then the set must be empty and this node must be a leaf in the tree. In other words, character data can’t have attributes or child nodes.
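The model and its two constraints can be sketched in Java as follows. This is not the real XMLTree implementation, only an illustration of the math: the class ModelNode, its fields, and the factory methods text and tag are hypothetical. A Map is used for the set of (name, value) pairs because a map cannot hold the same name twice, which covers the first constraint; the constructor checks the second.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ModelNode {
    final boolean isTag;                   // true for an element, false for character data
    final String label;                    // tag name, or the character data itself
    final Map<String, String> attributes;  // name -> value; names cannot repeat
    final List<ModelNode> children;        // ordered subtrees, indexed from 0

    ModelNode(boolean isTag, String label,
            Map<String, String> attributes, List<ModelNode> children) {
        // Second constraint: character data has no attributes and must be a leaf.
        if (!isTag && (!attributes.isEmpty() || !children.isEmpty())) {
            throw new IllegalArgumentException(
                    "character data cannot have attributes or children");
        }
        this.isTag = isTag;
        this.label = label;
        this.attributes = Collections.unmodifiableMap(
                new LinkedHashMap<String, String>(attributes));
        this.children = Collections.unmodifiableList(
                new ArrayList<ModelNode>(children));
    }

    // Case 1: a single node that holds character data.
    static ModelNode text(String data) {
        return new ModelNode(false, data,
                Collections.<String, String>emptyMap(),
                Collections.<ModelNode>emptyList());
    }

    // Case 2: an element plus zero or more subtrees, in document order.
    static ModelNode tag(String name, Map<String, String> attributes,
            ModelNode... children) {
        return new ModelNode(true, name, attributes, Arrays.asList(children));
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("attr", "val");
        Map<String, String> none = new LinkedHashMap<>();
        ModelNode root = tag("root", attrs, tag("bottom", none, text("Hey there!")));
        System.out.println(root.children.get(0).children.get(0).label); // Hey there!
    }
}
```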

XMLTree Instance Methods

Iterator<String> attributeIterator()
String attributeValue(String name)
XMLTree child(int k)
display()
boolean hasAttribute(String name)
boolean isTag()
String label()
int numberOfChildren()
String toString()

Working with XMLTree

Every XMLTree is either:

A valid representation of an XML document (or, a portion of one, which could stand on its own)…

…or else it is simply character data (e.g., a String).

You can always access the root and:
    Check to see if it’s a tag with isTag().
    Use label() to get the tag name or character data.

If it’s a tag, you can also:
    Check the numberOfChildren().
    Get a particular child(int k) tree (the one at index k).
    Look at attributes with various other functions.

Attribute and Display Methods

You can use attributeIterator() to produce an Iterator<String> holding all of the attribute names.

Alternatively, you can check to see if a particular attribute name exists with hasAttribute().

To get the value for an attribute, pass the name of the attribute to attributeValue().

The display() method does a fancy display of your XMLTree in a new window, and toString() has been written so if you treat an XMLTree like a String, it will actually be legible.
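Here is a rough Java sketch of how the three attribute methods fit together. Since the real XMLTree class is not available in a standalone snippet, it uses a hypothetical stand-in interface (Attributes) that copies only the three method signatures listed earlier, backed by a plain Map; AttributeDemo, fromMap, and describe are invented names, and the stand-in's behavior is an assumption rather than a statement about the actual component.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

class AttributeDemo {
    // Stand-in for the attribute-related part of XMLTree (hypothetical).
    interface Attributes {
        Iterator<String> attributeIterator();
        boolean hasAttribute(String name);
        String attributeValue(String name);
    }

    // Backs the stand-in with a map, only so the sketch can run.
    static Attributes fromMap(final Map<String, String> map) {
        return new Attributes() {
            public Iterator<String> attributeIterator() {
                return map.keySet().iterator();
            }
            public boolean hasAttribute(String name) {
                return map.containsKey(name);
            }
            public String attributeValue(String name) {
                return map.get(name);
            }
        };
    }

    // Walks every attribute name and looks up its value.
    static String describe(Attributes tag) {
        StringBuilder out = new StringBuilder();
        Iterator<String> names = tag.attributeIterator();
        while (names.hasNext()) {
            String name = names.next();
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(name).append("=\"").append(tag.attributeValue(name)).append('"');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> map = new LinkedHashMap<>();
        map.put("attr1", "val1");
        map.put("attr2", "val2");
        Attributes tag = fromMap(map);
        System.out.println(describe(tag));              // attr1="val1" attr2="val2"
        // Check first when you are not sure the attribute exists.
        if (tag.hasAttribute("attr2")) {
            System.out.println(tag.attributeValue("attr2")); // val2
        }
    }
}
```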

How to Handle Unfamiliar Trees…?*

It should be obvious that you can write code to allow someone to look through a tree, or pull apart a tree if you know its structure in advance.

But how do you write code to blindly work with a tree you’ve never seen before?
    How could you write the code to print out an entire XMLTree, if you didn’t have display() or toString()?

How could you copy all parts of an XMLTree object?

Since we defined XMLTree recursively, you’d need to use recursion to do this simply! We’ll take a closer look at this topic later on in the course.
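As a preview, here is one possible shape for such a recursive method in Java. It is a sketch only: the Tree interface is a hypothetical stand-in that copies four method signatures from the XMLTree list (isTag, label, numberOfChildren, child), and Node, tag, text, and render are invented so the example can run without the course library.

```java
import java.util.Arrays;
import java.util.List;

class RecursivePrint {
    // Stand-in exposing four of the XMLTree methods named earlier (hypothetical).
    interface Tree {
        boolean isTag();
        String label();
        int numberOfChildren();
        Tree child(int k);
    }

    // Minimal in-memory tree, only so the sketch can run.
    static final class Node implements Tree {
        private final boolean tag;
        private final String label;
        private final List<Tree> kids;

        Node(boolean tag, String label, Tree... kids) {
            this.tag = tag;
            this.label = label;
            this.kids = Arrays.asList(kids);
        }

        public boolean isTag() { return tag; }
        public String label() { return label; }
        public int numberOfChildren() { return kids.size(); }
        public Tree child(int k) { return kids.get(k); }
    }

    static Tree tag(String name, Tree... kids) { return new Node(true, name, kids); }
    static Tree text(String data) { return new Node(false, data); }

    // Renders the tree one node per line, indented two spaces per level.
    static String render(Tree t, int depth) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        if (!t.isTag()) {
            // Base case: character data is always a leaf.
            return out.append(t.label()).append('\n').toString();
        }
        out.append('<').append(t.label()).append(">\n");
        // Recursive case: every child is itself a smaller tree.
        for (int k = 0; k < t.numberOfChildren(); k++) {
            out.append(render(t.child(k), depth + 1));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Tree root = tag("root",
                tag("mid", tag("bottom", text("Hey there!")), tag("fluttershy")),
                tag("stuff"));
        System.out.print(render(root, 0));
    }
}
```

Note that render never needs to know the shape of the tree in advance: it handles one node, then hands each child to itself.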

RSS

It uses XML. That’s pretty much it.

(details TODO, but not urgent)