Software Engineering Lecture 3 Annatala Wolf TREES AND XML

Dec 24, 2015

Kristian Burke
Transcript
Page 1: Software Engineering Lecture 3 Annatala Wolf TREES AND XML.

Software Engineering Lecture 3
Annatala Wolf

TREES AND XML

Page 2: Software Engineering Lecture 3 Annatala Wolf TREES AND XML.

Mathematical Modeling

When we consider types abstractly (from the client perspective), we can use math to describe them in an unambiguous manner.

int: integers (0, -14, 9000)
double: real numbers (0.0, 3.14159, -0.00008)
boolean: Boolean values (true, false)
char: “character” Unicode labels ('a', '4', '\t')
String: mathematical string of characters
text streams: mathematical string of characters

Page 3: Software Engineering Lecture 3 Annatala Wolf TREES AND XML.

Basic Types: Sets and Strings

Sets: denoted with { }
A set is an unordered collection of items. Sets don’t retain information about duplicates. The number of elements in a set is its size (or cardinality).
{ } is the empty set (the only set with no elements)
{ 1, 2 } = { 2, 1 } = { 1, 1, 2 }

Strings: denoted with < >
A string is a totally ordered collection of some type, which can grow or shrink. Strings don’t have a size; they have a length.
< 1, 2, 3 > has the element 1 at the front of the list
We use “Crackle” to abbreviate <'C', 'r', 'a', 'c', 'k', 'l', 'e'>.
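The difference between sets and strings can be sketched in Java. This is only an illustration: it assumes java.util.Set is a reasonable stand-in for a mathematical set and java.util.List for a mathematical string, and the class and method names are made up for the example.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SetsAndStrings {
    // Sets ignore order and duplicates: { 1, 2 } = { 2, 1 } = { 1, 1, 2 }.
    static boolean setsEqual() {
        Set<Integer> a = new HashSet<>(List.of(1, 2));
        Set<Integer> b = new HashSet<>(List.of(2, 1));
        Set<Integer> c = new HashSet<>(List.of(1, 1, 2));
        return a.equals(b) && b.equals(c) && c.size() == 2;
    }

    // Strings (modeled here by List) keep both order and duplicates.
    static boolean stringsDiffer() {
        List<Integer> s = List.of(1, 2);
        List<Integer> t = List.of(2, 1);
        List<Integer> u = List.of(1, 1, 2);
        return !s.equals(t) && !s.equals(u) && u.size() == 3;
    }

    public static void main(String[] args) {
        System.out.println("sets equal: " + setsEqual());
        System.out.println("strings differ: " + stringsDiffer());
    }
}
```

Note that the set built from 1, 1, 2 reports a size of 2, while the list built from the same values reports a length of 3.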

Page 4: Software Engineering Lecture 3 Annatala Wolf TREES AND XML.

Basic Types: Tuples

Tuples: denoted with ( )

A tuple is also a totally ordered collection, but the length of a tuple is typically fixed. It also doesn’t matter which item is “first”. All that matters is you can identify which element is which, semantically speaking (when you interpret the meaning).

For example, an ordered pair (x, y) is simply a two-tuple (a tuple of length two). It doesn’t really matter whether you put x or y first, as long as you know which one is x and which one is y!

A database row (name, SSN, birthdate) is another example. Each datum is a separate field in the tuple.

Page 5: Software Engineering Lecture 3 Annatala Wolf TREES AND XML.

Warning: English, Math, and Java

TODO (see Piazza note for now)

Page 6: Software Engineering Lecture 3 Annatala Wolf TREES AND XML.

Seven Bridges of Königsberg*

Classic problem from the 1700s: could you cross each bridge in Königsberg exactly once, in a single path?

This seems like a question math could answer… but at the time, math had not yet been applied to this sort of physical “relationship”.

Euler (the same fellow who named the constant e) proved in 1735 that it was impossible to do this.

Page 7: Software Engineering Lecture 3 Annatala Wolf TREES AND XML.

Graph Theory*

Euler proved this by inventing a new form of math called graph theory, which studies connections. Later, it developed into a related field, topology.

Each region is a vertex, and each bridge is an edge.

Euler noticed all 4 vertices have an odd degree (number of edges). But unless a vertex is at the start or end of the path, it must have an even degree (one edge in = one edge out). A path has only one start and one end, so at most two vertices could have an odd degree. A contradiction!
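Euler’s degree argument can be checked mechanically. The sketch below assumes the usual layout of the seven bridges over four land masses; the labels A through D and all class and method names are illustrative, not part of the lecture.

```java
import java.util.HashMap;
import java.util.Map;

class Konigsberg {
    // The seven bridges as pairs of land masses (A is the central island).
    static final String[][] BRIDGES = {
        {"A", "B"}, {"A", "B"}, {"A", "C"}, {"A", "C"},
        {"A", "D"}, {"B", "D"}, {"C", "D"}
    };

    // Counts how many bridge ends touch each land mass (its degree).
    static Map<String, Integer> degrees(String[][] edges) {
        Map<String, Integer> degree = new HashMap<>();
        for (String[] e : edges) {
            degree.merge(e[0], 1, Integer::sum);
            degree.merge(e[1], 1, Integer::sum);
        }
        return degree;
    }

    // A path that uses every edge exactly once needs 0 or 2 odd-degree
    // vertices. This checks only that degree condition and assumes the
    // graph is connected.
    static boolean passesDegreeTest(String[][] edges) {
        long odd = degrees(edges).values().stream()
                .filter(d -> d % 2 != 0)
                .count();
        return odd == 0 || odd == 2;
    }

    public static void main(String[] args) {
        System.out.println(degrees(BRIDGES));
        System.out.println("degree test passed: " + passesDegreeTest(BRIDGES));
    }
}
```

With these edges, A has degree 5 and the other three land masses have degree 3, so the degree test fails.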

Page 8: Software Engineering Lecture 3 Annatala Wolf TREES AND XML.

Graph Terminology*

Graph: a set of vertices (also called nodes) connected by edges.
Each edge links two vertices, or one vertex with itself (called a self-loop or self-edge).
Some graphs may have edges that only go in one direction, or hold other data (color, order, weight).

A path is a sequence of connected vertices. (A simple path is a path with no repeated vertices.)

A cycle is a path that ends on the same vertex from which it started.

Page 9: Software Engineering Lecture 3 Annatala Wolf TREES AND XML.

Trees*

A tree is a very common and useful type of graph! Three definitions are equivalent.

A tree is any undirected graph where:
1. Every two vertices are connected by exactly one simple path.
2. It is connected (every vertex can reach every other) and acyclic (it contains no simple cycles).
3. It is connected, and has one fewer edge than vertices.

If permitted, it could also be the order-zero graph, which has no nodes or edges: an empty tree, in other words.
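The third definition is the easiest to test in code. The following is a minimal sketch, assuming vertices are numbered 0 to n-1 and each edge is a pair of vertex numbers; the class name TreeCheck and the choice to accept the empty graph are assumptions made for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class TreeCheck {
    // Returns true if the undirected graph is connected and has exactly
    // one fewer edge than vertices (definition 3).
    static boolean isTree(int n, int[][] edges) {
        if (n == 0) {
            return edges.length == 0; // the empty tree, if permitted
        }
        if (edges.length != n - 1) {
            return false;
        }
        // Build adjacency lists.
        List<List<Integer>> adjacent = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            adjacent.add(new ArrayList<>());
        }
        for (int[] e : edges) {
            adjacent.get(e[0]).add(e[1]);
            adjacent.get(e[1]).add(e[0]);
        }
        // Depth-first search from vertex 0, counting reachable vertices.
        boolean[] seen = new boolean[n];
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(0);
        seen[0] = true;
        int reached = 1;
        while (!stack.isEmpty()) {
            int v = stack.pop();
            for (int w : adjacent.get(v)) {
                if (!seen[w]) {
                    seen[w] = true;
                    reached++;
                    stack.push(w);
                }
            }
        }
        return reached == n;
    }

    public static void main(String[] args) {
        System.out.println(isTree(4, new int[][] {{0, 1}, {0, 2}, {2, 3}}));
        System.out.println(isTree(4, new int[][] {{0, 1}, {1, 2}, {2, 0}}));
    }
}
```

The second graph in main has the right number of edges but contains a cycle and leaves vertex 3 unreachable, so it is rejected by the connectivity check.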

Page 10: Software Engineering Lecture 3 Annatala Wolf TREES AND XML.

Rooted Tree Terminology

A rooted tree is a tree with one node specially designated as the root.

Leaves are nodes without children.

Nodes with children are called internal nodes.

Size: total number of nodes in the tree.

Height: total number of levels. (We count the root, but it’s often skipped.)

Page 11: Software Engineering Lecture 3 Annatala Wolf TREES AND XML.

Recursive Structure

Trees have a recursive structure to them. That means you can make a bigger tree by sticking together one or more smaller trees (subtrees).

If our tree is ordered, then we can tell outgoing edges apart. We usually index them starting with zero on the left side.
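Because each subtree is itself a tree, size and height can be computed recursively. Here is a rough sketch of an ordered rooted tree; the class RootedNode and its methods are hypothetical names for this example, and height counts the root as a level.

```java
import java.util.ArrayList;
import java.util.List;

class RootedNode {
    final String label;
    // Children are kept in order; index 0 is the leftmost child.
    final List<RootedNode> children = new ArrayList<>();

    RootedNode(String label) {
        this.label = label;
    }

    // Attaches a subtree as the next child and returns this node for chaining.
    RootedNode add(RootedNode child) {
        children.add(child);
        return this;
    }

    boolean isLeaf() {
        return children.isEmpty();
    }

    // Size: this node plus the sizes of all of its subtrees.
    int size() {
        int total = 1;
        for (RootedNode c : children) {
            total += c.size();
        }
        return total;
    }

    // Height: one level for this node plus the tallest subtree below it.
    int height() {
        int tallest = 0;
        for (RootedNode c : children) {
            tallest = Math.max(tallest, c.height());
        }
        return 1 + tallest;
    }

    public static void main(String[] args) {
        RootedNode root = new RootedNode("r")
                .add(new RootedNode("a").add(new RootedNode("c")))
                .add(new RootedNode("b"));
        System.out.println("size = " + root.size());
        System.out.println("height = " + root.height());
    }
}
```

For the tree built in main, the size is 4 and the height is 3.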

Page 12: Software Engineering Lecture 3 Annatala Wolf TREES AND XML.

XML (eXtensible Markup Language)

XML is a structured text format for storing data.
HTML is (in a sense) a specialized form of XML.

<?xml version="1.0" encoding="UTF-8"?>
<book printISBN="978-1-118-06331-6" webISBN="1-118063-31-7" pubDate="Dec 20 2011">
  <author>Cay Horstmann</author>
  <title>
    Java for Everyone: Late Objects
  </title>
  ...
</book>

Page 13: Software Engineering Lecture 3 Annatala Wolf TREES AND XML.

XML Tags

Any text in XML that starts with < and ends with the first > to follow it is called a tag.
There are several kinds of special tags (such as comment tags), which begin with <? or <! .

All other tags may contain at most one / character (outside of attribute values).
Start-tags contain no / characters.
    <example foo="bar">
End-tags begin with </ .
    </example>
Empty-element tags end with /> .
    <example foo="bar" />

Page 14: Software Engineering Lecture 3 Annatala Wolf TREES AND XML.

Markup vs. Character Data

Every start-tag must be matched by a later end-tag.
The end-tag for a start-tag must appear before the end-tag for an earlier start-tag (no element overlap).
Example: <a><b></a></b> is not allowed.

Tags constitute XML “code”. Collectively, tags are called markup.
One exception: text within the special CDATA tag is not considered markup. This is because the CDATA tag acts as an escape code for raw data. We won’t be using these fields, however.

Anything that isn’t markup is character data.
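The no-overlap rule is naturally checked with a stack: push each start-tag name, and require each end-tag to match the most recently opened name. The sketch below is a simplified scanner, not a real XML parser. It assumes no > characters appear inside attribute values or comments and that there are no CDATA sections, and the class name NestingCheck is made up for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class NestingCheck {
    // Returns true if every start-tag is closed by a matching end-tag
    // and no two elements overlap.
    static boolean properlyNested(String xml) {
        Deque<String> open = new ArrayDeque<>();
        int i = 0;
        while ((i = xml.indexOf('<', i)) >= 0) {
            int end = xml.indexOf('>', i);
            if (end < 0) {
                return false; // a tag that never ends
            }
            String tag = xml.substring(i + 1, end).trim();
            i = end + 1;
            if (tag.startsWith("?") || tag.startsWith("!")) {
                continue; // special tags are skipped
            }
            if (tag.endsWith("/")) {
                continue; // an empty-element tag is a whole element
            }
            if (tag.startsWith("/")) {
                // End-tag: must match the most recently opened start-tag.
                String name = tag.substring(1).trim();
                if (open.isEmpty() || !open.pop().equals(name)) {
                    return false;
                }
            } else {
                // Start-tag: remember the name, ignoring any attributes.
                open.push(tag.split("\\s+")[0]);
            }
        }
        return open.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(properlyNested("<a><b></b></a>"));
        System.out.println(properlyNested("<a><b></a></b>"));
    }
}
```

The second call in main is the overlapping example from the slide, which this check rejects.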

Page 15: Software Engineering Lecture 3 Annatala Wolf TREES AND XML.

Elements

An element consists of a start-tag, its matching end-tag, and everything between the two tags (which may include nested elements).

An empty element is an element that holds nothing between its tags.

An empty-element tag is the preferred way to represent a start-tag immediately followed by its end-tag (but both forms are acceptable XML). In this case, the tag by itself is an entire element.
    <a foo="bar"></a> = <a foo="bar" />

Page 16: Software Engineering Lecture 3 Annatala Wolf TREES AND XML.

XML Documents

Every XML document consists of two things:

First, the prolog: an XML declaration, optionally followed by a document type declaration.
    <?xml version="1.0"?>
    <!DOCTYPE greeting SYSTEM "hello.dtd">
Last, exactly one element (called the root element).
    <tagname maybeAttribute="value" etc.>
      ... (everything else goes here)
    </tagname>

The only things that can appear outside of these areas are processing instructions (special tags used with some applications), comments, and whitespace.

Page 17: Software Engineering Lecture 3 Annatala Wolf TREES AND XML.

Attributes

Start-tags and empty-element tags may have attributes. Attributes are (name, value) pairs which appear after the tag name, in the format attribute="value":
    <tagname attr1="val1" attr2="val2" />

Each attribute within the same tag must use a unique name, but tag names need not be unique.

    <hi attr1="foo" attr2="foo">
      <hi>Same tag name.</hi>
    </hi>

Page 18: Software Engineering Lecture 3 Annatala Wolf TREES AND XML.

Tags and Character Data Indices

<tag> (index 0; the root)

Character data. (index 0, then index 0)

<tag>Char-data!</tag> (index 0, then index 1)

<tag /> (index 0, then index 1, then index 0)

</tag>

Tags and their content may look identical, so how can you tell them apart?

XML is ordered, so tags and content may be identified by which thing appeared first.

Page 19: Software Engineering Lecture 3 Annatala Wolf TREES AND XML.

Full XML Example

<?xml version="1.0" encoding="UTF-8"?>
<roottag attrib1="val1" attrib2="val2">
  <child0>
    <gchild0>Char-data</gchild0>
    <gchild1>Char-data</gchild1>
  </child0>
  More character data at index 1.
  <selfclosingchild2 />
  More data at index 3.
  <child4 attrib1="val1">
    Even more char-data.
  </child4>
</roottag>

The <child0> element is the first element (index 0) beneath the root (also index 0). It contains two elements with sub-indices 0 and 1, each of which contains character data held at index 0.
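The child indices in the example can be reproduced with the DOM parser that ships with the JDK (javax.xml.parsers and org.w3c.dom). This is a sketch using that standard API, not the course’s XMLTree. Skipping whitespace-only text nodes is an assumption made here so that the indices line up with the ones on the slide, and the class name ChildIndices is illustrative.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class ChildIndices {
    // The example document from the slide, written on one line.
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<roottag attrib1=\"val1\" attrib2=\"val2\">"
        + "<child0><gchild0>Char-data</gchild0><gchild1>Char-data</gchild1></child0>"
        + "More character data at index 1."
        + "<selfclosingchild2 />"
        + "More data at index 3."
        + "<child4 attrib1=\"val1\">Even more char-data.</child4>"
        + "</roottag>";

    // Lists the root's children in document order: elements as <name>,
    // character data as trimmed text. Whitespace-only text is skipped.
    static List<String> describeChildren(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            List<String> out = new ArrayList<>();
            NodeList kids = root.getChildNodes();
            for (int k = 0; k < kids.getLength(); k++) {
                Node kid = kids.item(k);
                if (kid.getNodeType() == Node.ELEMENT_NODE) {
                    out.add("<" + kid.getNodeName() + ">");
                } else if (kid.getNodeType() == Node.TEXT_NODE
                        && !kid.getTextContent().trim().isEmpty()) {
                    out.add(kid.getTextContent().trim());
                }
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        List<String> kids = describeChildren(XML);
        for (int k = 0; k < kids.size(); k++) {
            System.out.println(k + ": " + kids.get(k));
        }
    }
}
```

Elements and character data share one sequence of indices, which is why the character data lands at indices 1 and 3 between the element children.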

Page 20: Software Engineering Lecture 3 Annatala Wolf TREES AND XML.

Is HTML Just a Kind of XML?*

Mostly! Well-formed HTML is often a specific form of XML, but some standards (like HTML 5.0) allow a few things XML forbids.

Also, most HTML documents are not well-formed.
To make HTML well-formed XML, you’d need to replace ill-formed HTML such as:

    <p>
    Paragraph text.<br>
    More text on next line.

With something like:

    <p>
    Paragraph text.<br />
    More text on next line.
    </p>

Even though most HTML out there is not well-formed, it’s still a good idea to stick with well-formed HTML. Using it helps to ensure that a page looks the same on any browser.*

*Except, of course, for IE. Derp.

Page 21: Software Engineering Lecture 3 Annatala Wolf TREES AND XML.

Tree Structure

Notice that the restrictions on how we can nest tags force a hierarchical structure. This means XML can be represented easily as a tree!

[Figure: a tree diagram with a <root> node (attr1 = "value1", attr2 = "value2") above several <tag> nodes, some carrying attributes, with Character Data nodes as leaves.]

Page 22: Software Engineering Lecture 3 Annatala Wolf TREES AND XML.

Representing Tags and Character Data

One easy way to represent XML as a tree structure is to treat each element as a separate node.

Nodes can be arranged hierarchically starting from root. Each node will be labeled with the tag for that element, and also contain information about all the attribute (name, value) pairs.

To handle character data, we can treat it as a separate node under the element it appears within.

Page 23: Software Engineering Lecture 3 Annatala Wolf TREES AND XML.

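XML as a Tree

<root attr="val">
  <mid attr1="val1" attr2="val2">
    <bottom>Hey there!</bottom>
    <fluttershy />
  </mid>
  <stuff is="example" />
</root>

[Figure: the same document drawn as a tree. <root> (attr = "val") has two children: <mid> (attr1 = "val1", attr2 = "val2") and <stuff> (is = "example"). <mid> has two children: <bottom> and <fluttershy>. <bottom> has one child, the character data "Hey there!".]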

Character data will always appear as a leaf of the tree, since it holds no elements inside.

Page 24: Software Engineering Lecture 3 Annatala Wolf TREES AND XML.

Labels and Attributes

This design simplifies how we can describe (and use) such a tree.
An “XML tree object” consists of:

1. A single node that holds character data, or…

2. A single node that holds an element (identified by its tag name), plus zero or more XML tree objects indexed (starting from 0) in the order in which each appeared in the XML document.

Notice how case 1 is not a valid XML document! It’s still valuable to include it, however, because it makes it much easier to define our model.

Page 25: Software Engineering Lecture 3 Annatala Wolf TREES AND XML.

XMLTree

Formally, an XMLTree object is modeled by an ordered tree of nodes, where a node is an ordered triple: ( isTag, label, set of (name, value) ).
In this triple, isTag is a boolean value, while label, name, and value are each strings of characters.
There are two further constraints:

An XMLTree is not permitted to have the same name in its set of (name, value) appear with different values. In other words, attribute names are unique within each tag.

If isTag is false, then the set must be empty and this node must be a leaf in the tree. In other words, character data can’t have attributes or child nodes.
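The model and its two constraints can be written down as a small Java class. This is a hypothetical sketch of the mathematical model only, not the actual XMLTree implementation. The class ModelNode and its methods are invented for this example, and it is slightly stricter than the model because it rejects any repeated attribute name, even with the same value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ModelNode {
    final boolean isTag;
    final String label;
    // Keyed by name, so each attribute name maps to exactly one value.
    final Map<String, String> attributes = new LinkedHashMap<>();
    // Ordered children, indexed from 0.
    final List<ModelNode> children = new ArrayList<>();

    ModelNode(boolean isTag, String label) {
        this.isTag = isTag;
        this.label = label;
    }

    // Constraint 1: attribute names are unique within a tag.
    // Constraint 2 (first half): character data has no attributes.
    ModelNode attribute(String name, String value) {
        if (!isTag) {
            throw new IllegalStateException("character data cannot have attributes");
        }
        if (attributes.containsKey(name)) {
            throw new IllegalArgumentException("duplicate attribute: " + name);
        }
        attributes.put(name, value);
        return this;
    }

    // Constraint 2 (second half): character data must be a leaf.
    ModelNode child(ModelNode node) {
        if (!isTag) {
            throw new IllegalStateException("character data cannot have children");
        }
        children.add(node);
        return this;
    }

    public static void main(String[] args) {
        ModelNode bottom = new ModelNode(true, "bottom")
                .child(new ModelNode(false, "Hey there!"));
        ModelNode mid = new ModelNode(true, "mid")
                .attribute("attr1", "val1")
                .attribute("attr2", "val2")
                .child(bottom)
                .child(new ModelNode(true, "fluttershy"));
        System.out.println(mid.label + " has " + mid.children.size() + " children");
    }
}
```

Using a map keyed by attribute name, rather than a set of pairs, makes the uniqueness constraint hard to violate by construction.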

Page 26: Software Engineering Lecture 3 Annatala Wolf TREES AND XML.

XMLTree Instance Methods

Iterator<String> attributeIterator()
String attributeValue(String name)
XMLTree child(int k)
display()
boolean hasAttribute(String name)
boolean isTag()
String label()
int numberOfChildren()
String toString()

Page 27: Software Engineering Lecture 3 Annatala Wolf TREES AND XML.

Working with XMLTree

Every XMLTree is either:

A valid representation of an XML document (or a portion of one, which could stand on its own)…

…or else it is simply character data (e.g., a String).

You can always access the root and:
Check to see if it’s a tag with isTag().
Use label() to get the tag name or character data.

If it’s a tag, you can also:
Check the numberOfChildren().
Get a particular child(int k) tree (the one at index k).
Look at attributes with various other functions.

Page 28: Software Engineering Lecture 3 Annatala Wolf TREES AND XML.

Attribute and Display Methods

You can use attributeIterator() to produce an Iterator<String> holding all of the attribute names.

Alternatively, you can check to see if a particular attribute name exists with hasAttribute().

To get the value for an attribute, pass the name of the attribute to attributeValue().

The display() method does a fancy display of your XMLTree in a new window, and toString() has been written so if you treat an XMLTree like a String, it will actually be legible.

Page 29: Software Engineering Lecture 3 Annatala Wolf TREES AND XML.

How to Handle Unfamiliar Trees…?*

It should be obvious that you can write code to allow someone to look through a tree, or pull apart a tree if you know its structure in advance.

But how do you write code to blindly work with a tree you’ve never seen before?
How could you write the code to print out an entire XMLTree, if you didn’t have display() or toString()?

How could you copy all parts of an XMLTree object?

Since we defined XMLTree recursively, you’d need to use recursion to do this simply! We’ll take a closer look at this topic later on in the course.
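As a preview, here is roughly what such a recursive traversal might look like. The real XMLTree class is not available outside the course library, so this sketch uses a small stand-in type, TreeOutline.Node, that assumes the same four method names (isTag, label, numberOfChildren, child). Everything else in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class TreeOutline {
    // A stand-in for XMLTree with the same method names.
    static class Node {
        private final boolean tag;
        private final String label;
        private final List<Node> kids = new ArrayList<>();

        Node(boolean tag, String label) {
            this.tag = tag;
            this.label = label;
        }

        Node add(Node kid) {
            kids.add(kid);
            return this;
        }

        boolean isTag() { return tag; }
        String label() { return label; }
        int numberOfChildren() { return kids.size(); }
        Node child(int k) { return kids.get(k); }
    }

    // Renders the root on one line, then each subtree one level deeper.
    // The method never needs to know the shape of the tree in advance.
    static String outline(Node t, int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        sb.append(t.isTag() ? "<" + t.label() + ">" : t.label());
        sb.append("\n");
        for (int k = 0; k < t.numberOfChildren(); k++) {
            sb.append(outline(t.child(k), depth + 1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Node root = new Node(true, "root")
                .add(new Node(true, "mid")
                        .add(new Node(true, "bottom").add(new Node(false, "Hey there!")))
                        .add(new Node(true, "fluttershy")))
                .add(new Node(true, "stuff"));
        System.out.print(outline(root, 0));
    }
}
```

The recursion follows the two cases of the model: a node is printed, and if it has children, each one is handled by the same method. Character data has no children, so the loop body never runs for it.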

Page 30: Software Engineering Lecture 3 Annatala Wolf TREES AND XML.

RSS

It uses XML. That’s pretty much it.

(details TODO, but not urgent)