Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB

Dec 20, 2015

Page 1: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

Parsing XML into programming languages

JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB

Page 2: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

Parsing XML

• Goal: read XML files into data structures in programming languages

• Possible strategies

– Parse by hand with some reusable libraries

– Parse into generic tree structure

– Parse as sequence of events

– Automagically parse to language-specific objects

Page 3: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

Parsing by-hand

• Advantages

– Complete control

– Good for simple needs: build on top of a regex package

• Disadvantages

– Must write the initial code yourself, even if it later becomes generalized

– Pretty tedious and error-prone

– Gets very hard when using a schema or DTD to validate
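To make the trade-off concrete, here is a minimal sketch of the by-hand approach built on java.util.regex. The class name HandParser and the <name> element are made up for illustration; the pattern assumes no attributes, no nested markup, and no entity references, which is exactly where hand-written parsing becomes fragile.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the text of every <name>...</name> element out of a string.
class HandParser {
    // Matches <name>text</name> only when the element has no attributes or child markup.
    private static final Pattern NAME = Pattern.compile("<name>([^<]*)</name>");

    static List<String> extractNames(String xml) {
        List<String> names = new ArrayList<>();
        Matcher m = NAME.matcher(xml);
        while (m.find()) {
            names.add(m.group(1).trim()); // group 1 is the element text
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<students><name>Ada</name><name> Grace </name></students>";
        System.out.println(extractNames(xml));
    }
}
```

Anything the pattern does not anticipate (an attribute on <name>, a comment, a CDATA section) is silently skipped, which illustrates the "tedious and error-prone" point above.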

Page 4: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

Parsing into generic tree structure

• Advantages

– An industry-wide, language-neutral standard exists, called DOM (Document Object Model)

– Learning DOM for one language makes it easy to learn for any other

– As of JAXP 1.2, support for Schema

– Have to write much less code to get XML into something you want to manipulate in your program

• Disadvantages

– Non-intuitive API, doesn’t take full advantage of Java

– Still quite a bit of work

Page 5: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

What is JAXP?

• JAXP: Java API for XML Processing

– In the Java language, the definitions of the standard parsing APIs (DOM and SAX), together with the XSLT API, make up a set of interfaces known as JAXP

– Java also provides standard implementations, together with a vendor pluggability layer

– Some of these come standard with the J2SDK; others are only available with the Java Web Services Developer Pack

– We will study these shortly

Page 6: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

Another alternative

• JDOM: Native Java published API for representing XML as tree

• Like DOM but much more Java-specific, object oriented

• However, not supported by other languages

• Also, no support for schema

• dom4j is another alternative

Page 7: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

JAXB

• JAXB: Java Architecture for XML Binding

• Defines an API for automagically representing an XML schema as a collection of Java classes.

• Most convenient for application programming

• Will cover next class

Page 8: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

DOM

Page 9: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

About DOM

• Stands for Document Object Model

• A World Wide Web Consortium (W3C) standard

• The standard keeps adding new features – Level 3 Core was just released in the past six months

• We’ll cover most of the basics. There’s always more, and it’s always changing.

Page 10: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

DOM abstraction layer in Java -- architecture

[Diagram labels: the factory returns a specific parser implementation; parsing yields an org.w3c.dom.Document]

Emphasis is on allowing vendors to supply their own DOM implementation without requiring changes to source code

Page 11: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

Sample Code

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();

/* set some factory options here */

DocumentBuilder builder = factory.newDocumentBuilder();

Document doc = builder.parse(xmlFile);

• The factory instance determines the parser implementation. It can be changed with a runtime System property. The JDK has a default; Xerces is much better.

• From the factory one obtains an instance of the parser.

• xmlFile can be a java.io.File, an InputStream, etc.

• For reference, the classes used are javax.xml.parsers.DocumentBuilderFactory, javax.xml.parsers.DocumentBuilder, and org.w3c.dom.Document. Notice that the Document class comes from the W3C-specified bindings.
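The three calls above can be put together into a small self-contained sketch. It parses from an in-memory string rather than a file so that it runs without any input on disk; the class name ParseToDom and the sample document are assumptions for illustration, and checked exceptions are wrapped only to keep the helper short to call.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical example class: factory -> builder -> Document.
class ParseToDom {
    static Document parse(String xml) {
        try {
            // The factory picks the parser implementation (JDK default unless overridden).
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            // parse() also accepts a java.io.File or a URI string.
            return builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<course><title>XML</title></course>");
        System.out.println("Root element: " + doc.getDocumentElement().getNodeName());
    }
}
```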

Page 12: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

Validation

• Note that by default the parser will not validate against a schema or DTD

• As of JAXP 1.2, Java provides a default parser that can handle most schema features

• See the next slide for details on how to set this up

Page 13: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

Important: Schema validation

String JAXP_SCHEMA_LANGUAGE =
    "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
String W3C_XML_SCHEMA =
    "http://www.w3.org/2001/XMLSchema";

Next, you need to configure DocumentBuilderFactory to generate a namespace-aware, validating parser that uses XML Schema:

…
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
factory.setValidating(true);
try {
    factory.setAttribute(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
} catch (IllegalArgumentException x) {
    // Happens if the parser does not support JAXP 1.2
    ...
}

Page 14: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

Associating document with schema

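• An XML file can be associated with a schema in two ways

1. Directly in the XML file, in the regular way (an xsi:schemaLocation or xsi:noNamespaceSchemaLocation attribute on the root element)

2. Programmatically from Java

• The latter is done as:

String JAXP_SCHEMA_SOURCE =
    "http://java.sun.com/xml/jaxp/properties/schemaSource";
factory.setAttribute(JAXP_SCHEMA_SOURCE, new File(schemaSource));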
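A rough end-to-end sketch of schema validation with these properties follows. The tiny <note> schema, the class name SchemaValidation, and the choice to pass the schema as an in-memory InputStream (instead of the File shown on the slide) are all assumptions made so the example is self-contained. Validation problems are reported to an ErrorHandler, so the sketch collects them in a list.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: validates a document against an in-memory schema.
class SchemaValidation {
    static final String JAXP_SCHEMA_LANGUAGE =
        "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    static final String JAXP_SCHEMA_SOURCE =
        "http://java.sun.com/xml/jaxp/properties/schemaSource";
    static final String W3C_XML_SCHEMA = "http://www.w3.org/2001/XMLSchema";

    // Made-up schema: a <note> element containing exactly one <to> string.
    static final String SCHEMA =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='note'><xs:complexType><xs:sequence>"
      + "<xs:element name='to' type='xs:string'/>"
      + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    private static ByteArrayInputStream stream(String s) {
        return new ByteArrayInputStream(s.getBytes(StandardCharsets.UTF_8));
    }

    // Returns the validation error messages; an empty list means the document is valid.
    static List<String> validate(String xml) {
        List<String> problems = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(true);
            // Set the schema language before the schema source.
            factory.setAttribute(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
            factory.setAttribute(JAXP_SCHEMA_SOURCE, stream(SCHEMA));
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    problems.add(e.getMessage()); // recoverable validity error
                }
            });
            builder.parse(stream(xml));
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(validate("<note><to>Bob</to></note>"));
        System.out.println(validate("<note><from>Bob</from></note>"));
    }
}
```

Without an ErrorHandler, validity errors are easy to miss, because the parser still returns a Document for a well-formed but invalid file.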

Page 15: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

A few notes

• The factory makes it easy to switch parser implementations

– Java provides a simple DOM implementation, but it is much better to use a vendor-supplied one when doing serious work

– Xerces, part of the Apache project, is installed on the cluster as an Eclipse plugin. We’ll use it next week.

– Note that some properties are not supported by all parser implementations.

Page 16: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

Document object

• Once a Document object is obtained, there is a rich API to manipulate it.

• The first call is usually

Element root = doc.getDocumentElement();

This gets the root element of the Document as an instance of the Element interface

• Note that Element extends Node and has the methods getNodeType(), getNodeName(), getNodeValue(), and getChildNodes()

Page 17: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

Types of Nodes

• Note that there are many types of Nodes (i.e. subinterfaces of Node): Attr, CDATASection, Comment, Document, DocumentFragment, DocumentType, Element, Entity, EntityReference, Notation, ProcessingInstruction, Text

Each of these has a special and non-obvious associated type, value, and name.

The standards are language-neutral and are specified in the chart on the following slide

Important: keep this chart nearby when using DOM

Page 18: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

Node                   nodeName                    nodeValue                             attributes     nodeType
Attr                   Attribute name              Attribute value                       null           2
CDATASection           #cdata-section              CDATA content                         null           4
Comment                #comment                    Comment content                       null           8
Document               #document                   null                                  null           9
DocumentFragment       #document-fragment          null                                  null           11
DocumentType           Document type name          null                                  null           10
Element                Tag name                    null                                  NamedNodeMap   1
Entity                 Entity name                 null                                  null           6
EntityReference        Name of entity referenced   null                                  null           5
Notation               Notation name               null                                  null           12
ProcessingInstruction  Target                      Entire content excluding the target   null           7
Text                   #text                       Text content                          null           3
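One way to get a feel for the chart is to print the type, name, and value of every node in a small document. The sketch below does that recursively; the class name NodeChart and the sample input are illustrative assumptions, and in Java the chart's columns correspond to getNodeType(), getNodeName(), getNodeValue(), and getAttributes().

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

// Hypothetical example class: dumps "nodeType nodeName = nodeValue" for a whole tree.
class NodeChart {
    // Appends one line describing a single node.
    static void line(Node node, String indent, StringBuilder out) {
        out.append(indent).append(node.getNodeType()).append(' ')
           .append(node.getNodeName()).append(" = ")
           .append(node.getNodeValue()).append('\n');
    }

    // Describes a node, then its attributes (if any), then its children.
    static void describe(Node node, String indent, StringBuilder out) {
        line(node, indent, out);
        NamedNodeMap attrs = node.getAttributes(); // null for everything except Element
        if (attrs != null) {
            for (int i = 0; i < attrs.getLength(); i++) {
                line(attrs.item(i), indent + "  ", out);
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            describe(c, indent + "  ", out);
        }
    }

    static String chart(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            StringBuilder out = new StringBuilder();
            describe(doc, "", out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(chart("<a x='1'>hi<!--c--></a>"));
    }
}
```

Note how the Element line shows a null value: the text "hi" lives in a separate child Text node, which is the most common surprise when starting with DOM.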

Page 19: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

Transforming XML

Page 20: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

The JAXP Transformation Packages

• JAXP Transformation APIs:

– javax.xml.transform

• This package defines the factory class you use to get a Transformer object. You then configure the transformer with input (Source) and output (Result) objects, and invoke its transform() method to make the transformation happen. The source and result objects are created using classes from one of the other three packages.

– javax.xml.transform.dom

• Defines the DOMSource and DOMResult classes that let you use a DOM as an input to or output from a transformation.

– javax.xml.transform.sax

• Defines the SAXSource and SAXResult classes that let you use a SAX event generator as input to a transformation, or deliver SAX events as output to a SAX event processor.

– javax.xml.transform.stream

• Defines the StreamSource and StreamResult classes that let you use an I/O stream as an input to or output from a transformation.
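As a small illustration of how Source and Result objects from different packages can be wired together, the sketch below feeds a StreamSource into a DOMResult, using a transformer with no stylesheet (an identity copy). The class name StreamToDom is an assumption for the example.

```java
import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;

// Hypothetical example class: StreamSource in, DOMResult out.
class StreamToDom {
    static Document toDom(String xml) {
        try {
            // newTransformer() with no stylesheet copies the source to the result unchanged.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            DOMResult result = new DOMResult(); // the transformer creates the Document node
            identity.transform(new StreamSource(new StringReader(xml)), result);
            return (Document) result.getNode();
        } catch (Exception e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        Document doc = toDom("<course><title>XML</title></course>");
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```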

Page 21: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

Transformer Architecture

Page 22: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

Writing DOM to XML

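import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class WriteDOM {
    public static void main(String[] argv) throws Exception {
        // Parse the XML file named on the command line into a DOM tree
        File f = new File(argv[0]);
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(f);

        // A transformer created without a stylesheet copies the source to the result
        TransformerFactory tFactory = TransformerFactory.newInstance();
        Transformer transformer = tFactory.newTransformer();
        DOMSource source = new DOMSource(document);
        StreamResult result = new StreamResult(System.out);
        transformer.transform(source, result);
    }
}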

Page 23: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

Creating a DOM from scratch

• Sometimes you may want to create a DOM tree directly in memory. This is done with:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.newDocument();
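An empty Document is only the starting point. The following sketch shows one plausible way to populate it and then serialize it with a Transformer, as in the WriteDOM example. The element names (students, student), the id attribute, and the class name BuildDom are made up for illustration.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example class: builds a tiny tree in memory and writes it out as text.
class BuildDom {
    static String build() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document document = builder.newDocument();

            // Nodes are always created by the owning Document, then attached.
            Element root = document.createElement("students");
            document.appendChild(root);
            Element student = document.createElement("student");
            student.setAttribute("id", "1");
            student.appendChild(document.createTextNode("Ada"));
            root.appendChild(student);

            // Serialize with an identity transformer, leaving out the XML declaration.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(document), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not build document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build());
    }
}
```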

Page 24: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

Manipulating Nodes

• Once the root node is obtained, typical tree methods exist to manipulate other elements:

boolean node.hasChildNodes()

NodeList node.getChildNodes()

Node node.getNextSibling()

Node node.getParentNode()

String node.getNodeValue()

String node.getNodeName()

String node.getTextContent() (DOM Level 3)

void node.setNodeValue(String nodeValue)

Node node.insertBefore(Node newChild, Node refChild)
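A short sketch of these methods in use: it inserts a new element before the first child of the root and then walks the children in order. The <list>/<item> document and the class name NodeManipulation are assumptions for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical example class: insertBefore() followed by a child traversal.
class NodeManipulation {
    // Prepends an <item>a</item> to the root and returns the item texts, comma-separated.
    static String prependAndList(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element root = doc.getDocumentElement();

            Element first = doc.createElement("item");
            first.setTextContent("a");
            // A null reference child would make insertBefore() append instead.
            root.insertBefore(first, root.getFirstChild());

            StringBuilder texts = new StringBuilder();
            NodeList children = root.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.ELEMENT_NODE) { // skip whitespace text nodes
                    if (texts.length() > 0) {
                        texts.append(',');
                    }
                    texts.append(child.getTextContent());
                }
            }
            return texts.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not process XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(prependAndList("<list><item>b</item><item>c</item></list>"));
    }
}
```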

Page 25: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

SAX

Simple API for XML

Page 26: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

About SAX

• SAX for Java is hosted on SourceForge

• SAX is not a W3C standard

• Originated purely in Java

• Other languages have implemented it in their own ways, based on this prototype

Page 27: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

SAX vs. …

• Please don’t compare unrelated things:

– SAX is an alternative to DOM, but realize that DOM is often built on top of SAX

– SAX and DOM do not compete with JAXP

– They do both compete with JAXB implementations

Page 28: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

How a SAX parser works

• A SAX parser scans an XML stream on the fly and responds to certain parsing events as it encounters them.

• This is very different from digesting an entire XML document into memory.

• Much faster, requires less memory.

• However, you need to reparse if you need to revisit data.

Page 29: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

Obtaining a SAX parser

• Important classes

javax.xml.parsers.SAXParserFactory;

javax.xml.parsers.SAXParser;

javax.xml.parsers.ParserConfigurationException;

//get the parser

SAXParserFactory factory = SAXParserFactory.newInstance();

SAXParser saxParser = factory.newSAXParser();

//parse the document

saxParser.parse( new File(argv[0]), handler);

Page 30: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

DefaultHandler

• Note that an event handler has to be passed to the SAX parser.

• This must implement the interface

org.xml.sax.ContentHandler;

• Easier to extend the adapter

org.xml.sax.helpers.DefaultHandler

Page 31: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

Overriding Handler methods

• Most important methods to override

– void startDocument()

• Called once when document parsing begins

– void endDocument()

• Called once when parsing ends

– void startElement(...)

• Called each time an element begin tag is encountered

– void endElement(...)

• Called each time an element end tag is encountered

– void characters(...)

• Called one or more times, in chunks of unpredictable size, between startElement and endElement calls to deliver accumulated character data
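To see the order in which these callbacks fire, a handler can simply record each event. The sketch below extends DefaultHandler and logs document and element events, including attributes; the class name EventLogger and the log format are assumptions. The characters() callback is left out of the log on purpose, since the number of calls is not predictable.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example handler: records the sequence of SAX events.
class EventLogger extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startDocument() {
        events.add("startDocument");
    }

    @Override
    public void endDocument() {
        events.add("endDocument");
    }

    @Override
    public void startElement(String namespaceURI, String sName, String qName, Attributes attrs) {
        StringBuilder event = new StringBuilder("start:" + qName);
        // Attribute info is obtained by querying the Attributes object by index.
        for (int i = 0; i < attrs.getLength(); i++) {
            event.append(' ').append(attrs.getQName(i)).append('=').append(attrs.getValue(i));
        }
        events.add(event.toString());
    }

    @Override
    public void endElement(String namespaceURI, String sName, String qName) {
        events.add("end:" + qName);
    }

    static List<String> log(String xml) {
        try {
            EventLogger handler = new EventLogger();
            SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
            saxParser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.events;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        log("<a><b x='1'/></a>").forEach(System.out::println);
    }
}
```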

Page 32: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

startElement

• public void startElement(
    String namespaceURI, // namespace URI, if a namespace is associated
    String sName,        // simple (nonqualified) name
    String qName,        // qualified name
    Attributes attrs)    // list of attributes

• Attribute info is obtained by querying the Attributes object.

Page 33: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

Characters

• public void characters(

char buf[], //buffer of chars accumulated

int offset, //index of the first char to read in buf

int len) //number of chars

• Note, characters may be called more than once between begin tag / end tag

• Also, mixed-content elements require careful handling
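Because characters() may be called several times for one element, the usual pattern is to append each chunk to a buffer and only read the buffer in endElement(). Here is a minimal sketch of that pattern; the <student> element and the class name StudentNames are assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example handler: collects the text of every <student> element.
class StudentNames extends DefaultHandler {
    final List<String> names = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inStudent;

    @Override
    public void startElement(String namespaceURI, String sName, String qName, Attributes attrs) {
        if ("student".equals(qName)) {
            inStudent = true;
            text.setLength(0); // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] buf, int offset, int len) {
        if (inStudent) {
            text.append(buf, offset, len); // may be only part of the element's text
        }
    }

    @Override
    public void endElement(String namespaceURI, String sName, String qName) {
        if ("student".equals(qName)) {
            names.add(text.toString().trim()); // the text is complete only now
            inStudent = false;
        }
    }

    static List<String> collect(String xml) {
        try {
            StudentNames handler = new StudentNames();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.names;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(collect(
            "<students><student>Ada</student><student>Grace &amp; Co</student></students>"));
    }
}
```

For mixed content (text interleaved with child elements), a single flag and buffer are not enough; a stack of buffers, one per open element, is one common way to handle it.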

Page 34: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

Entity references

• Recall that entity references are special character sequences for referring to characters that have special meaning in XML syntax

– '<' is &lt;

– '>' is &gt;

• In SAX these are automatically converted and passed to the characters stream unless they are part of a CDATA section
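The difference can be checked with a handler that just concatenates everything passed to characters(). In the sketch below (class name EntityDemo is an assumption), the same five characters appear once as an entity reference and once inside a CDATA section.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: shows what reaches characters() for entities and CDATA.
class EntityDemo {
    static String allText(String xml) {
        StringBuilder text = new StringBuilder();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new DefaultHandler() {
                    @Override
                    public void characters(char[] buf, int offset, int len) {
                        text.append(buf, offset, len);
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The first "&lt;" is an entity reference; the second sits inside CDATA.
        System.out.println(allText("<a>&lt;<![CDATA[&lt;]]></a>"));
    }
}
```

The entity reference arrives as a single '<' character, while the text inside the CDATA section is passed through literally.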

Page 35: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

Choosing a Parser

• Choosing your Parser Implementation

– If no other factory class is specified, the default SAXParserFactory class is used. To use a different manufacturer’s parser, you can change the value of the system property that points to it. You can do that from the command line, like this:

• java -Djavax.xml.parsers.SAXParserFactory=yourFactoryHere ...

• The factory name you specify must be a fully qualified class name (all package prefixes included). For more information, see the documentation of the newInstance() method of the SAXParserFactory class.
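A quick way to check which implementation is actually picked up is to print the class of the factory that newInstance() returns. This is only a diagnostic sketch; the class name WhichParser is an assumption, and the printed class names depend on the JDK and classpath in use.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical diagnostic class: reports the parser factories selected at runtime.
class WhichParser {
    static final String SAX_KEY = "javax.xml.parsers.SAXParserFactory";

    static String saxFactoryClass() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // null here means no override was given with -D on the command line.
        System.out.println(SAX_KEY + " = " + System.getProperty(SAX_KEY));
        System.out.println("SAX factory in use: " + saxFactoryClass());
        System.out.println("DOM factory in use: "
            + DocumentBuilderFactory.newInstance().getClass().getName());
    }
}
```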

Page 36: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

Validating SAX Parsers

String JAXP_SCHEMA_LANGUAGE =
    "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
String W3C_XML_SCHEMA =
    "http://www.w3.org/2001/XMLSchema";

Next, you need to configure SAXParserFactory to generate a namespace-aware, validating parser that uses XML Schema. Unlike DocumentBuilderFactory, SAXParserFactory has no setAttribute() method, so the property is set on the SAXParser itself:

…
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setNamespaceAware(true);
factory.setValidating(true);
SAXParser saxParser = factory.newSAXParser();
try {
    saxParser.setProperty(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
} catch (SAXNotRecognizedException x) {
    // Happens if the parser does not support JAXP 1.2
    ...
}

Page 37: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

Transforming arbitrary data structures using SAX and Transformer

Page 38: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

Goal

• Now that we know SAX and a little about Transformations, there are some cool things we can do.

• One immediate thing is to create XML files from plain text files with the help of a faux SAX parser

• Turns out to be more robust than doing it by hand

Page 39: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

Transformers

• Recall that transformers easily let us go between any source and result by arbitrary wirings of

– StreamSource / StreamResult

– SAXSource / SAXResult

– DOMSource / DOMResult

• We used this to write a DOM tree to an XML file

• Now we will use a SAXSource together with a StreamResult to convert our text file

Page 40: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

Strategy

• We construct our own SAX parser, i.e. a class that implements the XMLReader interface

• This class must have a parse method (among others)

• We use parse to read our input file and fire the appropriate SAX events.

Page 41: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

What?

• What are we really doing here?

• We’re having the SAXParser pretend as though it has encountered certain SAX XML events when it reads the text file.

• Exactly where we pretend these things occur is where the appropriate XML will get written by the transformer

Page 42: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

Main snippet

public static void main(String[] argv) throws Exception {
    // Create the SAX "parser"
    StudentReader parser = new StudentReader();

    // Create the transformer
    TransformerFactory tFactory = TransformerFactory.newInstance();
    Transformer transformer = tFactory.newTransformer();

    // Use the text file as the Transformer source
    FileReader fr = new FileReader("students.txt");
    BufferedReader br = new BufferedReader(fr);
    InputSource inputSource = new InputSource(br);
    SAXSource source = new SAXSource(parser, inputSource);

    // Use text (standard output) as the result
    StreamResult result = new StreamResult(System.out);
    transformer.transform(source, result);
}

Page 43: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

XMLReader implementation

• To have a valid SAXSource we need a class that implements XMLReader interface

public void parse(InputSource input)
public void setContentHandler(ContentHandler handler)
public ContentHandler getContentHandler()
...

• Shown are the important methods for a simple app
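The slides do not show the StudentReader class itself, so the following is only a sketch of what such a class might look like. It assumes a made-up input format (one student name per line) and made-up element names (students, student); the remaining XMLReader methods are filled in as simple getters, setters, and no-ops. The toXml() helper wires the reader into a SAXSource exactly as in the main snippet, but writes to a string.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.DTDHandler;
import org.xml.sax.EntityResolver;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical faux parser: reads plain text and fires SAX events as if it were XML.
class StudentReader implements XMLReader {
    private ContentHandler handler;
    private ErrorHandler errorHandler;
    private EntityResolver resolver;
    private DTDHandler dtdHandler;

    @Override
    public void parse(InputSource input) throws IOException, SAXException {
        if (handler == null || input.getCharacterStream() == null) {
            throw new SAXException("A content handler and a character stream are required");
        }
        BufferedReader lines = new BufferedReader(input.getCharacterStream());
        AttributesImpl none = new AttributesImpl(); // no attributes in this sketch
        handler.startDocument();
        handler.startElement("", "students", "students", none);
        String line;
        while ((line = lines.readLine()) != null) {
            String name = line.trim();
            if (name.isEmpty()) {
                continue; // ignore blank lines
            }
            // Pretend we just saw <student>name</student> in an XML file.
            handler.startElement("", "student", "student", none);
            handler.characters(name.toCharArray(), 0, name.length());
            handler.endElement("", "student", "student");
        }
        handler.endElement("", "students", "students");
        handler.endDocument();
    }

    @Override
    public void parse(String systemId) throws SAXException {
        throw new SAXException("Only character-stream input is supported in this sketch");
    }

    @Override public void setContentHandler(ContentHandler h) { handler = h; }
    @Override public ContentHandler getContentHandler() { return handler; }
    @Override public void setErrorHandler(ErrorHandler h) { errorHandler = h; }
    @Override public ErrorHandler getErrorHandler() { return errorHandler; }
    @Override public void setEntityResolver(EntityResolver r) { resolver = r; }
    @Override public EntityResolver getEntityResolver() { return resolver; }
    @Override public void setDTDHandler(DTDHandler h) { dtdHandler = h; }
    @Override public DTDHandler getDTDHandler() { return dtdHandler; }
    // Features and properties are accepted but ignored.
    @Override public boolean getFeature(String name) { return false; }
    @Override public void setFeature(String name, boolean value) { }
    @Override public Object getProperty(String name) { return null; }
    @Override public void setProperty(String name, Object value) { }

    // Runs the text through the faux parser and an identity transformer.
    static String toXml(String text) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            SAXSource source = new SAXSource(new StudentReader(),
                new InputSource(new StringReader(text)));
            t.transform(source, new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml("Ada\nGrace\n"));
    }
}
```

The benefit over writing the tags by hand is that the transformer's serializer takes care of escaping and well-formedness; the reader only has to decide where elements begin and end.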

Page 44: Parsing XML into programming languages JAXP, DOM, SAX, JDOM/DOM4J, Xerces, Xalan, JAXB.

End