XML Processing Md. Asfak Mahamud KAZ Software Ltd.

Xml processing-by-asfak

Nov 19, 2014


Asfak Mahamud

Slides from my seminar on XML Processing, given on 2011-12-20 at KAZ Software Ltd.
Page 1: Xml processing-by-asfak

XML Processing

Md. Asfak Mahamud, KAZ Software Ltd.

Page 2: Xml processing-by-asfak

XML and Other Markup Languages

SGML (1973)

HTML (1989)

XML (1996)

“XML has several favorable attributes that distinguish it from other competing technologies.

Programmers find XML easy to learn because it is human-readable.

The downside, however, is that an XML document needs to be parsed for it to become machine-readable.”

Ref: “XML on a Chip?”, a specially prepared document for Sun Microsystems by XimpleWare [6/9/2003]

Page 3: Xml processing-by-asfak

Regular Language

Regular languages are languages which can be recognized by a computer with finite (i.e. fixed) memory.

Such a computer corresponds to a DFA.

For example, L = {1^n | n is even}

However, there are many languages which cannot be recognized using only finite memory. A simple example is the language

L = {0^n 1^n | n ∈ N}

i.e. the language of words which start with a number of 0s followed by the same number of 1s.

Ref: http://www.cs.nott.ac.uk/~txa/g51mal/notes-3x.pdf
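
The contrast between the two languages can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the second recognizer uses an unbounded counter, which is exactly what a DFA with fixed memory does not have.

```python
def accepts_even_ones(word: str) -> bool:
    """Two-state DFA (even/odd) for L = {1^n | n is even}."""
    state = "even"
    for symbol in word:
        if symbol != "1":
            return False  # symbol outside the alphabet {1}
        state = "odd" if state == "even" else "even"
    return state == "even"


def accepts_zeros_then_ones(word: str) -> bool:
    """Recognizer for L = {0^n 1^n}; needs a counter that can grow without bound."""
    count = 0
    i = 0
    while i < len(word) and word[i] == "0":   # count the leading 0s
        count += 1
        i += 1
    while i < len(word) and word[i] == "1":   # cancel one 0 per 1
        count -= 1
        i += 1
    # The whole word must be consumed and every 0 matched by exactly one 1.
    return i == len(word) and count == 0
```

The first function only ever remembers one of two states, however long the input is. The second must remember how many 0s it has seen, so its memory grows with the input.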

Page 4: Xml processing-by-asfak

XML is not regular

“Well-formed XML is not a regular language, and it cannot be parsed by a finite-state automaton, but rather requires at least a push-down automaton (PDA).”

Ref: A Parallel Approach to XML Parsing, Wei Lu, Kenneth Chiu, Yinfei Pan

This can be proved with the pumping lemma. A proof: http://welbog.homeip.net/glue/53/XML-is-not-regular
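
A rough way to see why a stack is needed: checking that tags nest correctly means remembering every open element until its matching close arrives. The sketch below is a deliberately simplified checker, not a real XML parser. It assumes tags without attributes and ignores comments, CDATA and processing instructions. `is_balanced` is a name invented for this example.

```python
import re

# Matches <name>, </name> and <name/> (no attributes; a simplification).
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)\s*(/?)>")


def is_balanced(xml_text: str) -> bool:
    """Return True if start and end tags nest properly."""
    stack = []  # the push-down store: names of currently open elements
    for closing, name, self_closing in TAG.findall(xml_text):
        if self_closing:
            continue  # <name/> opens and closes at once
        if closing:
            # An end tag must match the most recently opened element.
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)
    return not stack  # every opened element must have been closed
```

For instance, `is_balanced("<a><b></b></a>")` holds, while `is_balanced("<a><b></a></b>")` does not, because `</a>` arrives while `b` is still on top of the stack.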

Page 5: Xml processing-by-asfak

Typical XML Processing

Parsing

Semantic Analysis

Input XML

Output XML

Ref: XML Document Parsing: Operational and Performance Characteristics, Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems), Jyh-Charn Liu (Texas A&M University)

Page 6: Xml processing-by-asfak

Typical XML Processing

Parsing

Access

Modification

Serialization

Input XML

Output XML

Semantic Analysis

Ref: XML Document Parsing: Operational and Performance Characteristics, Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems), Jyh-Charn Liu (Texas A&M University)

Page 7: Xml processing-by-asfak

Typical XML Processing

Parsing

Access

Modification

Serialization

Input XML

Output XML

Performance Bottleneck

Semantic Analysis

Ref: XML Document Parsing: Operational and Performance Characteristics, Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems), Jyh-Charn Liu (Texas A&M University)

Page 8: Xml processing-by-asfak

Typical XML Processing

Parsing

Access

Modification

Serialization

Input XML

Output XML

Performance Bottleneck

Performance affected by parsing models

Semantic Analysis

Ref: XML Document Parsing: Operational and Performance Characteristics, Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems), Jyh-Charn Liu (Texas A&M University)

Page 9: Xml processing-by-asfak

Steps in Parsing

Parsing

Ref: XML Document Parsing: Operational and Performance Characteristics Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems) Jyh-Charn Liu, Texas A&M University

Bit Sequence: 3C 61 3E

Character Conversion

Character Sequence: '<' 'a' '>'

Lexical Analysis (FSM)

Token Sequence: ('<a>' 'X' '</a>')

Syntactic Analysis (PDA)

Data Representation (tree, event, integer array)

Page 10: Xml processing-by-asfak

Steps in Parsing

Parsing

Ref: XML Document Parsing: Operational and Performance Characteristics Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems) Jyh-Charn Liu, Texas A&M University

Bit Sequence: 3C 61 3E

Character Conversion

Character Sequence: '<' 'a' '>'

Lexical Analysis (FSM)

Token Sequence: ('<a>' 'X' '</a>')

Syntactic Analysis (PDA)

Data Representation (tree, event, integer array)

Invariant among different parsing models

Page 11: Xml processing-by-asfak

Steps in Parsing

Parsing

Ref: XML Document Parsing: Operational and Performance Characteristics Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems) Jyh-Charn Liu, Texas A&M University

Bit Sequence: 3C 61 3E

Character Conversion

Character Sequence: '<' 'a' '>'

Lexical Analysis (FSM)

Token Sequence: ('<a>' 'X' '</a>')

Syntactic Analysis (PDA)

Data Representation (tree, event, integer array)

PARSING MODEL DEPENDENT

Invariant among different parsing models

Different among different parsing models
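
The three parsing steps can be walked through on the slide's own example, `<a>X</a>`. The following is a minimal sketch using only the Python standard library. The regular expression and the `build_tree` helper are simplifications made up for this illustration (no attributes, comments or entities), and a tree is just one of the possible data representations. Note that `<` is byte 0x3C in ASCII and UTF-8.

```python
import re

# Step 1: character conversion (bit sequence -> character sequence).
raw = bytes.fromhex("3C 61 3E 58 3C 2F 61 3E")   # the bytes of <a>X</a>
chars = raw.decode("utf-8")

# Step 2: lexical analysis (character sequence -> token sequence).
# A regular expression suffices here, which is why an FSM can do this step.
tokens = re.findall(r"<[^>]+>|[^<]+", chars)


# Step 3: syntactic analysis (token sequence -> data representation).
# Matching end tags to start tags needs a stack, i.e. a PDA.
def build_tree(token_list):
    root = ("#document", [])
    stack = [root]
    for tok in token_list:
        if tok.startswith("</"):
            name = tok[2:-1]
            if len(stack) == 1 or stack[-1][0] != name:
                raise ValueError("unexpected end tag: " + tok)
            stack.pop()
        elif tok.startswith("<"):
            node = (tok[1:-1], [])
            stack[-1][1].append(node)   # attach to the current parent
            stack.append(node)
        else:
            stack[-1][1].append(tok)    # character data
    if len(stack) != 1:
        raise ValueError("unclosed element: " + stack[-1][0])
    return root


tree = build_tree(tokens)
```

Steps 1 and 2 would look the same whichever parsing model is used. Only step 3 changes if the output is a stream of events or an integer array instead of a tree.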

Page 12: Xml processing-by-asfak

Xml Processing: DOM & SAX or StAX

Ref: XML Document Parsing: Operational and Performance Characteristics Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems) Jyh-Charn Liu, Texas A&M University
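
For readers who have not used the two models, here is a small side-by-side sketch with Python's standard library (`xml.dom.minidom` and `xml.sax`). The sample document and the `ColorHandler` class are invented for this example. StAX-style pull parsing is not shown.

```python
import xml.sax
from xml.dom import minidom

DOC = "<colors><color>red</color><color>blue</color></colors>"

# DOM: the whole document becomes an in-memory tree that can be queried freely.
dom = minidom.parseString(DOC)
dom_colors = [node.firstChild.data for node in dom.getElementsByTagName("color")]


# SAX: the parser pushes events; the application keeps only what it needs.
class ColorHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.colors = []
        self._buffer = None   # collects text while inside a <color> element

    def startElement(self, name, attrs):
        if name == "color":
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "color":
            self.colors.append("".join(self._buffer))
            self._buffer = None


handler = ColorHandler()
xml.sax.parseString(DOC.encode("utf-8"), handler)
sax_colors = handler.colors
```

Both approaches yield the same list of colors. The DOM version holds every node in memory at once, while the SAX version sees each event only once and has no way to go back.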

Page 13: Xml processing-by-asfak

Why is DOM memory intensive?

• Overhead of allocating small memory blocks

– The OS pre-divides the heap into linked lists of small, fixed-size free memory blocks, also known as buckets. A request for a small memory block is served with the smallest pre-allocated block that fits the size of the request. For instance, a request to allocate a single byte returns a 16-byte chunk (an 8-byte memory block plus 8 bytes for boundary tags). When the OS has to allocate lots of small memory blocks, the overhead can become very significant.

• Unnecessary decoupling between a node object and its name

– A node object is a small memory block containing a pointer to the node name in the form of a string object, which is another small block. The binding between node object and node name plays right into the weakness of the OS: as if the overhead of small memory blocks weren't bad enough, DOM "knowingly" creates as many small blocks as possible to take advantage of the "overhead."

Ref: “XML on a Chip?”, a specially prepared document for Sun Microsystems by XimpleWare [6/9/2003]

Page 14: Xml processing-by-asfak

Efficiency Problems of DOM and SAX/StAX Parsing Models

• Extractive

Ref: VTD-XML-based Design and Implementation of GML Parsing Project Lan Xiaoji, Su Jianqiang, Cai Jinbao

Page 15: Xml processing-by-asfak

Efficiency Problems of DOM and SAX/StAX Parsing Models (contd.)

• Encoding

Ref: VTD-XML-based Design and Implementation of GML Parsing Project Lan Xiaoji, Su Jianqiang, Cai Jinbao

“Even a small change does the DOM model make on the XML document; it must decode the entire document first, and then build the structure. It is a virtually overhead.”

Page 16: Xml processing-by-asfak

XML Processing: VTD

Virtual Token Descriptor

- developed by XimpleWare.
- dual-licensed under GPL and a proprietary license.
- originally written in Java, but now also available in C, C++ and C#.
- latest version: 2.10 (February 2011)

Page 17: Xml processing-by-asfak

VTD-XML

• Non-Extractive, Document-Centric Parsing

– Traditionally, a lexical analyzer represents tokens (the small units of indivisible character values) as discrete string objects. This approach is designated extractive parsing. In contrast, non-extractive tokenization mandates that one keeps the source text intact, and uses offsets and lengths to describe those tokens.

• Virtual Token Descriptor

– Virtual Token Descriptor (VTD) applies the concept of non-extractive, document-centric parsing to XML processing. A VTD record uses a 64-bit integer to encode the offset, length, token type and nesting depth of a token in an XML document. Because all VTD records are 64-bit in length, they can be stored efficiently and managed as an array.

• Location Cache

– Location Caches (LC) build on VTD records to provide efficient random access. Organized as tables, with one table per nesting depth level, LCs contain entries modeling an XML document's element hierarchy. An LC entry is a 64-bit integer encoding a pair of 32-bit values. The upper 32 bits identify the VTD record for the corresponding element. The lower 32 bits identify that element's first child in the LC at the next lower nesting level.

Ref: http://en.wikipedia.org/wiki/VTD-XML
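
The idea of describing a token by numbers instead of copying it out can be sketched as follows. This is an illustration of the concept only, not the VTD-XML implementation. The field widths (30-bit offset, 20-bit length, 8-bit depth, 4-bit type), the field order and the token type constants are assumptions chosen for this example. All function names are hypothetical.

```python
from array import array

# Assumed field widths for this sketch (62 bits used, 2 left spare).
OFFSET_BITS, LENGTH_BITS, DEPTH_BITS, TYPE_BITS = 30, 20, 8, 4

# Hypothetical token type codes.
TOKEN_START_TAG, TOKEN_TEXT = 0, 5


def pack_record(offset, length, depth, token_type):
    """Encode one token as a single 64-bit integer."""
    fields = ((offset, OFFSET_BITS), (length, LENGTH_BITS),
              (depth, DEPTH_BITS), (token_type, TYPE_BITS))
    for value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError("field does not fit in %d bits" % bits)
    record = token_type
    record = (record << DEPTH_BITS) | depth
    record = (record << LENGTH_BITS) | length
    record = (record << OFFSET_BITS) | offset
    return record


def unpack_record(record):
    """Decode a record back into (offset, length, depth, token_type)."""
    offset = record & ((1 << OFFSET_BITS) - 1)
    record >>= OFFSET_BITS
    length = record & ((1 << LENGTH_BITS) - 1)
    record >>= LENGTH_BITS
    depth = record & ((1 << DEPTH_BITS) - 1)
    record >>= DEPTH_BITS
    token_type = record & ((1 << TYPE_BITS) - 1)
    return offset, length, depth, token_type


def token_text(source, record):
    """Non-extractive access: slice the untouched source on demand."""
    offset, length, _, _ = unpack_record(record)
    return source[offset:offset + length]


def pack_lc_entry(vtd_index, first_child_index):
    """Upper 32 bits: VTD record index; lower 32 bits: first child's LC index."""
    return (vtd_index << 32) | first_child_index


def unpack_lc_entry(entry):
    return entry >> 32, entry & 0xFFFFFFFF


source = b"<color>red</color>"
records = array("Q", [                      # "Q" = unsigned 64-bit integers
    pack_record(1, 5, 0, TOKEN_START_TAG),  # element name "color"
    pack_record(7, 3, 0, TOKEN_TEXT),       # text "red"
])
```

Because every record has the same size, the whole parsed representation is one flat array of integers next to the original bytes. No string object is created per token.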

Page 18: Xml processing-by-asfak

VTD: inside VTD record

Ref: XML Document Parsing: Operational and Performance Characteristics Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems) Jyh-Charn Liu, Texas A&M University

Page 19: Xml processing-by-asfak

Xml Processing: VTD

Ref: XML Document Parsing: Operational and Performance Characteristics Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems) Jyh-Charn Liu, Texas A&M University

Page 20: Xml processing-by-asfak

VTD-XML

Parsed Representation of XML. Image: http://vtd-xml.sourceforge.net/technical/2.html

Page 21: Xml processing-by-asfak

VTD-XML

Resolving child elements using Location Cache. Image: http://vtd-xml.sourceforge.net/technical/2.html

Page 22: Xml processing-by-asfak

James Clark (in 2002)

“Improve XML processing models.

Right now, developers are generally caught between the inefficiencies of DOM and the unfamiliar feel of SAX.

An API that offers the best of both is needed.”

Ref: Keeping pace with James Clark https://www.ibm.com/developerworks/xml/library/x-jclark.html?dwzone=xml

http://www.jclark.com/bio.htm

Page 23: Xml processing-by-asfak

VTD-XML has both DOM and SAX like features.

“After the parser finishes processing XML, the processing model provides two views of the underlying XML data.

The first is a flat view of all VTD records corresponding to all tokens in XML in document order, it can be thought of as a view of cached SAX events.

The second is a hierarchical view enabled by a cursor-based navigation API allowing for DOM-like random access within the document. And the cursor always points to the VTD record of the current element.”

Ref: http://vtd-xml.sourceforge.net/technical/3.html
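
The two views can be imitated over a small, hand-built list of token records. This is a conceptual sketch only: the `Record` and `Cursor` types and their method names are invented here, and the cursor finds children and siblings by scanning nesting depths rather than by using location caches as VTD-XML does.

```python
from typing import NamedTuple


class Record(NamedTuple):
    kind: str     # "element" or "text" (labels made up for this sketch)
    depth: int
    offset: int
    length: int


SOURCE = "<a><b>x</b><c>y</c></a>"
RECORDS = [
    Record("element", 0, 1, 1),    # a
    Record("element", 1, 4, 1),    # b
    Record("text", 1, 6, 1),       # x
    Record("element", 1, 12, 1),   # c
    Record("text", 1, 14, 1),      # y
]


def text_of(rec):
    return SOURCE[rec.offset:rec.offset + rec.length]


def flat_view():
    """View 1: all records in document order, like replaying cached SAX events."""
    return [(rec.kind, text_of(rec)) for rec in RECORDS]


class Cursor:
    """View 2: DOM-like navigation; always points at an element record."""

    def __init__(self):
        self.index = 0

    def name(self):
        return text_of(RECORDS[self.index])

    def _move(self, target_depth, stop_depth):
        # Scan forward for the next element at target_depth; give up once an
        # element at stop_depth or shallower shows we left the subtree.
        for i in range(self.index + 1, len(RECORDS)):
            rec = RECORDS[i]
            if rec.kind != "element":
                continue
            if rec.depth <= stop_depth:
                return False
            if rec.depth == target_depth:
                self.index = i
                return True
        return False

    def to_first_child(self):
        depth = RECORDS[self.index].depth
        return self._move(depth + 1, depth)

    def to_next_sibling(self):
        depth = RECORDS[self.index].depth
        return self._move(depth, depth - 1)
```

Both views read the same array. Nothing is copied or rebuilt when switching from sequential to random access.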

Page 24: Xml processing-by-asfak

Demo

Page 25: Xml processing-by-asfak

VTD

Most memory-efficient (1.3x~1.5x the size of an XML document) random-access XML parser.

Ref: http://vtd-xml.sourceforge.net/benchmark4.html http://vtd-xml.sourceforge.net/technical/2.html

n1 = total number of tokens (including ending tags)
n2 = number of tokens for starting tags
s = size of the document (in bytes)

8 × (n1 - n2) = total size of VTD records in bytes (without ending tags)

8 × n2 = total size of LCs (totally indexed, i.e. one LC entry per element)

Memory usage in bytes: s + 8 × (n1 - n2) + 8 × n2 = s + 8 × n1
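
As a worked example of the formula, the snippet below plugs in made-up numbers: a 1,000,000-byte document with 50,000 tokens, of which 20,000 are starting tags. The figures are hypothetical and chosen only so that the result lands inside the 1.3x~1.5x range quoted above.

```python
def vtd_memory_bytes(s, n1, n2):
    """Estimated memory use: document + VTD records + location cache entries."""
    vtd_bytes = 8 * (n1 - n2)   # one 8-byte record per token, none for ending tags
    lc_bytes = 8 * n2           # one 8-byte LC entry per element
    return s + vtd_bytes + lc_bytes


s, n1, n2 = 1_000_000, 50_000, 20_000     # hypothetical document
total = vtd_memory_bytes(s, n1, n2)       # 1_000_000 + 240_000 + 160_000
ratio = total / s
```

Here `total` is 1,400,000 bytes, i.e. 1.4 times the document size, and it equals `s + 8 * n1` as the last line of the slide states.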

Page 26: Xml processing-by-asfak

VTD

Fastest XML parser

Fastest  XPath 1.0 implementation

Ref: http://vtd-xml.sourceforge.net/benchmark4.html

Page 27: Xml processing-by-asfak

VTD

• World's only incremental-update capable XML parser, capable of cutting, pasting, splitting and assembling XML documents with maximum efficiency.

– Ref: http://www.javaworld.com/javaworld/jw-07-2006/jw-0724-vtdxml.html

• World's only XML parser that allows you to use XPath to process 256 GB XML documents.

Ref: http://vtd-xml.sourceforge.net

Page 28: Xml processing-by-asfak

Incremental Update (Do not touch un-required content)

Problem: Change ‘red’ to ‘blue’

<color> red
</color>

Human Approach:

1. Open the file with a simple notepad
2. Move the cursor to the start of the text node
3. Replace "red" with "blue"

DOM Approach:

1. Build the DOM tree
2. Navigate to and then update the text node
3. Write the updated structure back into XML

Ref: http://www.javaworld.com/javaworld/jw-07-2006/jw-0724-vtdxml.html

“if we humans can edit XML like this, why can't XML parsers”

- Jimmy Zhang, JavaWorld.com, 07/24/06
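
The "human approach" can be mimicked in code by locating the text node's byte range and splicing in the new value, leaving every other byte untouched. The sketch below illustrates that idea. It is not the VTD-XML API. The helper names are invented, and the tag search is a simplification that assumes an attribute-free `<color>` element appearing once.

```python
def text_span(document: bytes, element: bytes):
    """Return (offset, length) of the trimmed text in the first <element>...</element>."""
    open_tag = b"<" + element + b">"
    close_tag = b"</" + element + b">"
    start = document.index(open_tag) + len(open_tag)
    end = document.index(close_tag, start)
    raw = document[start:end]
    stripped = raw.strip()
    offset = start + raw.index(stripped)   # skip leading whitespace
    return offset, len(stripped)


def incremental_update(document: bytes, element: bytes, new_text: bytes) -> bytes:
    """Replace only the text token; everything else is copied byte for byte."""
    offset, length = text_span(document, element)
    return document[:offset] + new_text + document[offset + length:]


doc = b"<palette>\n  <!-- keep me -->\n  <color> red\n  </color>\n</palette>\n"
updated = incremental_update(doc, b"color", b"blue")
```

The comment, the indentation and the odd line break inside `<color>` all survive unchanged. A build-modify-serialize round trip through a DOM tree is not guaranteed to preserve them.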

Page 29: Xml processing-by-asfak

Demo: Incremental Update

Page 30: Xml processing-by-asfak

VTD on Android Platform

Ref: Analyzing XML Parsers Performance for Android Platform, M V Uttam Tej, Dhanaraj Cheelu, M. Rajasekhara Babu, P Venkata Krishna (SCSE, VIT University, Vellore, Tamil Nadu)

Page 31: Xml processing-by-asfak

Ref: XML Document Parsing: Operational and Performance Characteristics Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems) Jyh-Charn Liu, Texas A&M University

Page 32: Xml processing-by-asfak

Comparisons (contd.)

Ref: XML Document Parsing: Operational and Performance Characteristics Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems) Jyh-Charn Liu, Texas A&M University

Page 33: Xml processing-by-asfak

Comparisons (contd.)

Ref: XML Document Parsing: Operational and Performance Characteristics Tak Cheung Lam and Jianxun Jason Ding (Cisco Systems) Jyh-Charn Liu, Texas A&M University

Page 34: Xml processing-by-asfak

VTD-XML’s Limitations

• As a file format, it increases the document size by about 30% to 50%.

• As an API, it is not compatible with DOM or SAX.

• It is difficult to support certain validation techniques, employed by DTD and XML Schema (e.g., default attributes and elements), that require modifications to the XML instances being parsed.

Ref: http://en.wikipedia.org/wiki/VTD-XML

Page 35: Xml processing-by-asfak

Parallel Approach to XML Parsing

Ref: A Parallel Approach to XML Parsing, Wei Lu, Kenneth Chiu, Yinfei Pan

Page 36: Xml processing-by-asfak

Parallel Approach to XML Parsing (cont.)

Ref: A Parallel Approach to XML Parsing, Wei Lu, Kenneth Chiu, Yinfei Pan

Page 37: Xml processing-by-asfak

Limitations of PXP

“First, the skeleton requires extra memory that is proportional to the number of node in the DOM tree.

Further, the partitioning scheme based on subtrees can cause load imbalance on processing cores for XML documents with irregular or deep tree structures (e.g., TREEBANK with parts-of-speech tagging [29]).

This scheme severely limits the granularity of parallelism that can be achieved, and thus cannot scale with increasing core count.”

Ref: Section 2.2, “Prior Work on Parallel XML Parsing”, in “A Data Parallel Algorithm for XML DOM Parsing”, Bhavik Shah (1), Praveen R. Rao (1), Bongki Moon (2), and Mohan Rajagopalan (3)

(1) University of Missouri-Kansas City, (2) University of Arizona, (3) Intel Research Labs

Page 38: Xml processing-by-asfak

ParDOM

Ref: “A Data Parallel Algorithm for XML DOM Parsing”, Bhavik Shah (1), Praveen R. Rao (1), Bongki Moon (2), and Mohan Rajagopalan (3)

(1) University of Missouri-Kansas City, (2) University of Arizona, (3) Intel Research Labs

Page 39: Xml processing-by-asfak

ParDOM (contd)

Ref: “A Data Parallel Algorithm for XML DOM Parsing”, Bhavik Shah (1), Praveen R. Rao (1), Bongki Moon (2), and Mohan Rajagopalan (3)

(1) University of Missouri-Kansas City, (2) University of Arizona, (3) Intel Research Labs

Page 40: Xml processing-by-asfak

ParDOM (contd)

Ref: “A Data Parallel Algorithm for XML DOM Parsing”, Bhavik Shah (1), Praveen R. Rao (1), Bongki Moon (2), and Mohan Rajagopalan (3)

(1) University of Missouri-Kansas City, (2) University of Arizona, (3) Intel Research Labs

Page 41: Xml processing-by-asfak

Thank you.