Streaming XML
Kevin Tankersley
Machines and Algorithms for Real-Time XML Processing
Overview
• XML Filtering Networks
– Overview of XML Processing Tasks
– Streaming XML and XML Data Networks
– XPath Expressions and Regular Expressions
– Node-based NFA Machines for XML Filtering
• Other Formal Models for XML Processing
– Specialized pushdown automata
– Specialized context-free grammars
XML Data
• W3C Standard inspired by HTML
– http://www.w3.org/XML/
• Currently used for:
– Defining Data: http://www.w3.org/XML/Schema
– Integrating Systems: http://www.w3.org/TR/soap/ and http://www.w3.org/TR/wsdl
– Formatting Data: http://www.w3.org/Style/XSL/ and http://www.w3.org/TR/xsl/
– Querying Data: http://www.w3.org/TR/xpath and http://www.w3.org/XML/Query/
DOM Processing
Streaming XML Processing
• Reduce memory requirements by performing XML processing tasks as XML data passes through the application
• Example Tasks:
– Validate XML: ensure XML data is well-formed and valid against its DTD/XSD
– Query XML: extract/filter subsets of the XML data for further processing as it passes through the application
• Frameworks:
– JSR 173: Streaming API for XML (StAX), javax.xml.stream
– .NET XML Streams
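As a rough illustration of the streaming idea, the sketch below uses Python's standard `xml.etree.ElementTree.iterparse` as a stand-in for a pull API such as StAX; it is not the StAX API itself. The sample document and the `course` element name are assumptions made for this example. Each element is cleared once it has been handled, so its text and children do not accumulate while the rest of the document is read.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for this sketch.
SAMPLE = b"""<university>
  <department name="CS">
    <course>Automata</course>
    <course>Databases</course>
  </department>
  <department name="Math">
    <course>Algebra</course>
  </department>
</university>"""

def stream_course_titles(source):
    """Yield the text of each <course> element as soon as its end tag is read."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "course":
            yield elem.text
        # Drop the element's text and children once it has been handled.
        elem.clear()

titles = list(stream_course_titles(io.BytesIO(SAMPLE)))
print(titles)  # ['Automata', 'Databases', 'Algebra']
```

A DOM-style program would instead build the whole tree first and query it afterwards.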
Application: XML Data Network
XML Path Language
• XPath Query:
– Location Steps: Axis, Node test, Predicate
• Axes
– Child (default)
– Descendant (//)
– Attribute (@)
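To make the three parts of a location step concrete, here is a minimal sketch that splits an abbreviated XPath query into (axis, node test, predicate) triples. It handles only the three axes listed above and at most one bracketed predicate per step; the `Step` type, the function name, and the sample query are assumptions for illustration, and full XPath syntax is considerably richer.

```python
import re
from typing import NamedTuple, Optional

class Step(NamedTuple):
    axis: str                 # "child", "descendant" or "attribute"
    node_test: str            # an element/attribute name, or "*"
    predicate: Optional[str]  # text inside [...], or None

# One step: a separator, an optional "@", a name or "*", an optional [predicate].
STEP = re.compile(r"(//|/)(@?)([\w.-]+|\*)(?:\[([^\]]*)\])?")

def parse_location_steps(query):
    """Split an abbreviated XPath query into its location steps."""
    steps, pos = [], 0
    while pos < len(query):
        m = STEP.match(query, pos)
        if m is None:
            raise ValueError(f"cannot parse step at position {pos} in {query!r}")
        separator, at_sign, node_test, predicate = m.groups()
        if at_sign:
            axis = "attribute"      # simplification: "@" takes priority over "//"
        elif separator == "//":
            axis = "descendant"
        else:
            axis = "child"          # the default axis
        steps.append(Step(axis, node_test, predicate))
        pos = m.end()
    return steps

for step in parse_location_steps("/university//department[@name='CS']/@id"):
    print(step)
```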
XPath and Regular Expressions
• Consider XPath queries using child and descendant axes, name and * node tests, and no predicates.
• Such queries can be converted to regular expressions:
– [university] N* [department]
– N* [departments] N [courses]
• Input alphabet consists of nodes N
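The conversion can be sketched in a few lines. The code below assumes a root-to-node path is encoded as a string such as `/university/science/department`, so that one node N becomes the pattern `/[^/]+` and N* becomes its repetition. The function name and the encoding are assumptions for this sketch. The two sample queries are inferred from the regular expressions above, since the queries themselves are not shown.

```python
import re

ANY_NODE = r"/[^/]+"   # one node N in a path encoded as "/a/b/c"

def xpath_to_regex(query):
    """Translate a query built from '/', '//', names and '*' into a compiled regex."""
    parts, i = [], 0
    while i < len(query):
        if query.startswith("//", i):
            parts.append("(?:" + ANY_NODE + ")*")   # descendant axis: N*
            i += 2
        elif query[i] == "/":
            i += 1                                  # child axis: nothing to skip
        else:
            raise ValueError(f"expected '/' at position {i} in {query!r}")
        j = query.find("/", i)
        if j == -1:
            j = len(query)
        test = query[i:j]
        if not test:
            raise ValueError(f"empty node test at position {i} in {query!r}")
        # '*' matches any single node; a name matches only that node.
        parts.append(ANY_NODE if test == "*" else "/" + re.escape(test))
        i = j
    return re.compile("".join(parts))

first = xpath_to_regex("/university//department")
second = xpath_to_regex("//departments/*/courses")
print(bool(first.fullmatch("/university/science/department")))  # True
print(bool(second.fullmatch("/departments/courses")))           # False
```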
Designing a Filtering Machine
1. Convert each XPath Query to an NFA
2. Combine into a single NFA
– Take advantage of path sharing [Diao et al., 2003]
3. Convert NFA to a DFA
– Constrain to avoid state explosion
– Lazy construction [Onizuka, 2003]
4. Add indexes
– Stream index [Green et al., 2004]
Example
1. /a/b
2. /a/c
3. /a/b/c
4. /a//b/c
5. /a/*/c
6. /a//c
7. /a/*/*/c
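The seven queries above can be combined into one machine along the lines of the design steps. The sketch below is an illustration in the spirit of path sharing and lazy DFA construction, not a reproduction of the cited systems; the class name `SharedNFA` and its methods are hypothetical. Queries with a common prefix reuse the same states, and DFA transitions are computed only for the state sets and element names that actually occur.

```python
import re

def parse_steps(query):
    """Split '/a//b/*' into (axis, node test) pairs such as ('//', 'b')."""
    return re.findall(r"(//|/)([^/]+)", query)

class SharedNFA:
    """One NFA for many queries; state 0 is the shared start state."""

    def __init__(self):
        self.edges = [{}]    # edges[s][label] -> target; label "*" matches any name
        self.loop_of = {}    # s -> loop state reached from s by an epsilon move
        self.loops = set()   # states with a self-loop on every node (the N* part)
        self.accepts = {}    # accepting state -> ids of the queries ending there
        self.dfa = {}        # lazily filled DFA table: (state set, name) -> state set

    def _new_state(self):
        self.edges.append({})
        return len(self.edges) - 1

    def add_query(self, query_id, query):
        state = 0
        for axis, test in parse_steps(query):
            if axis == "//":
                # Descendant step: detour through a loop state that is shared
                # by every query using '//' at this point.
                if state not in self.loop_of:
                    self.loop_of[state] = self._new_state()
                    self.loops.add(self.loop_of[state])
                state = self.loop_of[state]
            # Reuse an existing edge when another query has the same prefix.
            if test not in self.edges[state]:
                self.edges[state][test] = self._new_state()
            state = self.edges[state][test]
        self.accepts.setdefault(state, set()).add(query_id)

    def _closure(self, states):
        """Add the loop states reachable by epsilon moves."""
        return frozenset(states) | frozenset(
            self.loop_of[s] for s in states if s in self.loop_of)

    def step(self, current, name):
        """DFA transition on one element name, computed only on first use."""
        key = (current, name)
        if key not in self.dfa:
            nxt = set()
            for s in current:
                for label in (name, "*"):
                    if label in self.edges[s]:
                        nxt.add(self.edges[s][label])
                if s in self.loops:
                    nxt.add(s)   # stay in the loop state
            self.dfa[key] = self._closure(nxt)
        return self.dfa[key]

    def matches(self, path):
        """Return the ids of all queries matched by a root-to-node path."""
        current = self._closure({0})
        for name in path:
            current = self.step(current, name)
        found = set()
        for s in current:
            found |= self.accepts.get(s, set())
        return found

QUERIES = ["/a/b", "/a/c", "/a/b/c", "/a//b/c", "/a/*/c", "/a//c", "/a/*/*/c"]
nfa = SharedNFA()
for number, query in enumerate(QUERIES, start=1):
    nfa.add_query(number, query)

print(sorted(nfa.matches(["a", "b", "c"])))  # [3, 4, 5, 6]
```

In a streaming setting, one would typically keep a stack of current state sets, pushing the result of `step` on each start tag and popping on each end tag, so that only the path from the root to the current element is held in memory.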
System Architecture
XML as a Context-Free Language
• XML (unlike HTML) must be properly nested
– <a><b></b></a> : Valid
– <a><b></a></b> : Invalid
• This structure affords the possibility of refining grammars and pushdown automata
• Visibly Pushdown Automata
– Refinement of PDAs to enforce proper nesting of begin and end tags. Originally constructed to analyze call and return sequences in programming languages
• Specialized Document Type Definition
– Refinement of context-free grammars to enforce proper nesting of begin and end tags
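The nesting requirement can be checked with a single stack, which is the intuition behind the visibly pushdown restriction: the kind of input symbol alone decides the stack action (push on a begin tag, pop on an end tag). The following minimal sketch assumes tags without attributes, no self-closing tags, and ignores text content; the function name is hypothetical.

```python
import re

# Matches <name> and </name>; attributes and self-closing tags are not handled.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)>")

def is_properly_nested(text):
    """Return True if every begin tag is closed by a matching end tag, in order."""
    stack = []
    for closing, name in TAG.findall(text):
        if not closing:
            stack.append(name)                  # begin tag: always push
        elif not stack or stack.pop() != name:
            return False                        # end tag: pop and compare
    return not stack                            # nothing may remain open

print(is_properly_nested("<a><b></b></a>"))  # True
print(is_properly_nested("<a><b></a></b>"))  # False
```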
Visibly Pushdown Automata
VPDA Example
Specialized DTDs
• Note that tags must properly wrap all expressions yielded by a production
• Note that an SDTD could be converted to a context-free grammar by replacing specializations with nonterminals and nesting production rules
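The conversion described in the last bullet can be sketched as follows. The SDTD here is a hypothetical example invented for illustration: each specialization (type) is mapped to its tag and to a content model over other types, and two specializations share the `course` tag. The right-hand sides keep the regular-expression operators of the content model, so the result is an extended grammar; `*` and `+` could be eliminated with auxiliary nonterminals in the usual way. `TEXT` is an assumed terminal for character data.

```python
# Hypothetical SDTD: specialization -> (tag, content model over specializations).
SDTD = {
    "Catalog":   ("catalog", "(Undergrad | Grad)*"),
    "Undergrad": ("course",  "Title"),
    "Grad":      ("course",  "Title Prereq+"),
    "Title":     ("title",   "TEXT"),
    "Prereq":    ("prereq",  "TEXT"),
}

def sdtd_to_productions(sdtd):
    """Turn each specialization into a nonterminal whose production wraps
    its content model in the matching begin and end tags."""
    return [
        f"{name} -> <{tag}> {content} </{tag}>"
        for name, (tag, content) in sdtd.items()
    ]

for production in sdtd_to_productions(SDTD):
    print(production)
```

Both `Undergrad` and `Grad` produce `<course>` elements, but with different content, which is what a plain DTD cannot express.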
SDTDs and VPDAs
• Every VPDA can be converted to an equivalent PDA
• Every SDTD can be converted into an equivalent context-free grammar
• VPDAs and SDTDs are equivalent in the same way that CFGs and PDAs are
• XML Applications:
– Automated machine rewriting for Data Integration [Thomo et al., 2008]
– Streaming type checking [Kumar et al., 2007]
– Streaming querying [Kumar et al., 2007]
References