XML Data Management
Zachary G. Ives
University of Pennsylvania
CIS 650 – Implementing Data Management Systems
November 25, 2008
Administrivia
  For next time, please read & review the TurboXPath paper
XML: A Format of Many Uses
  XML has become the standard for data interchange, and for many document representations
  Sometimes we’d like to store it…
    Collections of text documents, e.g., the Web, doc DBs
      … How would we want to query those? IR/text queries, path queries, XQueries?
    Interchanging data
      SOAP messages, RSS, XML streams
      Perhaps subsets of data from RDBMSs
    Storing native, database-like XML data
      Caching
      Logging of XML messages
XML: Hierarchical Data and Its Challenges
  It’s not normalized…
    It conceptually centers around some origin, meaning that navigation becomes central to querying and visualizing
      Contrast with E-R diagrams
    How to store the hierarchy?
    Complex navigation may include going up, sideways in tree
    Updates, locking
    Optimization
  Also, it’s ordered
    May restrict order of evaluation (or at least presentation)
    Makes updates more complex
  Many of these issues aren’t unique to XML
    Semistructured databases, esp. with ordered collections, were similar
    But our efforts in that area basically failed…
Two Ways of Thinking of XML Processing
  XML databases (today)
    Hierarchical storage + locking (Natix, TIMBER, BerkeleyDB, Tamino, …)
    Query optimization
  “Streaming XML” (next time)
    RDBMS XML export
    Partitioning of computation between source and mediator
    “Streaming XPath” engines
  The difference is in storage (or lack thereof)
XML in a Database
  Use a legacy RDBMS
    Shredding [Shanmugasundaram+99] and many others
    Path-based encodings [Cooper+01]
    Region-based encodings [Bruno+02][Chen+04]
    Order preservation in updates [Tatarinov+02], …
    What’s novel here? How does this relate to materialized views and warehousing?
  Native XML databases
    Hierarchical storage (Natix, TIMBER, BerkeleyDB, Tamino, …)
    Updates and locking
    Query optimization (e.g., that on Galax)
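To make "region-based encodings" concrete, here is a minimal sketch of the general idea: number each element with a start and end counter during one depth-first pass, so that ancestor/descendant tests become interval containment. The names `Region`, `encode`, `is_ancestor`, and `is_parent` are made up for illustration, and the exact numbering differs among the cited papers.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple


class Region(NamedTuple):
    tag: str
    start: int   # counter value when the element is opened
    end: int     # counter value when the element is closed
    level: int   # depth in the tree, root = 0


def encode(xml_text: str) -> List[Region]:
    """Assign (start, end, level) to every element in one depth-first pass."""
    rows: List[Region] = []
    counter = 0

    def visit(elem, level):
        nonlocal counter
        counter += 1
        start = counter
        slot = len(rows)
        rows.append(None)  # reserve the slot so rows stay in document order
        for child in elem:
            visit(child, level + 1)
        counter += 1
        rows[slot] = Region(elem.tag, start, counter, level)

    visit(ET.fromstring(xml_text), 0)
    return rows


def is_ancestor(a: Region, d: Region) -> bool:
    # a contains d exactly when d's interval lies strictly inside a's
    return a.start < d.start and d.end < a.end


def is_parent(a: Region, d: Region) -> bool:
    return is_ancestor(a, d) and d.level == a.level + 1


book, title, author, name = encode("<book><title/><author><name/></author></book>")
```

In a relational store, each `Region` would be one row, and a descendant step becomes a join on the two inequality predicates in `is_ancestor`.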
Query Processing for XML
  Why is optimization harder?
    Hierarchy means many more joins (conceptually)
      “traverse”, “tree-match”, “x-scan”, “unnest”, “path”, … op
      Though typically parent-child relationships
      Often don’t have good measure of “fan-out”
      More ways of optimizing this
    Order preservation limits processing in many ways
    Nested content ~ left outer join
      Except that we need to cluster a collection with the parent
      Relationship with NF² approach
    Tags (don’t really add much complexity except in trying to encode efficiently)
    Complex functions and recursion
      Few real DB systems implement these fully
  Why is storage harder?
    That’s the focus of Natix, really
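The "nested content ~ left outer join" point can be illustrated with a small sketch. The data and the function names `left_outer_join` and `nest` are invented for the example: a parent with no children still has to appear in the output, and the flat join result then has to be clustered by parent to recover the nesting.

```python
from itertools import groupby

# Hypothetical flat inputs: (book_id, title) and (book_id, author_name)
books = [(1, "A"), (2, "B")]
authors = [(1, "X"), (1, "Y")]


def left_outer_join(parents, children):
    """Yield one row per (parent, child) pair; childless parents get None."""
    for pid, title in parents:
        matches = [name for cid, name in children if cid == pid]
        for name in matches or [None]:
            yield (pid, title, name)


def nest(rows):
    """Cluster adjacent rows by parent and collect the children into a list."""
    for (pid, title), group in groupby(rows, key=lambda r: r[:2]):
        yield (pid, title, [n for _, _, n in group if n is not None])


flat = list(left_outer_join(books, authors))
nested = list(nest(flat))
```

The `nest` step only works because the join output keeps all rows of one parent adjacent, which is the clustering requirement the slide mentions.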
The Natix System
  In contrast to many pieces of work on XML, focuses on the bottom layers, equivalent to System R’s RSS
    Physical layout
    Indexing
    Locking/concurrency control
    Logging/recovery
Physical Layout
  What are our options in storing XML trees?
    At some level, it’s all smoke-and-mirrors
      Need to map to “flat” byte sequences on disk
    But several options:
      Shred completely, as in many RDBMS mappings
        Each path may get its own contiguous set of pages
          e.g., vectorized XML [Buneman et al.]
        An element may get its 1:1 children
          e.g., shared inlining [Shanmugasundaram+] and [Chen+]
        All content may be in one table
          e.g., [Florescu/Kossmann] and most interval encoded XML
      We may embed a few items on the same page and “overflow” the rest
        How collections are often stored in ORDBMS
      We may try to cluster XML trees on the same page, as “interpreted BLOBs”
        This is Natix’s approach (and also IBM’s DB2)
  Pros and cons of these approaches?
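As a rough sketch of the "all content in one table" option, the code below shreds a document into a single edge table and answers a path query with one self-join per step. The schema (`id`, `parent`, `ord`, `tag`, `text`) and the sample document are assumptions chosen for the example, not the exact layout of any cited system.

```python
import sqlite3
import xml.etree.ElementTree as ET


def shred(xml_text, conn):
    """Store one row per element: its id, parent id, sibling position, tag, text."""
    conn.execute(
        "CREATE TABLE edge (id INTEGER PRIMARY KEY, parent INTEGER, "
        "ord INTEGER, tag TEXT, text TEXT)"
    )
    next_id = 0

    def visit(elem, parent, ordinal):
        nonlocal next_id
        next_id += 1
        node_id = next_id
        conn.execute(
            "INSERT INTO edge VALUES (?, ?, ?, ?, ?)",
            (node_id, parent, ordinal, elem.tag, (elem.text or "").strip()),
        )
        for i, child in enumerate(elem):
            visit(child, node_id, i)

    visit(ET.fromstring(xml_text), None, 0)


DOC = (
    "<bib>"
    "<book><title>A</title><author>X</author></book>"
    "<book><title>B</title><author>Y</author></book>"
    "</bib>"
)
conn = sqlite3.connect(":memory:")
shred(DOC, conn)

# /bib/book/title: each path step costs one more self-join on the edge table
rows = conn.execute(
    """
    SELECT t.text
    FROM edge b
    JOIN edge k ON k.parent = b.id
    JOIN edge t ON t.parent = k.id
    WHERE b.parent IS NULL AND b.tag = 'bib'
      AND k.tag = 'book' AND t.tag = 'title'
    ORDER BY k.ord, t.ord
    """
).fetchall()
titles = [r[0] for r in rows]
```

The `ord` column is what preserves document order. The growing join chain is one reason a hierarchy "means many more joins" for the optimizer.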
Challenges of the Page-per-Tree Approach
  How big of a tree?
  What happens if the XML overflows the tree?
  Natix claims an adaptive approach to choosing the tree’s granularity
    Primarily based on balancing the tree, constraints on children that must appear with a parent
    What other possibilities make sense?
  Natix uses a B+ Tree-like scheme for achieving balance and splitting a tree across pages
Example
  Split point in parent page
  Note “proxy” nodes
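A toy sketch of the proxy idea follows. When a subtree does not fit on a page, some child subtree moves to its own page and a small proxy record takes its place. This is a simplified greedy policy for illustration only, not Natix's actual split algorithm. `Node`, `Proxy`, `PROXY_SIZE`, and `split_into_pages` are hypothetical names, and sizes are in arbitrary units.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

PROXY_SIZE = 1  # assumed cost of a proxy record, in the same units as Node.size


@dataclass
class Proxy:
    page_id: int  # page that holds the real subtree


@dataclass
class Node:
    tag: str
    size: int
    children: List[Union["Node", Proxy]] = field(default_factory=list)


def stored_size(item) -> int:
    """Space an item takes on its own page; a proxy hides its subtree."""
    if isinstance(item, Proxy):
        return PROXY_SIZE
    return item.size + sum(stored_size(c) for c in item.children)


def split_into_pages(root: Node, capacity: int) -> Dict[int, Node]:
    pages: Dict[int, Node] = {}

    def write_page(subtree: Node) -> int:
        page_id = len(pages)
        pages[page_id] = subtree
        return page_id

    def pack(node: Node) -> Node:
        packed = Node(node.tag, node.size, [pack(c) for c in node.children])
        # Greedy policy (an assumption): evict the largest child until it fits
        while stored_size(packed) > capacity:
            real = [i for i, c in enumerate(packed.children) if isinstance(c, Node)]
            if not real:
                raise ValueError(f"<{node.tag}> cannot fit on one page")
            biggest = max(real, key=lambda i: stored_size(packed.children[i]))
            packed.children[biggest] = Proxy(write_page(packed.children[biggest]))
        return packed

    write_page(pack(root))
    return pages


tree = Node("root", 2, [
    Node("a", 3, [Node("a1", 3), Node("a2", 3)]),
    Node("b", 2),
])
pages = split_into_pages(tree, capacity=8)
```

Because each proxy replaces its child in place, sibling order survives the split. A real system would also weigh balance and any "must stay with parent" constraints when choosing what to evict.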
That Was Simple – But What about Updates?
  Clearly, insertions and deletions can affect things
    Deletion may ultimately require us to rebalance
    Ditto with insertion
  But insertion also may make us run out of space – what to do?
    Their approach: add another page; ultimately may need to split at multiple levels, as in B+ Tree
  Others have studied this problem and used integer encoding schemes (plus B+ Trees) for the order
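One simple way to picture an integer encoding for order is to give siblings keys with gaps, so that most insertions only need a fresh key between two neighbors. The sketch below assumes a fixed gap and a full renumbering when a gap runs out. The class `OrderedSiblings` and the constant `GAP` are invented for illustration and are not taken from the cited work.

```python
GAP = 100  # assumed initial spacing between consecutive order keys


class OrderedSiblings:
    def __init__(self, items):
        self.items = list(items)
        self.keys = [(i + 1) * GAP for i in range(len(self.items))]
        self.renumberings = 0

    def insert_at(self, pos, item):
        """Insert item before position pos, picking a key between its neighbors."""
        lo = self.keys[pos - 1] if pos > 0 else 0
        hi = self.keys[pos] if pos < len(self.keys) else lo + 2 * GAP
        if hi - lo < 2:
            # No integer strictly between the neighbors: renumber, then retry
            self._renumber()
            return self.insert_at(pos, item)
        self.keys.insert(pos, (lo + hi) // 2)
        self.items.insert(pos, item)

    def _renumber(self):
        self.keys = [(i + 1) * GAP for i in range(len(self.keys))]
        self.renumberings += 1


s = OrderedSiblings(["a", "c"])
s.insert_at(1, "b")
```

Here `s.keys` becomes `[100, 150, 200]` and no existing key changes. Repeated insertions at the same spot eventually exhaust the gap and force a renumbering, which is the expensive case that the encoding schemes try to make rare.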
Does this Help?
  According to general lore, yes
    The Natix experiments in this paper were limited in their query and adaptivity loads
    But the IBM people say their approach, which is similar, works significantly better than Oracle’s shredded approach
There’s More to Updates than the Pages
  What about concurrency control and recovery?
  We already have a notion of hierarchical locks, but they claim:
    If we want to support IDREF traversal, and indexing directly to nodes, we need more
    What’s the idea behind SPP locking?
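For reference, the baseline "hierarchical locks" idea can be sketched as classic multi-granularity locking: take intention locks (IS/IX) on every ancestor, then S or X on the target. This is the textbook scheme, not SPP locking. The `LockManager` class and the path-tuple representation are assumptions for the example.

```python
# Which already-held modes each requested mode can coexist with
COMPATIBLE = {
    "IS": {"IS", "IX", "S"},
    "IX": {"IS", "IX"},
    "S": {"IS", "S"},
    "X": set(),
}
INTENT = {"S": "IS", "X": "IX"}  # intention mode to place on ancestors


class LockManager:
    def __init__(self):
        self.held = {}  # node path (tuple) -> list of (txn, mode)

    def _can_grant(self, node, txn, mode):
        return all(
            other == txn or m in COMPATIBLE[mode]
            for other, m in self.held.get(node, [])
        )

    def lock(self, txn, path, mode):
        """Lock the node at `path` (root-to-node tuple) in mode "S" or "X"."""
        requests = [(path[:i], INTENT[mode]) for i in range(1, len(path))]
        requests.append((path, mode))
        if not all(self._can_grant(n, txn, m) for n, m in requests):
            return False  # a real system would block or queue instead
        for n, m in requests:
            self.held.setdefault(n, []).append((txn, m))
        return True


lm = LockManager()
granted_t1 = lm.lock("T1", ("bib", "book1", "title"), "X")
granted_t2 = lm.lock("T2", ("bib", "book2"), "S")
granted_t3 = lm.lock("T3", ("bib", "book1"), "S")
```

The scheme assumes every access walks down from the root. A transaction that jumps straight to a node through an IDREF or an index has not collected the ancestor locks, which is the gap the slide points at.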
Logging
  They claim ARIES needs some modifications – why?
  Their changes:
    Need to make subtree updates more efficient – don’t want to write a log entry for each subtree insertion
      Use (a copy of) the page itself as a means of tracking what was inserted, then batch-apply to WAL
    “Annihilators”: if we undo a tree creation, then we probably don’t need to worry about undoing later changes to that tree
    A few minor tweaks to minimize undo/redo when only one transaction touches a page
Annihilators
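The intuition behind annihilators, as described on the Logging slide, can be sketched with a toy undo planner: if the transaction being rolled back created a tree, undoing that creation makes it unnecessary to undo the transaction's later updates to the same tree. The record format and the `plan_undo` function are hypothetical and much simpler than a real ARIES-style log.

```python
from typing import List, NamedTuple


class LogRecord(NamedTuple):
    lsn: int
    op: str    # "create_tree" or "update" (assumed record kinds)
    tree: str  # identifier of the tree the record touches


def plan_undo(log: List[LogRecord]) -> List[LogRecord]:
    """Return the records of one transaction that still need undoing, newest first."""
    created = {r.tree for r in log if r.op == "create_tree"}
    plan = []
    for r in reversed(log):
        if r.op == "update" and r.tree in created:
            continue  # annihilated: undoing the creation removes the tree anyway
        plan.append(r)
    return plan


log = [
    LogRecord(1, "create_tree", "t1"),
    LogRecord(2, "update", "t1"),
    LogRecord(3, "update", "t0"),
    LogRecord(4, "update", "t1"),
]
undo_lsns = [r.lsn for r in plan_undo(log)]
```

Only the update to the pre-existing tree `t0` and the creation of `t1` are undone. The two updates inside `t1` are skipped.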
Assessment
  Native XML storage isn’t really all that different from other means of storage
    There are probably some good reasons to make a few tweaks in locking
    Optimization remains harder
  A real solution to materialized view creation would probably make RDBMSs come close to delivering the same performance, modulo locking
Next Time: “Streaming XML”
  An XQuery consists of a series of XPath expressions in the FOR/LET clauses, plus a WHERE condition and a RETURN constructor
  The FOR/LET clauses create bindings between variables and nodes (or node sets)
    We can consider a set of bindings to be a tuple
  So: can we build an XPath matcher that processes XML across the network, and produces tuple streams?
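As a first approximation of "bindings as tuples", the sketch below reads XML incrementally and emits one tuple per binding of a hard-coded query, roughly `FOR $b IN /bib/book, $t IN $b/title RETURN ($b/@year, $t)`. The query, the sample document, and the function `stream_bindings` are assumptions for illustration. It uses the standard library's `iterparse` rather than a real streaming XPath engine.

```python
import io
import xml.etree.ElementTree as ET
from typing import Iterator, Tuple


def stream_bindings(source) -> Iterator[Tuple[str, str]]:
    """Yield ($b/@year, $t text) tuples as each /bib/book element completes."""
    path = []  # tags of the currently open elements, root first
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
        else:
            if path == ["bib", "book"]:
                for title in elem.findall("title"):
                    yield (elem.get("year", ""), title.text or "")
                elem.clear()  # drop the finished book's children to bound memory
            path.pop()


DOC = (
    b"<bib>"
    b"<book year='2001'><title>A</title></book>"
    b"<book year='2003'><title>B</title></book>"
    b"</bib>"
)
tuples = list(stream_bindings(io.BytesIO(DOC)))
```

Any file-like object works as `source`, so the same generator could sit on top of a network stream and feed a tuple-at-a-time query operator.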