A Summary of XISS and Index Fabric Ho Wai Shing
Dec 31, 2015
Contents
Definition of Terms
XISS (Li and Moon, VLDB 2001)
  Numbering Scheme
  Indices Stored
  Join Algorithms
Index Fabric (Cooper et al., VLDB 2001)
  Patricia
  Balanced Trie
  Raw Path Index
Definition of Terms
Absolute Path Expression (APE):
a path that starts from the root; each step traverses the child axis or the attribute axis; no wildcards
e.g., /, /A/B, /A/@C
Definition of Terms
Regular Path Expression (RPE):
may or may not start from the root
may traverse different axes (restricted in this discussion to child, descendant-or-self, and attribute, since they are the most commonly used)
may contain wildcards
e.g., //, /A//C, /A/_/B, //A/B//C/D/@E
XISS
XISS = XML Indexing and Storage System
by Li and Moon, published in VLDB 2001 under the title “Indexing and Querying XML Data for Regular Path Expressions”
decomposes XML documents and stores them in indices
can answer regular path expressions
XISS - General Idea
solves an RPE by decomposing it into these 5 basic subexpressions:
  element retrieval
  attribute retrieval
  a step involving an element and an attribute
  a step involving two elements
  a Kleene closure of another subexpression
XISS - General Idea
each subexpression is solved by its own method:
  element index lookup
  attribute index lookup
  EA-join
  EE-join
  KC-join
XISS - General Idea
result lists from the subexpressions are joined to produce the final result
to make this decomposition and join efficient, a fast way to determine ancestor-descendant relationships is needed
XISS uses a numbering scheme based on an extended preorder traversal
XISS - Numbering Scheme
number all the nodes with an <order, size> tuple
order is assigned based on an extended preorder traversal
size can be thought of as the size of the subtree rooted at that node
XISS - Numbering Scheme
The rules for number assignment:
if x precedes y in the preorder traversal, x.order < y.order (preorder)
if x and y are siblings, either x.order + x.size < y.order or y.order + y.size < x.order (siblings won’t overlap)
if x is an ancestor of y, x.order < y.order <= x.order + x.size (ancestor contains descendant)
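The rules above reduce the ancestor test to two integer comparisons. A minimal Python sketch, assuming each node is represented only by its <order, size> pair; the `Node` class, the function names, and the sample numbers are illustrative and not taken from the paper:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    order: int  # position in the extended preorder traversal
    size: int   # reserved range covering the node's subtree


def is_ancestor(x: Node, y: Node) -> bool:
    """True if x is an ancestor of y: y.order falls inside x's range."""
    return x.order < y.order <= x.order + x.size


def siblings_disjoint(x: Node, y: Node) -> bool:
    """True if the ranges of x and y do not overlap (the sibling rule)."""
    return x.order + x.size < y.order or y.order + y.size < x.order


# Hypothetical numbering: root contains a and b; a contains c.
root, a, b, c = Node(1, 100), Node(10, 30), Node(41, 10), Node(17, 5)
```

With these sample values, `is_ancestor(a, c)` holds while `is_ancestor(b, c)` does not, and no tree traversal is needed to decide either case.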
XISS - Numbering Scheme
Actual assignment:
uses heuristics to reserve some “space” between orders
reserves extra space in the sizes for future node insertions
attributes are placed before sibling elements
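One way to picture the reserved space is a preorder numbering that leaves a fixed gap after every subtree. The sketch below is an assumption-laden illustration: the uniform `slack` parameter stands in for the heuristics mentioned above (which the slides do not spell out), the `(name, children)` tuple layout is hypothetical, and attributes are left out for brevity.

```python
def assign_numbers(tree, slack=2):
    """Return [(name, order, size)] in preorder for a (name, children) tree."""
    numbered = []

    def visit(node, order):
        name, children = node
        slot = len(numbered)
        numbered.append(None)          # placeholder, filled once size is known
        next_order = order + 1
        for child in children:
            next_order = visit(child, next_order)
        end = next_order - 1 + slack   # last order used, plus reserved room
        numbered[slot] = (name, order, end - order)
        return end + 1                 # first order free for the next sibling

    visit(tree, 1)
    return numbered


# Hypothetical document: A has children B and C; C has child D.
doc = ("A", [("B", []), ("C", [("D", [])])])
numbers = assign_numbers(doc)
```

Because every subtree ends with `slack` unused orders, a few nodes can later be inserted under any element without renumbering the rest of the document.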
XISS - Index Organization
There are 5 indices:
  Name Index
  Element Index
  Attribute Index
  Structure Index
  Value Table
XISS - Name Index
maps an element or attribute name to a name identifier (nid)
the nid represents that element or attribute in further query evaluation
reduces the time spent on string comparison in later index lookups
stored in a B+-tree
XISS - Name Index
(figure: Name -> B+-tree -> nid)
XISS - Value Table
stores all the string values of the XML document
(figure: a table mapping vid to value)
XISS - Element Index
input: nid, output: list of element records
implemented by a B+-tree
leaves point to lists of document IDs (did); each list entry points to a list of all elements with the same name in the same document
XISS - Element Index
(figure: nid -> B+-tree -> did list -> element lists; each element record holds <order, size>, Depth, ParentID)
XISS - Attribute Index
very similar to the Element Index
an attribute record always has a value identifier, vid
XISS - Structure Index
input: did, output: array containing all the elements and attributes in the document
implemented by a B+-tree
XISS - Structure Index
(figure: did -> B+-tree -> record array; each record holds nid, <order, size>, Parent order, Child order, Sibling order, Attribute order)
XISS - Indices
When to use which index?
first use the Name Index to find the nid of the element/attribute to be queried
search the Element/Attribute Index for the records
if we need values, look up the Value Table
use the Structure Index to rebuild or traverse the XML document tree
XISS - Join Algorithms
after getting the record lists from each subexpression, we need to find out which records are answers to the original query
e.g., to find /A/B, we get a record list of all A elements and another list of all B elements, and we have to find out which B’s satisfy A/B
XISS - Join Algorithms
Three join algorithms are proposed:
EA-join - merges an element record list and an attribute record list (solves A/@B)
EE-join - merges two element record lists (solves A/B or A//B)
KC-join - self-merges an element record list (solves (E)*)
XISS - EA-join
to solve E/@A
input: an element record list and an attribute record list
finds the attribute records that have parents in the element record list
the two lists are sorted by did and then by order
XISS - EA-join
2-stage sort-merge:
group by did first
then merge using order
output criterion: E is a parent of A
a single scan of both lists is enough
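The single-scan merge can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's code: the `Elem` and `Attr` record layouts are hypothetical, and `parent_order` is an assumed field mirroring the ParentID kept in element records. Both lists are assumed to be sorted by (did, order); because attributes are numbered right after their owning element, that also leaves the attribute list sorted by (did, parent_order).

```python
from typing import NamedTuple


class Elem(NamedTuple):
    did: int
    order: int
    size: int


class Attr(NamedTuple):
    did: int
    order: int
    parent_order: int  # assumed field: order of the owning element


def ea_join(elems, attrs):
    """Pair each attribute with its parent element in one pass over both lists."""
    out = []
    i = 0
    for a in attrs:
        key = (a.did, a.parent_order)
        # Advance the element cursor; it never moves backwards.
        while i < len(elems) and (elems[i].did, elems[i].order) < key:
            i += 1
        if i < len(elems) and (elems[i].did, elems[i].order) == key:
            out.append((elems[i], a))
    return out


elems = [Elem(1, 1, 10), Elem(1, 5, 3), Elem(2, 1, 4)]
attrs = [Attr(1, 2, 1), Attr(1, 6, 5), Attr(1, 7, 5), Attr(2, 3, 2)]
pairs = ea_join(elems, attrs)
```

The last attribute is dropped because its parent (order 2 in document 2) is not in the element list; the cursor `i` is not advanced on a match, so several attributes of the same element all pair with it.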
XISS - EE-join
to solve E/_*/E, e.g., E/E, E//E, E/_/E
input: two element record lists, E and F
output: pairs (e, f) where e is an ancestor of f
also uses a 2-stage sort-merge
however, the lists may need to be scanned multiple times (for special cases, e.g., the document has /A/A/B/B)
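The need for re-scanning shows up as soon as ancestors nest. A hedged Python sketch of an ancestor-descendant merge, assuming both lists are sorted by (did, order); the `Rec` layout and the function name are illustrative, and the parent-child variant (which would also compare depths) is omitted:

```python
from typing import NamedTuple


class Rec(NamedTuple):
    did: int
    order: int
    size: int


def ee_join(E, F):
    """Return (e, f) pairs where e is an ancestor of f in the same document."""
    out = []
    start = 0
    for e in E:
        # F records at or before e can match neither e nor any later e.
        while start < len(F) and (F[start].did, F[start].order) <= (e.did, e.order):
            start += 1
        # Scan forward from the shared start; nested ancestors re-read this range.
        j = start
        while j < len(F) and F[j].did == e.did and F[j].order <= e.order + e.size:
            out.append((e, F[j]))
            j += 1
    return out


# Hypothetical /A/A/B/B-style data: the second A is nested inside the first.
A = [Rec(1, 1, 10), Rec(1, 2, 5)]
B = [Rec(1, 3, 2), Rec(1, 4, 0), Rec(1, 9, 0), Rec(2, 3, 0)]
pairs = ee_join(A, B)
```

Here the B records at orders 3 and 4 are read once for each A, which is the repeated scanning the slide refers to; only the `start` cursor advances monotonically.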
XISS - KC-join
to solve the Kleene closure of a subexpression
input: a list of element records that fit the base case
repeatedly applies EE-join to the list, and stops when the result list no longer grows
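The "join until nothing new appears" loop is a fixpoint computation. The Python sketch below illustrates it for a single document, under assumptions that go beyond the slide: records carry a `depth` field (as element records do), one closure step means a parent-child link between two records of the list, and a nested loop stands in for the EE-join. All names are hypothetical.

```python
from typing import NamedTuple


class Rec(NamedTuple):
    order: int
    size: int
    depth: int


def is_parent(p, c):
    """Parent test: c lies inside p's range and is exactly one level deeper."""
    return p.order < c.order <= p.order + p.size and c.depth == p.depth + 1


def kc_join(records):
    """Return (a, d) pairs linked by a chain of parent-child steps within records."""
    # Base case: one step. A nested loop stands in for the EE-join here.
    base = {(p, c) for p in records for c in records if is_parent(p, c)}
    result = set(base)
    while True:
        # Extend every known chain by one more step.
        grown = {(a, c) for (a, b) in result for (b2, c) in base if b == b2}
        if grown <= result:   # no growth: fixpoint reached
            return result
        result |= grown


# Hypothetical list: a1/a2/a3 form a chain; a4 is unrelated; a5 is under a1
# at depth 3, but its parent is not in the list, so no chain reaches it.
a1, a2, a3 = Rec(1, 10, 1), Rec(2, 5, 2), Rec(3, 1, 3)
a4, a5 = Rec(20, 0, 1), Rec(9, 0, 3)
closure = kc_join([a1, a2, a3, a4, a5])
```

The loop terminates because the result set can only grow and is bounded by the number of record pairs.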
Index Fabric
by Cooper et al., published in VLDB 2001 under the title “A Fast Index for Semistructured Data”
has 2 subtypes: raw path index and refined path index
uses the Patricia technique to compress the index
Index Fabric - General Idea
it is a balanced, disk-based indexing structure built on Patricia
each data node is associated with a key string, and this string is stored in the trie index for retrieval
the layered approach to building the index ensures that the number of disk pages accessed per query is bounded
Index Fabric - General Idea
the raw path index answers absolute path queries
the refined path index answers any predefined queries
the difference is how the key is generated
Patricia
Patricia = Practical Algorithm To Retrieve Information Coded in Alphanumeric
by Morrison, in JACM 1968
a method to store and retrieve strings in a space-efficient way
binary, uses bit comparisons, has a “skip” in each internal node
Patricia
an example Patricia trie
(figure: internal nodes labeled 2, 5 and 4; edges labeled 0 and 1; leaves 101110, 101111, 110000, 110011)
Patricia
it’s basically a trie with single-child internal nodes removed
search is done by:
branching according to the value of the bit at the skip position
retrieving the string at the leaf
comparing it with the query string
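The three search steps can be sketched in Python over equal-length bit strings. This is an illustration, not Morrison's original formulation: the `Leaf` and `Inner` classes are hypothetical, bit positions are 0-indexed absolute positions (so the labels need not match those in the figure), and the builder assumes distinct keys of equal length.

```python
class Leaf:
    def __init__(self, key):
        self.key = key


class Inner:
    def __init__(self, bit, zero, one):
        self.bit = bit      # the only bit position examined at this node
        self.zero = zero    # subtree for keys with "0" at that position
        self.one = one      # subtree for keys with "1" at that position


def build(keys):
    """Build a Patricia-style trie from distinct, equal-length bit strings."""
    if len(keys) == 1:
        return Leaf(keys[0])
    # First position where the keys disagree; earlier positions are skipped.
    bit = next(i for i in range(len(keys[0])) if len({k[i] for k in keys}) > 1)
    zero = [k for k in keys if k[bit] == "0"]
    one = [k for k in keys if k[bit] == "1"]
    return Inner(bit, build(zero), build(one))


def search(node, query):
    """Branch on the tested bits only, then verify against the leaf's key."""
    while isinstance(node, Inner):
        if node.bit >= len(query):
            return False
        node = node.one if query[node.bit] == "1" else node.zero
    return node.key == query


trie = build(["101110", "101111", "110000", "110011"])
```

A query such as 101011 follows the same branches as 101111 because the skipped bits are never inspected, which is why the final comparison against the stored string is required.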
Index Fabric - Balanced Trie
the number of disk pages accessed per query is bounded by the number of layers in the layered index
the idea is similar to that of a B-tree: the Patricia trie is decomposed into blocks, and an upper-layer trie is used to navigate among the blocks
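Index Fabric - Balanced Trie
e.g. (figure: the example Patricia trie with leaves 101110, 101111, 110000, 110011 shown as Layer 0, and a smaller upper-layer trie with nodes labeled 2 and 1 shown as Layer 1)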
Index Fabric - Balanced Trie
There are 3 types of links in the balanced trie:
far link: across layers, a result of branching
near link: within the same block, a result of branching
direct link: across layers, the root nodes are the same
Each query accesses 1 block in each layer
Index Fabric - Balanced Trie
speeds up search by using traversals in the upper layers to skip nodes of the original trie
the number of pages accessed is bounded
Index Fabric - Raw Path
each data node is associated with a key
key = path (encoded in designators) + value
designators are special characters, each representing a name
APE queries are translated into key prefixes and submitted to the index trie
Index Fabric - Raw Path
Example:
<invoice><buyer><name>HKU</name></buyer></invoice> is translated to IBNHKU (the leading I, B and N are designators)
the query /invoice/buyer/name[“HKU”] is translated to the query string IBNHKU
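The key generation in this example can be sketched in Python. Everything named here is an assumption for illustration: the `DESIGNATORS` table is hypothetical (it uses plain letters as in the slide, where a real system would use special characters that cannot collide with data), and a list scan with `startswith` stands in for the trie lookup.

```python
import xml.etree.ElementTree as ET

# Hypothetical designator table: one character per element name.
DESIGNATORS = {"invoice": "I", "buyer": "B", "name": "N"}


def raw_path_keys(xml_text):
    """Return one key (designator path + value) per element that has text."""
    keys = []

    def walk(elem, prefix):
        prefix += DESIGNATORS[elem.tag]
        text = (elem.text or "").strip()
        if text:
            keys.append(prefix + text)
        for child in elem:
            walk(child, prefix)

    walk(ET.fromstring(xml_text), "")
    return keys


def query_prefix(path, value=""):
    """Translate an absolute path (and optional value) into a key prefix."""
    steps = [s for s in path.split("/") if s]
    return "".join(DESIGNATORS[s] for s in steps) + value


keys = raw_path_keys("<invoice><buyer><name>HKU</name></buyer></invoice>")
prefix = query_prefix("/invoice/buyer/name", "HKU")
matches = [k for k in keys if k.startswith(prefix)]  # stand-in for the trie search
```

Because the path is folded into the key, an absolute path query becomes a single prefix search rather than a sequence of joins.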
Index Fabric - Refined Path
special designators can be assigned to special queries (which can be regular path expressions)
e.g., we define P as the path //buyer/name, and PHKU means there is a buyer/name in the document that has the value HKU
can answer any predefined RPE very quickly
Comparison
XISS:
can solve general RPEs
solves an APE by dividing it into steps
Index Fabric:
RPEs are solved by compile-time expansion of the RPE or by using a predefined Refined Path Index
solves an APE with a single index lookup