A Summary of XISS and Index Fabric Ho Wai Shing. Contents Definition of Terms XISS (Li and Moon, VLDB2001) Numbering Scheme Indices Stored Join Algorithms.

A Summary of XISS and Index Fabric

Ho Wai Shing

Contents

Definition of Terms

XISS (Li and Moon, VLDB 2001): Numbering Scheme, Indices Stored, Join Algorithms

Index Fabric (Cooper et al., VLDB 2001): Patricia, Balanced Trie, Raw Path Index

Definition of Terms

Absolute Path Expression (APE): a path that starts from the root, where each step traverses the child axis or the attribute axis, and no wildcards are used

e.g., /, /A/B, /A/@C

Definition of Terms

Regular Path Expression (RPE): may or may not start from the root, and may traverse different axes (restricted here to child, descendant-or-self, and attribute, since they are the most commonly used ones)

may contain wildcards

e.g., //, /A//C, /A/_/B, //A/B//C/D/@E

XISS

XISS = XML Indexing and Storage System

by Li and Moon, published in VLDB 2001 under the title “Indexing and Querying XML Data for Regular Path Expressions”

decomposes XML documents and stores them in the indices

can answer regular path expressions

XISS - General Idea

solves an RPE by decomposing it into these 5 basic subexpressions:

element retrieval

attribute retrieval

a step involving an element and an attribute

a step involving two elements

a Kleene closure of another subexpression

XISS - General Idea

each subexpression is solved by its own method: element index lookup, attribute index lookup, EA-join, EE-join, KC-join
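The mapping from steps to methods can be illustrated with a small sketch. It is not the decomposition procedure from the paper: `plan_steps` is a hypothetical helper that assumes a left-to-right evaluation over child (/) and descendant (//) steps only, and it ignores wildcards and Kleene closures.

```python
import re

def plan_steps(path):
    """Split a simple path into steps and name a method for each step.

    Assumption (not from the slides): steps are evaluated left to right,
    and only named elements and attributes (@name) appear in the path.
    """
    steps = re.findall(r"(//|/)([^/]+)", path)
    plan = []
    for position, (axis, name) in enumerate(steps):
        is_attribute = name.startswith("@")
        if position == 0:
            # The first step only needs an index lookup.
            method = "attribute index lookup" if is_attribute else "element index lookup"
        else:
            # Later steps also need their own lookup; the join named here
            # combines that list with the result so far.
            method = "EA-join" if is_attribute else "EE-join"
        plan.append((axis, name, method))
    return plan

plan = plan_steps("//A/B//C/D/@E")
```

Here `plan` holds one entry per step, ending with an EA-join for the attribute step @E.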

XISS - General Idea

result lists from the subexpressions are joined to produce the final result

to make this decomposition and join efficient, an efficient method to determine ancestor-descendant relationships is needed

XISS uses an extended preorder based numbering scheme

XISS - Numbering Scheme

number all the nodes with an <order, size> tuple

order is assigned based on an extended preorder traversal

size can be thought of as the size of the subtree rooted at that node

XISS - Numbering Scheme

The rules for number assignment:

if x precedes y in the preorder traversal, x.order < y.order (preorder)

if x and y are siblings, either x.order + x.size < y.order or y.order + y.size < x.order (siblings won’t overlap)

if x is an ancestor of y, x.order < y.order <= x.order + x.size (ancestor contains descendant)
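The third rule turns the ancestor-descendant test into two integer comparisons. A minimal sketch, assuming a hypothetical `Node` record that carries only the <order, size> tuple, with made-up numbers:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    order: int
    size: int

def is_ancestor(x, y):
    """True if x is an ancestor of y under the <order, size> rules."""
    return x.order < y.order <= x.order + x.size

def siblings_disjoint(x, y):
    """True if the ranges of two siblings do not overlap."""
    return x.order + x.size < y.order or y.order + y.size < x.order

# Illustrative values only: a is the root, b and c are its children,
# and d is a child of b.
a = Node(order=1, size=100)
b = Node(order=10, size=30)
c = Node(order=41, size=10)
d = Node(order=25, size=5)
```

With these values, `is_ancestor(a, d)` and `is_ancestor(b, d)` hold, while `is_ancestor(c, d)` does not, without touching the document tree.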

XISS - Numbering Scheme

Actual Assignment:

uses heuristics to reserve some “space” between orders

reserves extra room in the sizes for future node insertions

attributes are placed before sibling elements
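One way to picture the assignment is a preorder walk that leaves unused numbers around every node. The sketch below is an assumption for illustration: it uses a fixed `gap` instead of the heuristics mentioned above, and `number_tree` and the `(name, children)` tuple format are hypothetical.

```python
def number_tree(root, gap=2):
    """Assign <order, size> in preorder, leaving `gap` unused orders
    before the first child and after every child (fixed gap: an assumption)."""
    labels = []

    def visit(node, order):
        name, children = node
        record = {"name": name, "order": order, "size": 0}
        labels.append(record)  # appended in preorder
        # Attributes (names starting with "@") go before sibling elements.
        ordered = sorted(children, key=lambda c: not c[0].startswith("@"))
        next_order = order + 1 + gap
        for child in ordered:
            child_end = visit(child, next_order)
            next_order = child_end + 1 + gap
        record["size"] = next_order - 1 - order
        return order + record["size"]  # last order covered by this node

    visit(root, 1)
    return labels

tree = ("A", [("B", [("C", [])]), ("@id", []), ("D", [])])
labels = number_tree(tree)
```

Because every leaf still gets a size equal to `gap`, a later insertion under it can take an order inside that range without renumbering its neighbours.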

XISS - Index Organization

There are 5 indices: Name Index, Element Index, Attribute Index, Structure Index, Value Table

XISS - Name Index

maps an element or attribute name to a name identifier (nid)

the nid represents that element or attribute in further query evaluation

reduces the time spent on string comparison in further index lookups

stored in a B+-tree

XISS - Name Index (figure): Name -> B+-tree -> nid

XISS - Value Table

stores all the string values of the XML document

(table columns: vid, value)

XISS - Element Index

input: nid, output: list of element records

implemented by a B+-tree

leaves are pointers to a list of document IDs (did); each list entry points to a list of all elements with the same name in the same document

XISS - Element Index (figure): nid -> B+-tree -> did list -> element list (one per document) -> element record: <order, size>, Depth, ParentID

XISS - Attribute Index

very similar to the Element Index

each attribute record always has a value identifier (vid)

XISS - Structure Index

input: did, output: an array containing all the elements and attributes in the document

implemented by a B+-tree

XISS - Structure Index (figure): did -> B+-tree -> record array; each record: nid, <order, size>, Parent order, Child order, Sibling order, Attribute order

XISS - Indices

When to use which index?

first use the Name Index to find the nid of the element/attribute to be queried

search the Element/Attribute Index for the records

if we need values, look up the Value Table

use the Structure Index to rebuild or traverse the XML document tree
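The lookup order above can be sketched with in-memory stand-ins. Everything here is hypothetical: plain dicts replace the B+-trees, the record fields follow the element record shown earlier, and the sample names and values are invented. The Structure Index is left out.

```python
# Hypothetical stand-ins for the indices (dicts instead of B+-trees).
name_index = {"book": 1, "title": 2, "year": 3}      # name -> nid
value_table = {1: "2001"}                            # vid -> value
element_index = {                                    # nid -> did -> records
    1: {7: [{"order": 1, "size": 20, "depth": 0, "parent": None}]},
    2: {7: [{"order": 5, "size": 3, "depth": 1, "parent": 1}]},
}
attribute_index = {                                  # nid -> did -> records
    3: {7: [{"order": 2, "size": 0, "depth": 1, "parent": 1, "vid": 1}]},
}

def lookup(index, name):
    """Name Index first, then the Element or Attribute Index."""
    nid = name_index.get(name)
    if nid is None:
        return []
    by_document = index.get(nid, {})
    # Return (did, record) pairs ordered by did.
    return [(did, rec) for did in sorted(by_document) for rec in by_document[did]]

def attribute_values(name):
    """Resolve each attribute record's vid through the Value Table."""
    return [value_table[rec["vid"]] for _, rec in lookup(attribute_index, name)]
```

For example, `lookup(element_index, "title")` yields the single title record of document 7, and `attribute_values("year")` follows its vid into the Value Table.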

XISS - Join Algorithms

After getting the record lists for each subexpression, we need to find out which records are answers to the original query

e.g., to answer /A/B, we fetch a record list of all A elements and another list of all B elements, and then have to find out which B’s are children of an A

XISS - Join Algorithms

Three join algorithms are proposed:

EA-join - merges an element record list and an attribute record list (solves A/@B)

EE-join - merges two element record lists (solves A/B or A//B)

KC-join - self-merges an element record list (solves (E)*)

XISS - EA-join

solves E/@A

input: an element record list and an attribute record list

finds the attribute records that have parents in the element record list

the two lists are sorted by did and then by order

XISS - EA-join

2-stage sort-merge: group by did first, then merge using order

output criterion: E is a parent of A

a single scan of both lists is enough
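A single-scan merge of this kind might look like the sketch below. It is an illustration, not the paper's pseudocode: the parent test compares a hypothetical `parent` field (the parent's order) with the element's order, and it relies on attributes being numbered before sibling elements so that the element pointer never has to move backwards.

```python
from typing import NamedTuple, Optional

class Rec(NamedTuple):
    did: int                # document id
    order: int              # order from the numbering scheme
    parent: Optional[int]   # order of the parent node (assumed field)

def ea_join(elements, attributes):
    """Merge two lists sorted by (did, order); emit (e, a) when e is a's parent."""
    result = []
    i = 0
    for a in attributes:
        key = (a.did, a.parent)
        # Advance past elements that come before a's parent.
        while i < len(elements) and (elements[i].did, elements[i].order) < key:
            i += 1
        # Do not advance on a match: the next attribute may share the parent.
        if i < len(elements) and (elements[i].did, elements[i].order) == key:
            result.append((elements[i], a))
    return result

elements = [Rec(1, 1, None), Rec(1, 10, 1), Rec(2, 1, None)]
attributes = [Rec(1, 2, 1), Rec(1, 11, 10), Rec(1, 30, 25), Rec(2, 2, 1)]
pairs = ea_join(elements, attributes)
```

The attribute with order 30 is dropped because its parent (order 25) is not in the element list.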

XISS - EE-join

solves E/_*/E, e.g., E/E, E//E, E/_/E

input: two element record lists, E and F

output: pairs (e, f) where e is an ancestor of f

also uses a 2-stage sort-merge

however, it may need to scan the lists multiple times (for special cases, e.g., the document has /A/A/B/B)
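The need for rescanning shows up as soon as ancestors nest. The following is a sketch under assumptions rather than the published algorithm: `ee_join` keeps a start pointer into F and rescans from it for every e, using the <order, size> containment test. A parent-child step would presumably add a depth comparison, which is omitted here.

```python
from typing import NamedTuple

class Elem(NamedTuple):
    did: int
    order: int
    size: int

def ee_join(E, F):
    """Both lists sorted by (did, order); emit (e, f) when e is an ancestor of f."""
    result = []
    start = 0
    for e in E:
        # Records at or before e can never be descendants of e or of later e's.
        while start < len(F) and (F[start].did, F[start].order) <= (e.did, e.order):
            start += 1
        # Scan forward from start; a nested e rescans the same f records.
        j = start
        while j < len(F) and F[j].did == e.did and F[j].order <= e.order + e.size:
            result.append((e, F[j]))
            j += 1
    return result

# Illustrative numbering for a document shaped like /A/A/B/B.
a1, a2 = Elem(1, 1, 10), Elem(1, 2, 8)
b1, b2 = Elem(1, 3, 5), Elem(1, 4, 2)
pairs = ee_join([a1, a2], [b1, b2])
```

Both B records are visited once for the outer A and again for the inner A, which is the multiple scanning noted above.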

XISS - KC-join

solves the Kleene closure of a subexpression

input: a list of element records that fit the base case

recursively applies EE-join to the list, and stops when the result list no longer grows
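The repeat-until-no-growth idea can be sketched as a fixpoint loop. This is an interpretation, not the paper's procedure: `kc_join` returns one list per repetition count, and a naive nested-loop parent test stands in for EE-join to keep the example short. The `depth` field and the sample numbers are assumptions.

```python
from typing import NamedTuple

class Element(NamedTuple):
    did: int
    order: int
    size: int
    depth: int

def is_parent(e, f):
    """Containment test plus a depth check (assumed parent-child criterion)."""
    return (e.did == f.did
            and e.order < f.order <= e.order + e.size
            and f.depth == e.depth + 1)

def kc_join(base):
    """levels[i] holds the elements that end a chain of i + 1 nested matches."""
    levels = [list(base)]
    frontier = list(base)
    while frontier:
        # Naive stand-in for EE-join between the previous level and the base list.
        frontier = [f for f in base if any(is_parent(e, f) for e in frontier)]
        if frontier:
            levels.append(frontier)
    return levels

a1 = Element(1, 1, 10, 1)
a2 = Element(1, 2, 8, 2)
a3 = Element(1, 3, 2, 3)
a4 = Element(1, 20, 0, 1)
levels = kc_join([a1, a2, a3, a4])
```

The loop ends once a pass produces no new level, which is bounded by the depth of the document tree.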

Index Fabric

by Cooper et al., published in VLDB 2001 under the title “A Fast Index for Semistructured Data”

has 2 subtypes: raw path index and refined path index

uses the Patricia technique to compress the index

Index Fabric - General Idea

it is a balanced, disk-based indexing structure built on Patricia

each data node is associated with a key string, and this string is stored in the trie index for retrieval

the layered approach to building the index bounds the number of disk pages accessed per query

Index Fabric - General Idea

the raw path index answers absolute path queries

the refined path index answers any predefined queries

the difference is how the key is generated

Patricia

Patricia = Practical Algorithm To Retrieve Information Coded in Alphanumeric

by Morrison, in JACM 1968

a method to store and retrieve strings in a space-efficient way

binary: uses bit comparisons and has a “skip” in each internal node

Patricia (figure): an example Patricia trie with internal nodes labeled 2, 5, and 4, edges labeled 0 and 1, and leaves 101110, 101111, 110000, 110011

Patricia

it is basically a trie with the internal nodes that have a single child removed

search is done as follows:

branch according to the value of the bit at the skip position

retrieve the string at the leaf

compare it with the query string
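The three search steps can be sketched over the four example strings. The node classes and the hand-built trie are hypothetical, and bit positions are 0-based in this sketch, so they need not match the labels in the figure.

```python
from dataclasses import dataclass

@dataclass
class Leaf:
    key: str

@dataclass
class Inner:
    bit: int        # 0-based position of the bit to test (sketch convention)
    zero: object    # subtree taken when the tested bit is 0
    one: object     # subtree taken when the tested bit is 1

def search(node, query):
    """Branch on the tested bits only, then verify against the stored string."""
    while isinstance(node, Inner):
        bit = query[node.bit] if node.bit < len(query) else "0"
        node = node.one if bit == "1" else node.zero
    # Skipped bits were never compared, so the final comparison is required.
    return node.key if node.key == query else None

root = Inner(
    bit=1,
    zero=Inner(bit=5, zero=Leaf("101110"), one=Leaf("101111")),
    one=Inner(bit=4, zero=Leaf("110000"), one=Leaf("110011")),
)
```

A query such as 100110 reaches the leaf 101110 because only two bits are tested on the way down; the final comparison is what rejects it.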

Index Fabric - Balanced Trie

The number of disk pages accessed per query is bounded by the number of layers in the layered index

The idea is similar to that of the B-tree: the Patricia trie is decomposed into blocks, and an upper-layer trie is used to navigate among the blocks

Index Fabric - Balanced Trie (figure): the example Patricia trie (internal nodes 2, 5, 4; leaves 101110, 101111, 110000, 110011) shown as Layer 0, with an upper-layer trie (Layer 1) above it

Index Fabric - Balanced Trie

There are 3 types of links in the balanced trie:

far link: across layers, a result of branching

near link: within the same block, a result of branching

direct link: across layers, the root nodes are the same

Each query accesses one block per layer

Index Fabric - Balanced Trie

increases speed by using traversals in the upper layers to skip nodes of the original trie

the number of pages accessed is bounded

Index Fabric - Raw Path

each data node is associated with a key

key = path (encoded in designators) + value

designators are special characters, each representing a name

APE queries are translated into key prefixes and submitted to the index trie

Index Fabric - Raw Path

Example:

<invoice><buyer><name>HKU</name></buyer></invoice> is translated to IBNHKU (I, B, and N, bolded and underlined on the slide, are designators)

the query /invoice/buyer/name[“HKU”] is translated to the query string IBNHKU
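The key construction in this example can be sketched as follows. The designator table and helper names are hypothetical, plain capital letters stand in for the special designator characters, and a sorted list searched with `bisect` replaces the trie; the address entry and the second name are invented sample data.

```python
import bisect

# Hypothetical designator table; real designators are special characters.
DESIGNATORS = {"invoice": "I", "buyer": "B", "name": "N", "address": "A"}

def encode_path(path):
    """Replace each step of an absolute path with its designator."""
    return "".join(DESIGNATORS[step] for step in path.strip("/").split("/"))

def make_key(path, value):
    """key = path (encoded in designators) + value"""
    return encode_path(path) + value

def prefix_search(sorted_keys, prefix):
    """Return all keys starting with prefix (sorted list stands in for the trie)."""
    matches = []
    start = bisect.bisect_left(sorted_keys, prefix)
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches

keys = sorted([
    make_key("/invoice/buyer/name", "HKU"),
    make_key("/invoice/buyer/name", "MIT"),
    make_key("/invoice/buyer/address", "Pokfulam"),
])
names = prefix_search(keys, encode_path("/invoice/buyer/name"))
```

The query /invoice/buyer/name becomes the prefix IBN, and adding the value HKU narrows it to the single key IBNHKU.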

Index Fabric - Refined Path

Special designators can be assigned to special queries (which can be regular path expressions)

e.g., if we define P as the path //buyer/name, then PHKU means that the document contains a buyer/name with value HKU

can answer any predefined RPE very quickly

Comparison

XISS: can solve general RPEs; solves an APE by dividing it into steps

Index Fabric: RPEs are solved by compile-time expansion of the RPE or by using a predefined Refined Path Index; solves an APE with a single index lookup
