Purifying XML Structures: Ph.D. Defense

Post on 14-Oct-2020


http://www.xerial.org/

Taro L. Saito
University of Tokyo
<leo@cb.k.u-tokyo.ac.jp>

Purifying XML Structures
Ph.D. Defense

Outline

• Introduction
  – XML and Structural Fluctuation
  – Amoeba Join
• Purifying XML Structures
  – Functional Dependencies for XML
  – Amoeba Join Decomposition
  – Ubiquitous Keys
• Implementation
  – Amoeba Join Processing Algorithms
  – XML Indexing
  – Experimental Results
• Conclusions
  – Applications
  – Summary of Contributions & Future Work

Introduction

• XML (Extensible Markup Language)
  – A markup language representing a tree structure
  – Since 1996, XML has been broadly used as a data representation format
• Major drawbacks
  – Hierarchical representation of data is too complex
    • for both humans and computer programs
    • reminiscent of the 1970s discussion
      – Relational vs. hierarchical DB
  – There exist many alternative tree structures
    • to represent the same data model

<bookstore>
  <order>
    <customer>John</customer>
    <book>
      <title>Data on the Web</title>
    </book>
  </order>
</bookstore>

[Figure: tree with root order, children customer and book, and title under book]

Structural Fluctuation

• Differently structured XML documents
  – representing the same data model, e.g. Amazon.com
    • for order, customer, book nodes
  – The hierarchical order of order and customer is reversed.
  – The order node is behind the pending node.

[Figure: two differently structured trees over order, customer, book, pending, and note nodes, one carrying the text "cancelled"]

Querying Structural Fluctuation

• Standards of XML processing: XPath, SAX, DOM, etc.
• Many parse states:
  – If we find an order, then parse customer and book
  – or if we first find a customer, then parse pending/order and book ...
  – Such query processing is tedious and error-prone!
• Why do we need different programs to parse XML data that has the same meaning?

[Figure: two differently structured trees over order, customer, book, pending, and note nodes, one carrying the text "cancelled"]

Structural Fluctuations

• In general, the number of structural fluctuations of n nodes is n^(n-1)
  – Enumeration of labeled trees of n nodes
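The n^(n-1) count matches the number of rooted labeled trees on n nodes. A minimal brute-force sketch in Python can confirm this for small n; the function names are illustrative and the approach (trying every parent assignment) is only practical for tiny inputs.

```python
from itertools import product

def reaches_root(node, parent, root, n):
    """Follow parent links from node; True if root is reached within n steps."""
    for _ in range(n):
        if node == root:
            return True
        node = parent[node]
    return node == root

def count_rooted_labeled_trees(n):
    """Count rooted trees on labeled nodes 0..n-1 by brute force."""
    total = 0
    for root in range(n):
        others = [v for v in range(n) if v != root]
        # Try every way of giving each non-root node a parent.
        for choice in product(range(n), repeat=len(others)):
            parent = dict(zip(others, choice))
            # Keep the assignment only if every node leads to the root (no cycles).
            if all(reaches_root(v, parent, root, n) for v in others):
                total += 1
    return total

counts = [count_rooted_labeled_trees(n) for n in range(1, 6)]
expected = [n ** (n - 1) for n in range(1, 6)]
```

For n = 3 this already gives 9 alternative trees for the same three nodes.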

Current Solution

• Disallow structural fluctuations by using a schema
  – XML Schema, DTD, RELAX NG, etc.
• However, fixing a tree structure involves work that is irrelevant to defining a data model.
  – Why do we have to choose only one tree structure?

Heuristic Approach

• SLCA (Smallest Lowest Common Ancestor)
  – [Li, VLDB2004], [Xu, SIGMOD2005]
  – An LCA node that does not contain other LCA nodes.
  – However, it easily leads to unintended results

[Figure: tree over order, customer, book, and data nodes, marking the slca of (customer, book)]

Amoeba Join

• Amoeba Join: AJ(order, customer, book)
  – [WebDB 2006]
  – retrieves node tuples such that
    • one of the (order, customer, book) nodes is a common ancestor of the others.
  – Handles every structural fluctuation

[Figure: two differently structured trees over order, customer, book, pending, and note nodes, one carrying the text "cancelled"; each amoeba is highlighted and its amoeba root is marked]
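The amoeba condition can be sketched as a brute-force check in Python. This sketch assumes every node carries a (start, end) interval label so that ancestry reduces to interval containment; the `Node` type, the sample tree, and the function names are hypothetical and are not taken from the Xerial implementation.

```python
from itertools import product
from typing import NamedTuple

class Node(NamedTuple):
    tag: str
    start: int  # position of the opening tag in document order
    end: int    # position of the closing tag

def contains(a, d):
    """True if a is d itself or an ancestor of d (interval containment)."""
    return a.start <= d.start and d.end <= a.end

def amoeba_join(nodes, tags):
    """Return tuples (one node per tag) in which some member contains all others."""
    domains = [[n for n in nodes if n.tag == t] for t in tags]
    return [
        combo
        for combo in product(*domains)
        if any(all(contains(root, member) for member in combo) for root in combo)
    ]

# Hypothetical document with two fluctuating structures:
#   order(customer, book)   and   customer(pending(order(book)))
NODES = [
    Node("order", 1, 6), Node("customer", 2, 3), Node("book", 4, 5),
    Node("customer", 7, 14), Node("pending", 8, 13),
    Node("order", 9, 12), Node("book", 10, 11),
]

amoebas = amoeba_join(NODES, ["order", "customer", "book"])
```

Both structures are returned by the same call: in the first the order node is the amoeba root, in the second the customer node is.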

Semantics of XML Structures

• Semantics implied in XML data
  – Each order node should have a single book node
    • Invalid structures might be retrieved if such semantics of the data are not considered.
  – Instances of such invalid structures could be numerous
• To represent the semantics of XML data, we introduce functional dependencies for XML

Functional Dependency (FD)

• Functional Dependency
  – X → Y: if two tuples p, q agree on X, then they also agree on Y

order  book  title
1      b1    Database Systems
2      b1    Database Systems
3      b2    Data on the Web

order  book
1      b1
2      b1
3      b2

book  title
b1    Database Systems
b2    Data on the Web

• FDs: order → book, book → title
• FD is generally used to avoid redundancy in data
  – Normal Form
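The FD definition and the decomposition of the example table can be checked with a short Python sketch. The rows are the ones on the slide; the helper name `holds` and the set-based join are illustrative choices.

```python
ROWS = [
    {"order": 1, "book": "b1", "title": "Database Systems"},
    {"order": 2, "book": "b1", "title": "Database Systems"},
    {"order": 3, "book": "b2", "title": "Data on the Web"},
]

def holds(rows, lhs, rhs):
    """Return True if the FD lhs -> rhs holds: rows agreeing on lhs agree on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        if seen.setdefault(key, value) != value:
            return False
    return True

# Decompose along book -> title, then join the two projections back on book.
order_book = {(r["order"], r["book"]) for r in ROWS}
book_title = {(r["book"], r["title"]) for r in ROWS}
rejoined = {(o, b, t) for (o, b) in order_book for (b2, t) in book_title if b == b2}
original = {(r["order"], r["book"], r["title"]) for r in ROWS}
```

The rejoined set equals the original relation, and the title "Database Systems" is stored once instead of twice.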

Data Modeling & FD

• FD has an essential role in data modeling
  – describes one-to-one, one-to-many, many-to-many relationships
    • e.g. ER (Entity-Relationship) diagram, UML (Unified Modeling Language)
• Example
  – order → book, order → customer
    • An order has a book. An order has a customer.
    • A book has many orders. A customer has many orders (one-to-many)
  – book → title, title → book
    • A book has a title; a title belongs to a book (one-to-one)
  – customer, book → order
    • An order connects many customers and books (many-to-many)

[Figure: ER diagram relating customer, order, book, and title, with cardinality labels 1, m, and n]

Functional Dependencies for XML

• Previous work on FDs for XML
  – [Buneman et al., WWW2001], [Arenas and Libkin, TODS2004]
  – based on fixed paths
    • Because there was no counterpart of relations (tables) in XML
  – e.g. /order → /order/book
    • Structural fluctuations are not allowed:
      – In reality, however, the path constraint that a book must be a child of an order is too strong.
  – Their definition has no lossless decomposition

Relation in XML

• Relation in XML allows a zigzag shape
• For an FD: book, customer → order
  – (book, customer, order) must be an amoeba

A set of FDs defines XML structures

• Traditional approach:
  – XML data (structured data) -> data model
• Our approach: data model (FD) -> XML structures
  – Allows various XML structures to describe a data model
  – Enhances the expressive power of XML databases

Amoeba Join Satisfying FDs

• AJ_F(order, book, customer)
  – retrieves a relation in XML satisfying a set F of FDs
• Makes it easier to manage multiple hierarchies of XML tree structures
  – An amoeba join AJ_F(order, book, customer) can track D2

Amoeba Join Decomposition

FD Based XML Query Processing

• No explicit path structures are required
• Examples:
  – FDs
    • book, customer → order
    • order → book
    • order → customer
  – A query for book and order nodes: AJ_F(book, order)
    • book and order nodes compose amoebas
  – A query for book and customer nodes:
    • AJ_F(book, customer)
      – book and customer nodes might be connected through order nodes
    • Thus, AJ_F(book, customer, order) is evaluated
• Relation in XML is dynamically determined according to query targets

[Figure: ER diagram relating customer, order, book, and title, with cardinality labels 1, m, and n]

Functional Dependencies and Keys

• A key is a special case of a functional dependency
  – e.g. order (id) → book, customer
    • order (id) is a key
• Using a relation in XML, we can define keys for XML
  – [order@id] → book, customer
    • Given an order id, we can uniquely determine book and title nodes
    • XML structures: <<order, book>>, <<order, customer>>
• More general description of keys
  – In [Buneman et al., WWW2001], it is not allowed to reverse the position of order and book nodes

order  book  customer
1      b1    John
2      b2    Lucy

Ubiquitous Keys

Querying without using Structures

• AJ(book, [pending, order, title])
  – book nodes are merged using ubiquitous keys

Amoeba Join Processing

Sweep Amoeba Join Algorithm

• Fetch all input nodes
  – AJ(org, manager, location)
  – Sort input nodes in document order.
• Sweep sorted input nodes
  – Assume the smallest node among the input nodes is an amoeba root.
  – Search its descendant region for components of amoebas

[Figure: XML tree over company, department, org, manager, location, and name nodes with the text values "Kyoto", "Tokyo", "David", and "Michael"; amoebas of org, manager, and location are highlighted]
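The two steps above can be sketched in Python as follows. This is a simplified in-memory reading of the sweep idea, not the algorithm as implemented in Xerial: it assumes nodes are (tag, start, end) tuples with properly nested intervals, and the sample tree and function name are hypothetical.

```python
from bisect import bisect_left
from itertools import product

def sweep_amoeba_join(nodes, tags):
    """Sweep input nodes in document order, treating each as an amoeba root."""
    # Fetch the input nodes and sort them by start position (document order).
    inputs = sorted((n for n in nodes if n[0] in tags), key=lambda n: n[1])
    starts = [n[1] for n in inputs]
    results = []
    for i, root in enumerate(inputs):
        # Descendants of root are the contiguous run of nodes starting before root ends.
        hi = bisect_left(starts, root[2])
        region = inputs[i + 1:hi]
        # The root fills its own tag; every other tag must be found in the region.
        domains = [
            [root] if t == root[0] else [d for d in region if d[0] == t]
            for t in tags
        ]
        results.extend(product(*domains))
    return results

# Hypothetical tree: company(org(manager(name), location), manager(org(location), name))
NODES = [
    ("company", 0, 17),
    ("org", 1, 8), ("manager", 2, 5), ("name", 3, 4), ("location", 6, 7),
    ("manager", 9, 16), ("org", 10, 13), ("location", 11, 12), ("name", 14, 15),
]

amoebas = sweep_amoeba_join(NODES, ["org", "manager", "location"])
```

Because the root of an amoeba is its member with the smallest start value, each amoeba is emitted exactly once, when the sweep reaches that member.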

Disk I/O Optimization

• AJ(org, manager, location = "Tokyo")
  – Choose pivot nodes from a small input domain
  – Traverse upward to find amoeba root candidates
  – Search space for amoebas is localized under the amoeba root candidates.

[Figure: XML tree over company, department, org, manager, location, and name nodes with the text values "Kyoto", "Tokyo", "David", and "Michael"; the location node with value "Tokyo" is labeled as the pivot, and org and manager nodes are labeled as amoeba root candidates]

XML Indexing

History of XML Indexing

• Hundreds of XML indexing papers ...
  – tailored to specific queries
    • XPath query, structural-join (A//D), twig-queries, text search, etc.
  – from many research areas
    • Database community
      – DataGuides (1997), 1-index (1999), XR-tree (2002), PathFinder (2006)
      – Node labeling (static or updatable)
        » Dewey order, ORDPATH (2004), BLAS (2004), Pbi (2005)
    • Information Retrieval (IR)
      – inverted indexes for text data, SLCA (2005)
    • Compressed index
      – XBW (Ferragina, WWW2006)

Multidimensional Aspects of XML

• Tree-structure index
  – Ancestor, descendant (subtree), sibling
• Path-structure index
  – Suffix-path (//headline/item)
• An XML index that can process both structures simultaneously is strongly required

Our Approach

• [DASFAA2007]
• Integrating tree-structure and path indexes
  – As a multidimensional index
    • (start, end, level, path)
  – It can be implemented on top of the B+-tree
• Why B+-trees?
  – Index structures and transaction management, recovery, logging, caching, etc. are interdependent.
  – We already have many transaction management techniques on B+-trees
    • Transaction management on R-trees is not seriously supported.
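As an illustration of the (start, end, level, path) coordinates, the following Python sketch labels every element of a small document. The numbering scheme (one counter tick per opening and closing tag) and the sample document are assumptions made for the example; the slides do not specify how Xerial assigns these values.

```python
import xml.etree.ElementTree as ET
from itertools import count

def label_nodes(xml_text):
    """Return one (start, end, level, path) tuple per element, in document order."""
    clock = count()
    rows = []

    def visit(elem, level, parent_path):
        path = f"{parent_path}/{elem.tag}"
        start = next(clock)      # tick when the element opens
        slot = len(rows)
        rows.append(None)        # reserve the slot to keep document order
        for child in elem:
            visit(child, level + 1, path)
        rows[slot] = (start, next(clock), level, path)  # tick when it closes

    visit(ET.fromstring(xml_text), 0, "")
    return rows

labels = label_nodes("<news><headline><item/></headline><item/></news>")
```

With these labels, a descendant test is interval containment on (start, end), a sibling test needs the level value, and a path query uses the path component.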

Inverted-Path Index

• Align inverted paths in lexicographical order
  – facilitates suffix-path queries
• Examples (suffix-path query range):
  – //item [6, 11)
  – //headline/item [6, 8)
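Why inverted paths turn a suffix-path query into a contiguous range can be seen in a small Python sketch. The path set below is hypothetical, so the resulting ranges differ from the [6, 11) and [6, 8) values on the slide, and the function names are illustrative.

```python
from bisect import bisect_left

def invert(path):
    """'/site/news/item' or '//news/item' -> ('item', 'news', ...)."""
    return tuple(reversed(path.strip("/").split("/")))

def build_index(paths):
    """Sort the distinct inverted paths lexicographically."""
    return sorted({invert(p) for p in paths})

def suffix_range(index, suffix_query):
    """Return the half-open range [lo, hi) of entries whose path ends with the suffix."""
    prefix = invert(suffix_query)
    lo = bisect_left(index, prefix)
    hi = lo
    # All entries sharing the inverted prefix sit next to each other.
    while hi < len(index) and index[hi][:len(prefix)] == prefix:
        hi += 1
    return lo, hi

INDEX = build_index([
    "/site", "/site/news", "/site/news/headline",
    "/site/news/headline/item", "/site/news/item", "/site/regions/item",
])

item_range = suffix_range(INDEX, "//item")
headline_item_range = suffix_range(INDEX, "//headline/item")
```

As on the slide, the range for the longer suffix is nested inside the range for the shorter one.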

Z-Order

• Align multidimensional points (nodes) in z-order
  – Interleave function gives z-order in the multidimensional space
• Each step in z-order splits slices into two
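A minimal sketch of a bit-interleaving function in Python, assuming non-negative integer coordinates of a fixed bit width. The dimension order and bit width are choices made for the example, not details given in the slides.

```python
def interleave(coords, bits):
    """Interleave the bits of the coordinates, most significant bit first."""
    z = 0
    for bit in range(bits - 1, -1, -1):
        for c in coords:
            # Append the current bit of each coordinate in turn.
            z = (z << 1) | ((c >> bit) & 1)
    return z

# Sorting 2-D points by their z-value visits them in z-order.
points = [(x, y) for x in range(2) for y in range(2)]
z_sorted = sorted(points, key=lambda p: interleave(p, bits=1))
```

Each additional bit pair halves the current slice in every dimension, and the resulting one-dimensional z-value can serve as a B+-tree key.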

Range Query

• Traverse the B+-tree in z-order

Experimental Results

Implementation

• Xerial
  – http://www.xerial.org/
  – XML Database Management System
    • XML data is multi-dimensionally indexed
    • supporting amoeba joins & XPath queries
  – Implemented in C++
    • about 150,000 lines of code
      – Query compiler & scheduler, query processing algorithms
      – Database indexing, XML processor, etc.
• Machine environment for experiments
  – Windows XP notebook
  – Pentium M 2GHz, 1GB main memory
  – 5,400 rpm HDD (100GB)

Database Size

• Data set: XMark benchmark XML document
• Xerial is space-efficient

Suffix-Path Query Performance

• Xerial & path-start index are fastest

Subtree Retrieval Performance

• All of the indexes show similar performance
  – XML nodes are sorted in the order of start values

Ancestor Retrieval

• The number of nodes preceding a context node affects the ancestor-query performance.

Sibling Query Performance

• Without indexes for level values, retrievals of sibling nodes are inefficient

Amoeba Join Performance

• Algorithm
  – QK: Quicker, SW: Sweep, BF: Brute Force
• Index
  – I: Index Scan, S: Sequential Scan
• The Quicker algorithm is fastest when we can localize search regions

Improvement by AJ Decomposition

• Without decomposing amoeba joins, the number of XML structures to be retrieved explodes.

Perspectives

Applications

• Our methods can be applied to various XML databases
• Examples of promising applications
• File systems
  – Represent files in XML format
    • reorganizing files and enriching their information with tags
• Bioinformatics
  – Reorganization of data is frequent
    • Statistical analysis (classification, transformation, cleansing, etc.)
  – Integration of various data sources is required

SCMD

• SCMD (Saccharomyces Cerevisiae Morphological Database)
  – [NAR04], [NAR05], [PNAS05]

Deep Copies of XML Data

[Figure: graph over cell, size, roundness, cluster, property, function, and gene nodes]

<cell>
  <size>…</size>
  <roundness>…</roundness>
  <cluster>
    <function>…</function>
    <property>…</property>
  </cluster>
</cell>

<cluster>
  <function>…</function>
  <property>…</property>
  <cell>
    <size> … </size>
    <roundness>..</roundness>
  </cell>
</cluster>

• Many duplicates (deep copies) of data

Shallow Copies of XML Data

• Graph-structured data model can be decomposed into several trees
• To connect nodes in trees, we need shallow copies of nodes.

[Figure: graph over cell, size, roundness, cluster, property, function, and gene nodes]

<cell id="1">
  <size>…</size>
  <roundness>…</roundness>
</cell>

<cluster>
  <function>…</function>
  <property>…</property>
  <cell id="1"/>
</cluster>

• With FD-based query processing
  – It becomes easier to manage shallow-copy representation of XML data
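To make the shallow-copy idea concrete, the Python sketch below resolves an empty `<cell id="1"/>` reference inside a cluster to the full cell definition that shares its id. This is only a plain id lookup for illustration, not the FD-based query processing described in the talk; the wrapping `<db>` element, the element values, and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """
<db>
  <cell id="1"><size>12</size><roundness>0.9</roundness></cell>
  <cluster><function>f</function><property>p</property><cell id="1"/></cluster>
</db>
"""

def resolve_shallow_copies(root, tag="cell"):
    """Pair each cluster with the full definition of every cell it references."""
    # Full definitions are the elements of this tag that have child elements.
    definitions = {e.get("id"): e for e in root.iter(tag) if len(e) > 0}
    pairs = []
    for cluster in root.iter("cluster"):
        for ref in cluster.findall(tag):
            pairs.append((cluster, definitions[ref.get("id")]))
    return pairs

pairs = resolve_shallow_copies(ET.fromstring(DOC))
cluster, cell = pairs[0]
```

The cell data is stored once, and the cluster reaches it through the shared id instead of a deep copy.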

Future Work

• Query optimization
  – Efficient amoeba join decomposition scheduling
  – Integration of index lookups and cost-based optimization
  – Indexes for amoeba structures
• More complex semantics
  – Ownership of nodes
  – Scope of attributes
• Updates of XML data
  – Detecting violations of FDs
  – Automatically constructing XML structures
    • from unstructured data

Our Contributions

• Amoeba Join
  – Tracks various XML structures
• Functional Dependency
  – defines XML structures of interest
  – Conceptual change: data model (FD) defines XML structures
• Amoeba Join Decomposition
  – speeds up FD-based query processing
• XML Indexing
  – A space-efficient XML indexing technique

Thank you!

This is the end of the presentation