Page 1: Purifying XML Structures (Ph.D. Defense)

Taro L. Saito
University of Tokyo
<[email protected]>

http://www.xerial.org/

Page 2: Outline

• Introduction
  – XML and Structural Fluctuation
  – Amoeba Join

• Purifying XML Structures
  – Functional Dependencies for XML
  – Amoeba Join Decomposition
  – Ubiquitous Keys

• Implementation
  – Amoeba Join Processing Algorithms
  – XML Indexing
  – Experimental Results

• Conclusions
  – Applications
  – Summary of Contributions & Future Work

Page 3: Introduction

• XML (Extensible Markup Language)
  – A markup language representing a tree structure
  – Since 1996, XML has been broadly used as a data representation format

• Major drawbacks
  – Hierarchical representation of data is too complex
    • for both humans and computer programs
    • reminiscent of the 1970s discussion
      – Relational vs. hierarchical DB
  – There exist many alternative tree structures
    • to represent the same data model

<bookstore>
  <order>
    <customer>John</customer>
    <book>
      <title>Data on the Web</title>
    </book>
  </order>
</bookstore>

[Figure: tree of the example above; order has children customer and book, and book has child title]

Page 4: Structural Fluctuation

• Differently structured XML documents
  – representing the same data model, e.g. Amazon.com
    • for order, customer, book nodes
  – The hierarchical order of order and customer is reversed.
  – The order node is behind the pending node.

[Figure: two example trees over order, customer and book nodes; the second tree also contains pending ("cancelled") and note nodes]

Page 5: Querying Structural Fluctuation

• Standards of XML processing: XPath, SAX, DOM, etc.
• Many parse states:
  – If we find an order, then parse customer and book
  – or if we first find a customer, then parse pending/order and book ...
  – Such query processing is tedious and error-prone!
• Why do we need different programs to parse XML data with the same meaning?

[Figure: the same two example trees over order, customer, book, pending ("cancelled") and note nodes]
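The "many parse states" problem can be illustrated with a small sketch. It assumes two hypothetical documents (DOC_A and DOC_B are made up for illustration, loosely following the two trees on the slide) and uses Python's standard xml.etree.ElementTree. Each function hard-codes one hierarchy and silently returns nothing on the other.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents holding the same order/customer/book data.
DOC_A = """
<bookstore>
  <order>
    <customer>John</customer>
    <book><title>Data on the Web</title></book>
  </order>
</bookstore>
"""

DOC_B = """
<bookstore>
  <customer>
    <name>John</name>
    <pending>
      <order>
        <book><title>Data on the Web</title></book>
      </order>
    </pending>
  </customer>
</bookstore>
"""

def pairs_order_first(root):
    """Works only when customer and book are children of order."""
    return [(o.findtext("customer"), o.findtext("book/title"))
            for o in root.iter("order")
            if o.find("customer") is not None]

def pairs_customer_first(root):
    """Works only when order sits under customer/pending."""
    return [(c.findtext("name"), o.findtext("book/title"))
            for c in root.iter("customer")
            for o in c.findall("pending/order")]

for doc in (DOC_A, DOC_B):
    root = ET.fromstring(doc)
    print(pairs_order_first(root), pairs_customer_first(root))
```

Every additional structural variant would need yet another hand-written traversal, which is the maintenance burden the slide points at.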

Page 6: Structural Fluctuations

• In general, the number of structural fluctuations of n nodes is n^(n-1)
  – Enumeration of labeled trees of n nodes
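The count n^(n-1) is the number of rooted labeled trees on n nodes (Cayley's formula for rooted trees). As a sanity check, the brute-force sketch below enumerates parent arrays for small n. The encoding as a parent array is an assumption made for illustration, not something taken from the slides.

```python
from itertools import product

def count_rooted_labeled_trees(n):
    """Count rooted trees on labeled nodes 0..n-1 by brute force.

    A rooted tree is encoded as a parent array: exactly one node (the root)
    has parent None, and following parents from any node reaches the root.
    """
    count = 0
    for parents in product([None] + list(range(n)), repeat=n):
        if sum(p is None for p in parents) != 1:
            continue
        ok = True
        for v in range(n):
            seen = set()
            while parents[v] is not None:   # walk up towards the root
                if v in seen:               # cycle: not a tree
                    ok = False
                    break
                seen.add(v)
                v = parents[v]
            if not ok:
                break
        count += ok
    return count

for n in range(1, 5):
    print(n, count_rooted_labeled_trees(n), n ** (n - 1))
```

The enumeration is exponential and only meant for tiny n; it simply confirms that the two columns agree.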

Page 7: Current Solution

• Disallow structural fluctuations by using a schema
  – XML Schema, DTD, RELAX NG, etc.
• However, fixing a tree structure involves work that is irrelevant to defining a data model.
  – Why do we have to choose only one tree structure?

Page 8: Heuristic Approach

• SLCA (Smallest Lowest Common Ancestor)
  – [Li, VLDB2004], [Xu, SIGMOD2005]
  – An LCA node that does not contain other LCA nodes.
  – However, it easily leads to unintended results

[Figure: example tree with data, order, customer and book nodes, marking the slca of (customer, book)]
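A brute-force sketch of the SLCA idea over tag names follows. It is not the algorithm of the cited papers, only a direct reading of the definition above: report an element whose subtree contains all requested tags while no child subtree does. The two sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

def slca(root, labels):
    """Return elements whose subtree contains every tag in `labels`
    while no child subtree does (a brute-force SLCA over tag names)."""
    wanted = set(labels)
    result = []

    def visit(elem):
        # Returns the wanted tags found in the subtree rooted at elem.
        found = {elem.tag} & wanted
        child_covers = False
        for child in elem:
            sub = visit(child)
            child_covers = child_covers or sub == wanted
            found |= sub
        if found == wanted and not child_covers:
            result.append(elem)
        return found

    visit(root)
    return result

# Intended case: customer and book meet under their order.
good = ET.fromstring("<data><order><customer/><book/></order><customer/></data>")
# Unintended case: an unrelated customer and book meet only at the root.
bad = ET.fromstring("<data><order><book/></order><customer/></data>")

print([e.tag for e in slca(good, ["customer", "book"])])
print([e.tag for e in slca(bad, ["customer", "book"])])
```

In the second document the heuristic pairs a customer with a book from an order that has no customer at all, which is one way the "unintended results" mentioned on the slide can arise.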

Page 9: Amoeba Join

• Amoeba Join: AJ(order, customer, book)
  – [WebDB 2006]
  – retrieves node tuples such that
    • one of the (order, customer, book) nodes is a common ancestor of the others.
  – Handles every structural fluctuation

[Figure: the two example trees, with each amoeba and its amoeba root marked]
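The definition above can be sketched as a brute-force check over all node combinations. This is only an illustration of the semantics, not the processing algorithm presented later in the talk; the function name amoeba_join and the sample documents are assumptions.

```python
import xml.etree.ElementTree as ET
from itertools import product

def amoeba_join(root, labels):
    """Return tuples (one node per label) in which one node is a
    common ancestor of all the others."""
    parent = {child: p for p in root.iter() for child in p}

    def is_ancestor(a, d):
        """True if a is a proper ancestor of d."""
        d = parent.get(d)
        while d is not None:
            if d is a:
                return True
            d = parent.get(d)
        return False

    domains = [list(root.iter(label)) for label in labels]
    result = []
    for combo in product(*domains):
        # Some member r must be an ancestor of (or identical to) every member.
        if any(all(r is n or is_ancestor(r, n) for n in combo) for r in combo):
            result.append(combo)
    return result

doc_a = ET.fromstring("<order><customer/><book/></order>")
doc_b = ET.fromstring("<customer><pending><order><book/></order></pending></customer>")
doc_c = ET.fromstring("<data><order><book/></order><customer/></data>")

labels = ["order", "customer", "book"]
for doc in (doc_a, doc_b, doc_c):
    print([tuple(n.tag for n in t) for t in amoeba_join(doc, labels)])
```

The same query returns one tuple for both hierarchies (rooted at order in doc_a and at customer in doc_b) and nothing for doc_c, where no queried node covers the other two.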

Page 10: Semantics of XML Structures

• Semantics implied in XML data
  – Each order node should have a single book node
    • Invalid structures might be retrieved if such semantics of the data are not considered.
  – Instances of such invalid structures could be numerous
• To represent the semantics of XML data, we introduce functional dependencies for XML

Page 11: Functional Dependency (FD)

• Functional Dependency
  – X → Y: if two tuples p, q agree on X, then they also agree on Y

order  book  title
1      b1    Database Systems
2      b1    Database Systems
3      b2    Data on the Web

order  book
1      b1
2      b1
3      b2

book  title
b1    Database Systems
b2    Data on the Web

• FDs: order → book, book → title
• FD is generally used to avoid redundancies of data
  – Normal Form
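The FD definition and the decomposition of the example table can be checked with a few lines of Python. The helper names holds, project and natural_join are made up for this sketch; the rows are the ones from the slide.

```python
ROWS = [
    {"order": 1, "book": "b1", "title": "Database Systems"},
    {"order": 2, "book": "b1", "title": "Database Systems"},
    {"order": 3, "book": "b2", "title": "Data on the Web"},
]

def holds(rows, lhs, rhs):
    """Check X -> Y: rows that agree on lhs must agree on rhs."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        if seen.setdefault(x, y) != y:
            return False
    return True

def project(rows, attrs):
    """Project rows onto attrs, dropping duplicate tuples."""
    out = []
    for row in rows:
        p = {a: row[a] for a in attrs}
        if p not in out:
            out.append(p)
    return out

def natural_join(r, s):
    """Join two non-empty relations on their shared attributes."""
    common = set(r[0]) & set(s[0])
    return [{**a, **b} for a in r for b in s
            if all(a[c] == b[c] for c in common)]

order_book = project(ROWS, ["order", "book"])
book_title = project(ROWS, ["book", "title"])
print(holds(ROWS, ["order"], ["book"]), holds(ROWS, ["book"], ["title"]))
print(natural_join(order_book, book_title) == ROWS)
```

Because book → title holds, joining the two projections reproduces the original table, while the title of b1 is stored only once.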

Page 12: Data Modeling & FD

• FD has an essential role in data modeling
  – describes one-to-one, one-to-many, many-to-many relationships
    • e.g. ER (Entity-Relationship) diagram, UML (Unified Modeling Language)
• Example
  – order → book, order → customer
    • An order has a book. An order has a customer.
    • A book has many orders. A customer has many orders (one-to-many)
  – book → title, title → book
    • A book has a title; a title belongs to a book (one-to-one)
  – customer, book → order
    • An order connects many customers and books (many-to-many)

[Figure: ER-style diagram of customer, order, book and title with cardinality labels 1, m and n]

Page 13: Functional Dependencies for XML

• Previous work on FDs for XML
  – [Buneman et al., WWW2001], [Arenas and Libkin, TODS2004]
  – based on fixed paths
    • Because there was no counterpart of relations (tables) in XML
  – e.g. /order → /order/book
    • Structural fluctuations are not allowed:
      – In reality, however, the constraint on the path, that a book must be a child of an order, is too strong.
  – Their definition has no lossless decomposition

Page 14: Relation in XML

• Relation in XML allows a zigzag shape
• For an FD: book, customer → order
  – (book, customer, order) must be an amoeba

Page 15: A Set of FDs Defines XML Structures

• Traditional approach:
  – XML data (structured data) -> Data model
• Our approach: Data model (FD) -> XML structures
  – Allows various XML structures to describe a data model
  – Enhances the expressive power of XML databases

Page 16: Amoeba Join Satisfying FDs

• AJ_F(order, book, customer)
  – retrieves a relation in XML satisfying a set F of FDs
• Makes it easier to manage multiple hierarchies of XML tree structures
  – An amoeba join AJ_F(order, book, customer) can track D2

Page 17: Amoeba Join Decomposition

Page 18: FD-Based XML Query Processing

• No explicit path structures are required
• Examples:
  – FDs
    • book, customer → order
    • order → book
    • order → customer
  – A query for book and order nodes: AJ_F(book, order)
    • book and order nodes compose amoebas
  – A query for book and customer nodes:
    • AJ_F(book, customer)
      – book and customer nodes might be connected through order nodes
    • Thus, AJ_F(book, customer, order) is evaluated
• Relation in XML is dynamically determined according to query targets

[Figure: ER-style diagram of customer, order, book and title with cardinality labels 1, m and n]

Page 19: Functional Dependencies and Keys

• A key is a special case of a functional dependency
  – e.g. order (id) → book, customer
    • order (id) is a key
• Using a relation in XML, we can define keys for XML
  – [order@id] → book, customer
    • Given an order id, we can uniquely determine book and title nodes
    • XML structures: <<order, book>>, <<order, customer>>
• More general description of keys
  – In [Buneman et al., WWW2001], it is not allowed to reverse the position of order and book nodes

order  book  customer
1      b1    John
2      b2    Lucy

Page 20: Ubiquitous Keys

Page 21: Querying without Using Structures

• AJ(book, [pending, order, title])
  – book nodes are merged using ubiquitous keys

Page 22: Amoeba Join Processing

Page 23: Sweep Amoeba Join Algorithm

• Fetch all input nodes
  – AJ(org, manager, location)
  – Sort input nodes in their document order.
• Sweep sorted input nodes
  – Assume the smallest node in the input nodes is an amoeba root.
  – Search its descendant region for components of amoebas

[Figure: example tree with company, department, org, manager, name ("David", "Michael") and location ("Kyoto", "Tokyo") nodes, with the amoebas of (org, manager, location) marked]
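The two steps above can be sketched as follows, under the assumption that every node carries an interval label (start, end) such that a is an ancestor of d exactly when a.start < d.start and d.end < a.end. The node triples and the function name sweep_amoeba_join are hypothetical, and this is a simplified reading of the slide rather than Xerial's implementation.

```python
from bisect import bisect_right
from itertools import product

def sweep_amoeba_join(nodes, labels):
    """nodes: (start, end, label) triples from an interval labeling.
    Returns tuples ordered like `labels`, each rooted at one of its members."""
    inputs = sorted(n for n in nodes if n[2] in labels)   # document order
    starts = [n[0] for n in inputs]
    result = []
    for i, root in enumerate(inputs):
        # Descendants of root are contiguous right after it in document order.
        j = bisect_right(starts, root[1])
        region = inputs[i + 1:j]
        domains = [[root] if l == root[2] else [n for n in region if n[2] == l]
                   for l in labels]
        result.extend(product(*domains))
    return result

# Hypothetical tree: company > { org > {manager, location},
#                                manager > org > location }
NODES = [
    (1, 14, "company"),
    (2, 7, "org"), (3, 4, "manager"), (5, 6, "location"),
    (8, 13, "manager"), (9, 12, "org"), (10, 11, "location"),
]

for t in sweep_amoeba_join(NODES, ["org", "manager", "location"]):
    print(t)
```

Each input node is tried once as an amoeba root, and only its descendant region (a contiguous slice of the sorted input) is searched, so unrelated parts of the document are never combined.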

Page 24: Disk I/O Optimization

• AJ(org, manager, location = "Tokyo")
  – Choose pivot nodes from a small input domain
  – Traverse upward to find amoeba root candidates
  – Search space for amoebas is localized under the amoeba root candidates.

[Figure: the example tree with company, department, org, manager, name and location nodes; a location node is marked as the pivot and an amoeba root candidate is marked above it]

Page 25: XML Indexing

Page 26: History of XML Indexing

• Hundreds of XML indexing papers ...
  – tailored to specific queries
    • XPath query, structural join (A//D), twig queries, text search, etc.
  – from many research areas
    • Database community
      – DataGuides (1997), 1-index (1999), XR-tree (2002), PathFinder (2006)
      – Node labeling (static or updatable)
        » Dewey order, ORDPATH (2004), BLAS (2004), Pbi (2005)
    • Information retrieval (IR)
      – inverted indexes for text data, SLCA (2005)
    • Compressed index
      – XBW (Ferragina, WWW2006)

Page 27: Multidimensional Aspects of XML

• Tree-structure index
  – Ancestor, descendant (subtree), sibling
• Path-structure index
  – Suffix-path (//headline/item)
• An XML index that can process both structures simultaneously is strongly required

Page 28: Our Approach

• [DASFAA2007]
• Integrating tree-structure and path indexes
  – As a multidimensional index
    • (start, end, level, path)
  – It can be implemented on top of the B+-tree
• Why B+-trees?
  – Index structures and transaction management, recovery, logging, caching, etc. are interdependent.
  – We already have many transaction management techniques on B+-trees
    • Transaction management on R-trees is not seriously supported.
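A minimal sketch of how a (start, end, level, path) label could be assigned to each element is shown below. The numbering scheme (one shared counter incremented on entering and on leaving an element) is an assumption for illustration; the actual encoding used in Xerial is not given in the slides.

```python
import xml.etree.ElementTree as ET

def label_nodes(root):
    """Assign (start, end, level, path) to every element by a depth-first walk."""
    labels = []
    counter = 0

    def visit(elem, level, path):
        nonlocal counter
        counter += 1                      # entering the element
        path = path + "/" + elem.tag
        entry = [counter, None, level, path]
        labels.append(entry)
        for child in elem:
            visit(child, level + 1, path)
        counter += 1                      # leaving the element
        entry[1] = counter

    visit(root, 1, "")
    return [tuple(e) for e in labels]

def is_ancestor(a, d):
    """Interval containment on (start, end) decides the ancestor relation."""
    return a[0] < d[0] and d[1] < a[1]

doc = ET.fromstring("<news><headline><item/></headline><item/></news>")
for label in label_nodes(doc):
    print(label)
```

With such labels, ancestor and descendant tests use start and end, sibling-style queries can use level, and path queries use the path component, which is why the four values are treated together as one multidimensional key.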

Page 29: Inverted-Path Index

• Align inverted paths in lexicographical order
  – facilitates suffix-path queries
• Examples (suffix-path query range):
  – //item [6, 11)
  – //headline/item [6, 8)
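The idea can be sketched with a sorted list standing in for the index: reversing each path turns a suffix-path query into a prefix search, which maps to one contiguous half-open range. The sample paths and the helper names are hypothetical, so the ranges printed here differ from the [6, 11) and [6, 8) of the slide's own data.

```python
from bisect import bisect_left

def invert(path):
    """'/news/headline/item' -> ('item', 'headline', 'news')"""
    return tuple(reversed(path.strip("/").split("/")))

def build_index(paths):
    """Sort inverted paths lexicographically, component by component."""
    return sorted(invert(p) for p in paths)

def suffix_range(index, suffix):
    """Half-open range [lo, hi) of index entries whose path ends with `suffix`."""
    prefix = invert(suffix)
    lo = bisect_left(index, prefix)
    # Smallest key that no longer starts with `prefix`: bump the last component.
    hi = bisect_left(index, prefix[:-1] + (prefix[-1] + "\0",))
    return lo, hi

PATHS = ["/news", "/news/headline", "/news/headline/item",
         "/news/body", "/news/body/item", "/news/body/items"]
index = build_index(PATHS)

print(suffix_range(index, "//item"))
print(suffix_range(index, "//headline/item"))
```

Comparing tuples of components rather than raw strings keeps a tag such as items from being mistaken for a match of item.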

Page 30: Z-Order

• Align multidimensional points (nodes) in z-order
  – The interleave function gives the z-order in the multidimensional space
• Each step in the z-order splits slices into two
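A common way to realize an interleave function is bit interleaving (a Morton code). The sketch below assumes non-negative integer coordinates with a fixed bit width; how Xerial encodes the path dimension into such a key is not shown in the slides.

```python
def interleave(coords, bits):
    """Interleave the low `bits` bits of each coordinate,
    most significant bit first, into a single z-order key."""
    z = 0
    for b in range(bits - 1, -1, -1):
        for c in coords:
            z = (z << 1) | ((c >> b) & 1)
    return z

# x = 3 (011), y = 5 (101) interleave to 01 10 11 in binary.
print(bin(interleave((3, 5), 3)))

# Sorting a 2x2 grid by key visits the points in z-order.
grid = [(x, y) for x in range(2) for y in range(2)]
print(sorted(grid, key=lambda p: interleave(p, 1)))
```

Since the result is a single integer, multidimensional points can be stored in an ordinary B+-tree keyed by it; each additional bit of the key halves the current slice along one dimension.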

Page 31: Range Query

• Traverse the B+-tree in z-order

Page 32: Experimental Results

Page 33: Implementation

• Xerial
  – http://www.xerial.org/
  – XML database management system
    • XML data is multi-dimensionally indexed
    • supporting amoeba joins & XPath queries
  – Implemented in C++
    • about 150,000 lines of code
      – Query compiler & scheduler, query processing algorithms
      – Database indexing, XML processor, etc.
• Machine environment for experiments
  – Windows XP notebook
  – Pentium M 2GHz, 1GB main memory
  – 5,400 rpm HDD (100GB)

Page 34: Database Size

• Data set: XMark benchmark XML document
• Xerial is space-efficient

Page 35: Suffix-Path Query Performance

• Xerial & path-start index are fastest

Page 36: Subtree Retrieval Performance

• All of the indexes show similar performance
  – XML nodes are sorted in the order of start values

Page 37: Ancestor Retrieval

• The number of nodes preceding a context node affects the ancestor-query performance.

Page 38: Sibling Query Performance

• Without indexes for level values, retrievals of sibling nodes are inefficient

Page 39: Amoeba Join Performance

• Algorithm
  – QK: Quicker, SW: Sweep, BF: Brute Force
• Index
  – I: Index Scan, S: Sequential Scan
• The Quicker algorithm is fastest when we can localize search regions

Page 40: Improvement by AJ Decomposition

• Without decomposing amoeba joins, the number of XML structures to be retrieved explodes.

Page 41: Perspectives

Page 42: Applications

• Our methods can be applied to various XML databases
• Examples of promising applications
• File systems
  – Represent files in XML format
    • reorganizing files and enriching their information with tags
• Bioinformatics
  – Reorganization of data is frequent
    • Statistical analysis (classification, transformation, cleansing, etc.)
  – Integration of various data sources is required

Page 43: SCMD

• SCMD (Saccharomyces Cerevisiae Morphological Database)
  – [NAR04], [NAR05], [PNAS05]

Page 44: Deep Copies of XML Data

[Figure: graph of cell, size, roundness, cluster, property, function and gene nodes]

<cell>
  <size>…</size>
  <roundness>…</roundness>
  <cluster>
    <function>…</function>
    <property>…</property>
  </cluster>
</cell>

<cluster>
  <function>…</function>
  <property>…</property>
  <cell>
    <size> … </size>
    <roundness>..</roundness>
  </cell>
</cluster>

• Many duplicates (deep copies) of data

Page 45: Shallow Copies of XML Data

• A graph-structured data model can be decomposed into several trees
• To connect nodes in trees, we need shallow copies of nodes.

[Figure: graph of cell, size, roundness, cluster, property, function and gene nodes]

<cell id="1">
  <size>…</size>
  <roundness>…</roundness>
</cell>

<cluster>
  <function>…</function>
  <property>…</property>
  <cell id="1"/>
</cluster>

• With FD-based query processing
  – It becomes easier to manage shallow-copy representations of XML data

Page 46: Future Work

• Query optimization
  – Efficient amoeba join decomposition scheduling
  – Integration of index lookups and cost-based optimization
  – Indexes for amoeba structures
• More complex semantics
  – Ownership of nodes
  – Scope of attributes
• Updates of XML data
  – Detecting violations of FDs
  – Automatically constructing XML structures
    • From unstructured data

Page 47: Our Contributions

• Amoeba Join
  – Tracks various XML structures
• Functional Dependency
  – defines XML structures of interest
  – Conceptual change: the data model (FD) defines XML structures
• Amoeba Join Decomposition
  – speeds up FD-based query processing
• XML Indexing
  – A space-efficient XML indexing technique

Page 48: Thank you!

This is the end of the presentation