Flexible Querying of XML Documents. Krishnaprasad Thirunarayan and Trivikram Immaneni, Department of Computer Science and Engineering, Wright State University, Dayton, OH-45435, USA

Jan 03, 2016


Blake Bishop

Flexible Querying of XML Documents

Krishnaprasad Thirunarayan and Trivikram Immaneni

Department of Computer Science and Engineering
Wright State University
Dayton, OH-45435, USA


Talk Outline

Goal (What?)

Background and Motivation (Why?)

Query Language and Examples (What?)

Implementation Details (How?)

Evaluation and Applications (Why?)

Conclusions


Goal


Develop a keyword-based XML query language and its semantics that is flexible, sufficiently expressive, and easy to use (for query formulation).

Implement it by reusing mature software components for efficient indexing and search.


Background and Motivation


XML vs Text Documents

DATA: Exploit the metadata/markup and aggregation structure implicit in XML documents, for expressiveness and precision.

QUERY: Obtain progressively improved extractions using convenient keyword-based queries, in contrast with accurate extractions using complex XML-based queries.


Relationship to Other Work

Extends XSEarch (Cohen et al.)

Expressive power: incorporates attributes and their values

Equivalence (e.g., RDF):

<T A="s"/> vs <T> <A> s </A> </T>


Invariance under Refinement:

<T> <A> word_1 and word_2 </A> </T>

vs

<T> <A> <B> word_1 </B> and <C> word_2 </C> </A> </T>


Coherence

Interconnectedness: Infer related pieces of information using the aggregation implicit in XML

Cohen et al.: Name equivalence (XSEarch)

Li et al.: Structural equivalence (Schema-free XML)

Guo et al.: Completeness (XRANK)


Information Retrieval

Explore a robust relevance ranking strategy to deal with high recall

Variation on TF-IDF

Naïve implementation is computationally prohibitive

Extension beyond "type-delimited document" is unclear


Query Language and Examples (What?)


Query Syntax

Entity-Attribute-Keyword search terms: e:a:k

Partial forms: e:a:, :a:k, e::k, e::, :a:, ::k

Signed/optional search terms: +e:a:k (signed, must be satisfied) vs e:a:k (optional)
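The term syntax can be read mechanically. The following is a minimal Python sketch of a parser for it, not the authors' implementation. The names `SearchTerm`, `parse_term`, and `parse_query` are hypothetical, and the sketch assumes that terms are separated by commas and that keywords contain neither `:` nor `,`.

```python
from typing import List, NamedTuple, Optional


class SearchTerm(NamedTuple):
    """One e:a:k search term; a missing component is stored as None."""
    entity: Optional[str]
    attribute: Optional[str]
    keyword: Optional[str]
    required: bool  # True when the term is signed with a leading '+'


def parse_term(text: str) -> SearchTerm:
    """Parse a single term such as '+e:a:k', ':a:k' or 'e::'."""
    text = text.strip()
    required = text.startswith("+")
    if required:
        text = text[1:].strip()
    parts = text.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected entity:attribute:keyword, got {text!r}")
    entity, attribute, keyword = (p.strip() or None for p in parts)
    if entity is None and attribute is None and keyword is None:
        raise ValueError("a search term needs at least one component")
    return SearchTerm(entity, attribute, keyword, required)


def parse_query(text: str) -> List[SearchTerm]:
    """Parse a comma-separated list of terms (separator is an assumption)."""
    return [parse_term(t) for t in text.split(",") if t.strip()]
```

For example, `parse_query("+author::, title::")` yields one signed term on the entity `author` and one optional term on the entity `title`.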


Single Search Term Satisfaction

e:a:k

The search term e:a:k is satisfied by a tree containing a subtree with the top element e that is associated with the attribute a with value containing k, or a subelement a with descendant text node containing k.
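The satisfaction condition can be sketched with Python's standard `xml.etree.ElementTree`. This is an illustration under stated assumptions, not the authors' Lucene-based implementation: "containing k" is taken as a case-sensitive substring test, "subelement a" as a direct child named a, and a missing entity (`None`) as a wildcard. The function names and the `<bib>` wrapper in the usage note are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional


def satisfies(elem: ET.Element, e: Optional[str], a: str, k: str) -> bool:
    """True if elem is a top element for the term e:a:k."""
    if e is not None and elem.tag != e:
        return False
    # Case 1: attribute a whose value contains k.
    if k in elem.get(a, ""):
        return True
    # Case 2: a child element a with some descendant text containing k.
    return any(k in "".join(child.itertext()) for child in elem.findall(a))


def satisfying_subtrees(root: ET.Element, e: Optional[str],
                        a: str, k: str) -> List[ET.Element]:
    """All elements in the tree (root included) that satisfy e:a:k."""
    return [elem for elem in root.iter() if satisfies(elem, e, a, k)]
```

Applied to a document that wraps `<author><name>Adam Dingle</name></author>` and `<author name="A. Dingle"></author>` in a `<bib>` element, `satisfying_subtrees(root, "author", "name", "Dingle")` returns both `author` elements, which illustrates the equivalence between an attribute and a subelement.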


Example (Mondial)

<country id="f0_149" name="Austria" capital="f0_1467" population="8023244"
         datacode="AU" total_area="83850" population_growth="0.41"
         infant_mortality="6.2" ... government="federal republic" ...>
  ...
</country>

:name:Vienna is satisfied by

<province id="f0_17447" name="Vienna" ...>
  <city id="f0_1467" country="f0_149" province="f0_17447" ...>
    <name>Vienna</name>
    <population year="94">1583000</population>
  </city>
</province>
...

name::Vienna is satisfied by a part of it:

<name>Vienna</name>


Example (Heterogeneity)

<author><name>Adam Dingle</name></author>

<author name="A. Dingle"></author>

<article id="3"> @inproceedings{IMN97, author="Adam Dingle and Ed MacNair and Thao Nguyen", … </article>

author:name:Dingle misses the last one.


Query Answer Candidate

A Query Answer Candidate for the query Q(t_1,t_2,...,t_m) is a most preferred satisfying collection of trees (P_1,P_2,...,P_m):

Precise: smallest enclosing

Adequate: optional search terms satisfied as much as possible


Query Answer

A Query Answer for the query Q(t_1,t_2,...,t_m) is a Query Answer Candidate (P_1,P_2,...,P_m) in which the trees P_i are interconnected.

Specifies trees related to the same "real-world" entity


Interconnectedness (Cohen et al)

Two subtrees T_a and T_b are said to be interconnected if the path from their roots to the lowest common ancestor does not contain two distinct nodes with the same element, or the only distinct nodes with the same element are these roots.
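This check can be sketched in Python with `xml.etree.ElementTree`, identifying each subtree by its root element. The sketch is a reading of the definition as worded on this slide, not XSEarch's code. All function names are hypothetical, and "same element" is taken to mean "same tag name".

```python
import xml.etree.ElementTree as ET
from collections import Counter


def parent_map(root):
    """Map every non-root element to its parent."""
    return {child: parent for parent in root.iter() for child in parent}


def ancestors_or_self(node, parents):
    """The chain node, parent(node), ..., document root."""
    chain = [node]
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return chain


def connecting_path(a, b, parents):
    """Nodes from a up to the lowest common ancestor, then down to b."""
    up_a = ancestors_or_self(a, parents)
    up_b = ancestors_or_self(b, parents)
    in_b = set(up_b)
    i = next(idx for idx, n in enumerate(up_a) if n in in_b)
    j = up_b.index(up_a[i])          # position of the LCA in b's chain
    return up_a[:i + 1] + up_b[:j][::-1]


def interconnected(a, b, parents):
    """True if no tag repeats on the path, except possibly at a and b only."""
    counts = Counter(n.tag for n in connecting_path(a, b, parents))
    repeated = [tag for tag, c in counts.items() if c > 1]
    if not repeated:
        return True
    # Allowed exception: the only repeated tag is shared by the two roots.
    return (a is not b and a.tag == b.tag
            and repeated == [a.tag] and counts[a.tag] == 2)
```

Under this reading, two authors of the same article are interconnected (only the two roots share a tag), while authors of two different articles are not, because the connecting path passes through two distinct `article` nodes.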


Interconnectedness (Li et al)

Two subtrees T_a and T_b are said to be interconnected if the path from T_a's root to their lowest common ancestor in the tree does not contain another node that is the lowest common ancestor of T_a and a distinct subtree T_b', where T_b' has the same root element label as T_b.
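A minimal Python sketch of this condition follows, again identifying subtrees by their root elements. It reads the definition literally from T_a's side, so a symmetric check would have to call it in both directions. It is not Li et al.'s algorithm. The function names are hypothetical, and `parents` is assumed to be a child-to-parent dictionary built as `{child: parent for parent in root.iter() for child in parent}`.

```python
import xml.etree.ElementTree as ET


def ancestors_or_self(node, parents):
    """The chain node, parent(node), ..., document root."""
    chain = [node]
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return chain


def lca_depth(a, b, parents):
    """Depth (document root = 0) of the lowest common ancestor of a and b."""
    up_a = ancestors_or_self(a, parents)
    in_b = set(ancestors_or_self(b, parents))
    for i, node in enumerate(up_a):
        if node in in_b:
            return len(up_a) - 1 - i
    raise ValueError("nodes are not in the same tree")


def interconnected_li(a, b, root, parents):
    """False if some other node labelled like b meets a strictly below LCA(a, b)."""
    base = lca_depth(a, b, parents)
    for other in root.iter(b.tag):
        if other is a or other is b:
            continue
        # Every LCA involving a lies on a's ancestor chain, so a greater depth
        # means a node strictly between a and LCA(a, b), or a itself.
        if lca_depth(a, other, parents) > base:
            return False
    return True
```

With a book that has its own title and authors plus a chapter with a title and authors, the chapter's title is not interconnected with a book-level author under this sketch, because a chapter-level author meets it at the lower `chapter` node.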


Interconnectedness (Two Approaches)


Implementation Details (How?)


Tools Used

Apache Lucene 2.0 APIs in Java

A high-performance text search engine library with smart indexing strategies.

Further tuned for memory-centric operation, in contrast with disk-centric defaults.

SAXParser APIs


Mapping to Lucene

XML documents to Lucene documents for indexing

XML keyword-based queries to Lucene queries for searching

ENCODING: An XML fragment of an XML document is referred to internally using the filename and the XPath (of the XML fragment's root from the XML document root).
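The encoding can be illustrated with a small Python sketch that assigns a (filename, XPath) pair to every element of a document. The slides do not give the exact XPath form, so the positional steps such as `/publications[1]/article[2]` are an assumption. The function name `fragment_ids` is hypothetical, and the sketch does not use Lucene.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def fragment_ids(xml_text: str, filename: str) -> List[Tuple[str, str]]:
    """Return (filename, positional XPath) for every element, in document order."""
    root = ET.fromstring(xml_text)
    ids: List[Tuple[str, str]] = []

    def walk(elem: ET.Element, path: str) -> None:
        ids.append((filename, path))
        seen = {}  # tag -> number of same-tag siblings visited so far
        for child in elem:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    walk(root, f"/{root.tag}[1]")
    return ids
```

Each pair identifies one fragment uniquely within a collection of files, so an index entry only needs to carry the pair rather than the fragment itself.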


Evaluation and Applications (Why?)


Experiments

DATASETS: Sigmod, Mondial, and DBLP.

PLATFORMS:

For Sigmod and Mondial datasets: HP xw9300 Workstation with 2 GHz AMD Opteron dual-core processor (270), 4 GB of main memory, and 250 GB 7200 rpm hard drive, running 32-bit Windows XP (java -Xms750M -Xmx1500M).

For DBLP dataset: SUN Ultra-40 Workstation with 2.4 GHz dual AMD Opteron dual-core processor (280), 8 GB of main memory, and 250 GB 7500 rpm hard drive, running 64-bit Solaris 10 (java -Xms1000M -Xmx3600M).


Dataset Sizes

DATASET   SIZE
Sigmod    468 KB
Mondial   1743 KB
DBLP      337 MB


Dataset Indexing via Lucene

DATASET   INDEXING TIME   INDEX SIZE
Sigmod    32 sec          6 MB
Mondial   180 sec         16 MB
DBLP      36 hrs          4 GB


Query Answer: Computation Time vs Display Time

(Each cell: computation time / display time)

DATASET   SIMPLE QUERY      COMPLEX QUERY
Sigmod    35 ms / 1 sec     400 ms / 3 min
Mondial   25 ms / 350 ms    1 sec / 2 min
DBLP      335 ms / 1 sec    ---


More Subtle Example

In the extended paper: A Coherent Keyword-Based XML Query Language


Pubs.xml

<publications>
  <book>
    <title>Modern Information Retrieval</title>
    <author>Ricardo Baeza-Yates</author>
    <author>Berthier Ribeiro-Neto</author>
    <chapter>
      <title>Digital Libraries</title>
      <author>Edward A. Fox</author>
      <author>Ohm Sornil</author>
    </chapter>
  </book>
  <article>
    <title>The Anatomy of a Large-Scale Hypertextual Web Search Engine</title>
    <author>Sergey Brin</author>
    <author>Lawrence Page</author>
  </article>
  <article>
    <title>An Algorithm for Suffix Stripping</title>
    <author>M.F.Porter</author>
  </article>
  <article>
    <title>Indexing by Latent Semantic Analysis</title>
  </article>
</publications>


Characteristics of Pubs.xml

Total number of authors = 7

Total number of titles = 5

Title Distribution = 1 book (with 1 chapter) + [3 articles]

Author Distribution = 2 (2) + [2 + 1 + 0]
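These counts can be checked directly against Pubs.xml. The Python sketch below parses the document with `xml.etree.ElementTree` and recomputes the totals and the per-container author distribution. The variable and function names are hypothetical, and the sketch only counts elements without evaluating any query semantics.

```python
import xml.etree.ElementTree as ET

PUBS = """
<publications>
  <book>
    <title>Modern Information Retrieval</title>
    <author>Ricardo Baeza-Yates</author>
    <author>Berthier Ribeiro-Neto</author>
    <chapter>
      <title>Digital Libraries</title>
      <author>Edward A. Fox</author>
      <author>Ohm Sornil</author>
    </chapter>
  </book>
  <article>
    <title>The Anatomy of a Large-Scale Hypertextual Web Search Engine</title>
    <author>Sergey Brin</author>
    <author>Lawrence Page</author>
  </article>
  <article>
    <title>An Algorithm for Suffix Stripping</title>
    <author>M.F.Porter</author>
  </article>
  <article>
    <title>Indexing by Latent Semantic Analysis</title>
  </article>
</publications>
"""

root = ET.fromstring(PUBS)

# Totals over the whole document.
n_authors = len(root.findall(".//author"))
n_titles = len(root.findall(".//title"))

# Unconstrained pairing of any author with any title.
n_pairs = n_authors * n_titles


def direct_authors(container):
    """Number of author elements that are direct children of container."""
    return len(container.findall("author"))


# Authors attached directly to each book, chapter and article, in document order.
distribution = [(c.tag, direct_authors(c))
                for c in root.iter()
                if c.tag in ("book", "chapter", "article")]
```

The gap between the unconstrained pairing count and the much smaller hit counts reported for the queries is what the interconnectedness requirement accounts for.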


Queries to Pubs.xml (Answer counts)

Arbitrary mix and match of authors and titles = 7 * 5 = 35

author::, title:: (8 hits)

+author::, +title:: (7 hits)

+author::, +title::, +author:: (4 hits)


Completeness (::pWord, ::qWord) (Guo et al.)


Conclusions


Developed declarative semantics for a keyword-based XML query language, with an effective query answering algorithm

Developed a notion of interconnectedness that provides coherent answers

Implemented using Lucene 2.0 APIs

Indexing: time and space intensive

But query answering: quick


THANK YOU!