XSEarch: A Semantic Search Engine for XML Sara Cohen Jonathan Mamou Yaron Kanza Yehoshua Sagiv Presented at VLDB 2003, Germany

Dec 21, 2015

Page 1:

XSEarch: A Semantic Search Engine for XML

Sara Cohen

Jonathan Mamou

Yaron Kanza

Yehoshua Sagiv

Presented at VLDB 2003, Germany

Page 2:

XSEarch: an XML Search Engine

Our Goal:

Find the “relevant” XML fragments, given tag names and keywords

Page 3:

Excerpt from the XML Version of DBLP

<proceedings>

<inproceedings>

<author>Moshe Y. Vardi</author>

<title>Querying Logical Databases</title>

</inproceedings>

<inproceedings>

<author>Victor Vianu</author>

<title>A Web Odyssey: From Codd to XML</title>

</inproceedings>

</proceedings>

Page 4:

A Search Example

Find papers by Vianu on the topic of “logical databases”

How can we find such papers?

Page 5:

Attempt 1: Standard Search Engine

Each document in the corpus is treated as an integral unit.

A document containing some of the three query terms is considered as a result.

Page 6:

This does not work!!!

<proceedings>
  <inproceedings>
    <author>Moshe Y. Vardi</author>
    <title>Querying Logical Databases</title>
  </inproceedings>
  <inproceedings>
    <author>Victor Vianu</author>
    <title>A Web Odyssey: From Codd to XML</title>
  </inproceedings>
</proceedings>

The document contains the three query terms. Hence, it is returned by a standard search engine. BUT the document is not relevant to the query: it does not contain any paper on “logical databases” by Vianu.

• The first inproceedings fragment (by Vardi) does not represent a paper by Vianu
• The second inproceedings fragment (“A Web Odyssey: From Codd to XML”) does not represent a paper about logical databases

Page 7:

Attempt 2: XML Query Language

FOR $i IN document("bib.xml")//inproceedings
WHERE $i/author contains 'Vianu'
  AND $i/title contains 'Logical'
  AND $i/title contains 'Databases'
RETURN <result>
         <author> $i/author </author>
         <title> $i/title </title>
       </result>

This does work, BUT

• Complicated syntax
• Extensive knowledge of the document structure required to write the query
• No mechanism for ranking results

Page 8:

Our Requirements from the Search Tool

• A simple syntax that can be used by naive users
• Search results should include XML fragments and not necessarily full documents
• The XML fragments in an answer should be semantically related
  – For example, a paper and an author should be in an answer only if the paper was written by this author
• Search results should be ranked
• Search results should be returned in “reasonable” time

Page 9:

Overall Architecture

[Figure: architecture diagram with the components User Interface, Query Processor, Ranker, Indexer, Indices, Index Repository and XML Files.]

Page 10:

Query Syntax and Semantics

[Figure: the overall architecture diagram, shown again for the User Interface and Query Processor components.]

Page 11:

XSEarch Query Syntax

• A query is a list of query terms
• A query term can be a
  – Keyword, e.g., database
  – Tag, e.g., inproceedings:
  – Tag-keyword combination, e.g., author:Vianu
• Optionally preceded by a ‘+’
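
The term syntax above can be illustrated with a small parser. This is only a sketch under stated assumptions: `QueryTerm` and `parse_query` are hypothetical names that do not come from XSEarch, and splitting the query on whitespace is an assumption about how terms are separated.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class QueryTerm:
    label: Optional[str]    # tag name, or None for a plain keyword
    keyword: Optional[str]  # keyword, or None for a tag-only term
    required: bool          # True when the term is preceded by '+'


def parse_query(query: str) -> List[QueryTerm]:
    """Split a query into terms of the form keyword, tag:, or tag:keyword."""
    terms = []
    for token in query.split():  # assumption: terms are whitespace-separated
        required = token.startswith("+")
        if required:
            token = token[1:]
        if ":" in token:
            label, _, keyword = token.partition(":")
            terms.append(QueryTerm(label or None, keyword or None, required))
        else:
            terms.append(QueryTerm(None, token, required))
    return terms
```

For instance, `parse_query("author:Vianu title:")` yields one tag-keyword term and one tag-only term, both optional.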

Page 12:

Example

• Find papers by Vianu on the topic of “logical databases”

logical +database inproceedings: author:Vianu

– logical: appearance of logical in the fragment increases the rank of this fragment
– +database: the keyword database must appear in the fragment
– inproceedings: appearance of the tag inproceedings in the fragment increases the rank of this fragment
– author:Vianu: appearance of Vianu under the tag author in the fragment increases the rank of this fragment

Note that the different document fragments matching these query terms must be “semantically related”

Page 13:

Semantic Relation: The Intuition

Page 14:

XSEarch: author:Vianu title:

<proceedings>

<inproceedings>

<author>Moshe Y. Vardi</author>

<title>Querying Logical Databases</title>

</inproceedings>

<inproceedings>

<author>Victor Vianu</author>

<title>A Web Odyssey: From Codd to XML</title>

</inproceedings>

</proceedings>

<title>A Web Odyssey: From Codd to XML</title>

<author>Victor Vianu</author>

Good Result!

title and author elements ARE semantically related

Page 15:

<proceedings>

<inproceedings>

<author>Moshe Y. Vardi</author>

<title>Querying Logical Databases</title>

</inproceedings>

<inproceedings>

<author>Victor Vianu</author>

<title>A Web Odyssey: From Codd to XML</title>

</inproceedings>

</proceedings>

<title>Querying Logical Databases</title>

<author>Victor Vianu</author>

Bad Result!

title and author elements ARE NOT semantically related

XSEarch: author:Vianu title:

Page 16:

Semantic Relation: Formalization

Page 17:

Data Model: Document Tree

[Figure: the document tree of the DBLP excerpt. On the slide, tags are colored in green and data is colored in red.]

proceedings
    inproceedings
        author
            Moshe Y. Vardi
        title
            Querying Logical Databases
    inproceedings
        author
            Victor Vianu
        title
            A Web Odyssey: From Codd to XML

GOAL:

Find pairs of semantically related titles and authors.

Page 18:

Relationship Trees

[Figure: the relationship tree of n1, n2, …, nk. It is the smallest subtree of the document that contains n1, n2, …, nk, and its root is the lowest common ancestor of n1, n2, …, nk.]

Page 19:

Our “Semantic Relation”: Interconnection

• n1,..., nk are strongly interconnected if the relationship tree of n1,..., nk does not contain two nodes with the same label
• n1,..., nk are interconnected if either
  – they are strongly interconnected, or
  – the only nodes with the same label in the relationship tree of n1,..., nk are among n1,..., nk
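
The two definitions above can be checked directly by building the relationship tree and inspecting its labels. The sketch below is an illustration only: the array-based tree encoding (`labels`, `parent`) and all function names are assumptions made for this example, and only element nodes are modeled.

```python
from collections import Counter


def path_to_root(parent, v):
    """Nodes from v up to the root, inclusive (the root's parent is -1)."""
    path = []
    while v != -1:
        path.append(v)
        v = parent[v]
    return path


def relationship_tree(parent, nodes):
    """Return (lca, node set) of the smallest subtree containing `nodes`."""
    paths = [path_to_root(parent, v) for v in nodes]
    common = set(paths[0]).intersection(*map(set, paths[1:]))
    lca = next(v for v in paths[0] if v in common)  # deepest common ancestor
    tree = set()
    for path in paths:
        for v in path:
            tree.add(v)
            if v == lca:
                break
    return lca, tree


def strongly_interconnected(labels, parent, nodes):
    _, tree = relationship_tree(parent, nodes)
    counts = Counter(labels[v] for v in tree)
    return all(c == 1 for c in counts.values())


def interconnected(labels, parent, nodes):
    _, tree = relationship_tree(parent, nodes)
    inner = [labels[v] for v in tree - set(nodes)]
    # Any repeated label must involve only the given nodes themselves.
    if len(set(inner)) != len(inner):
        return False
    return all(labels[v] not in inner for v in nodes)


# Element nodes of a small DBLP-like tree, numbered in document order.
labels = ["proceedings", "inproceedings", "author", "title",
          "inproceedings", "author", "title", "author"]
parent = [-1, 0, 1, 1, 0, 4, 4, 4]
```

With this encoding, nodes 5 and 6 (an author and a title of the same paper) are strongly interconnected, nodes 2 and 6 (an author and a title of different papers) are not interconnected, and nodes 5 and 7 (two authors of the same paper) are interconnected but not strongly interconnected.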

Page 20:

Example (1)

[Figure: the document tree of the DBLP excerpt, with two circled nodes and their relationship tree, rooted at the lowest common ancestor of the circled nodes.]

Circled nodes belong to different inproceedings entities, so their relationship tree contains two nodes labeled inproceedings. They ARE NOT strongly interconnected nor interconnected!

Page 21:

Example (2)

[Figure: the document tree of the DBLP excerpt, with two circled nodes and their relationship tree, rooted at the lowest common ancestor of the circled nodes.]

Circled nodes belong to the same inproceedings entity. They ARE strongly interconnected, thus, interconnected!

Page 22:

Example (3)

[Figure: a document tree in which the second inproceedings node has two author children (Victor Vianu and Serge Abiteboul) and the title “Queries and Computation on the Web”. The two author nodes are circled; their relationship tree is rooted at their lowest common ancestor.]

Circled nodes belong to the same inproceedings entity, but are labeled with the same tag.

They ARE interconnected, BUT NOT strongly interconnected!

These two author nodes ARE semantically related.

We can see the advantage of using interconnection rather than strong interconnection.

Page 23:

Interconnection

• Based on theoretical results of “Generating relations from XML documents”, S. Cohen, Y. Kanza, Y. Sagiv, ICDT 2003.
  – Three types of interconnection
• We have implemented two types of interconnection
• XSEarch can easily accommodate different types of interconnection, or other semantic relations between nodes

Page 24:

Checking Whether Two Nodes Are Interconnected

• Given a document T, it is possible to check whether nodes n and n’ are interconnected in O(|T|) time

• Too expensive to do it during query processing!

• During query processing, we need to check whether pairs of nodes are interconnected

Page 25:

Interconnection Index

• Is built offline
• Allows for checking interconnection between two nodes, during query processing, in O(1) time
• We have two implementations
  – as a hash table
  – as a symmetric matrix
• The Indexer is responsible for building the Interconnection Index
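
The two storage options can be sketched as follows. Both classes are hypothetical illustrations, not the XSEarch implementation: the names, the use of a Python set for the hash table, and the flat upper-triangle layout for the symmetric matrix are assumptions. Lookups are O(1) in both, on average for the hash table.

```python
class HashIndex:
    """Stores only the interconnected pairs, keyed by (smaller, larger) id."""

    def __init__(self):
        self._pairs = set()

    def add(self, a, b):
        self._pairs.add((min(a, b), max(a, b)))

    def interconnected(self, a, b):
        return (min(a, b), max(a, b)) in self._pairs


class MatrixIndex:
    """Symmetric matrix over n nodes; only the upper triangle is stored."""

    def __init__(self, n):
        self._n = n
        self._cells = bytearray(n * (n + 1) // 2)

    def _pos(self, a, b):
        if a > b:
            a, b = b, a
        # Rows 0..a-1 hold n, n-1, ..., n-a+1 cells; then offset inside row a.
        return a * self._n - a * (a - 1) // 2 + (b - a)

    def add(self, a, b):
        self._cells[self._pos(a, b)] = 1

    def interconnected(self, a, b):
        return self._cells[self._pos(a, b)] == 1
```

The matrix uses a fixed amount of memory that is quadratic in the number of nodes, while the hash table grows with the number of interconnected pairs that are stored.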

Page 26:

Indexer

[Figure: the overall architecture diagram, shown again for the Indexer component.]

Page 27:

Building the Interconnection Index: Naïve Approach

• For each pair of nodes, check whether this pair is interconnected
  – There are O(|T|^2) pairs
  – Checking interconnection is in O(|T|) time
• As a result, checking for interconnection of all pairs of nodes in T is in O(|T|^3) time

Too expensive also if it is done offline!!!

Page 28:

Building the Interconnection Index: Dynamic Programming Approach

• Idea: Checking whether two nodes are interconnected can be done by checking interconnection between their parents/children
• There are two characterizations of node interconnection
  – For pairs in which one node is an ancestor of the other
  – For pairs in which neither node is an ancestor of the other

Page 29:

Interconnection Characterization: n is an ancestor of n’

n and n’ are interconnected if and only if both of the following hold:

– the parent of n’ is strongly interconnected with n
– the child of n on the path to n’ is strongly interconnected with n’

[Figure: the path from n down to n’, shown once for each condition: from n to the parent of n’, and from the child of n to n’.]

Page 30:

Interconnection Characterization: n is not an ancestor of n’

n and n’ are interconnected if and only if both of the following hold:

– the parent of n’ is strongly interconnected with n
– the parent of n is strongly interconnected with n’

[Figure: n and n’ below their lowest common ancestor (lca), shown once for each condition: the parent of n together with n’, and the parent of n’ together with n.]

Page 31:

Building the Interconnection Index Using Dynamic Programming

• Theorem: Let T be a document. Then it is possible to determine interconnection of all pairs of nodes in T in O(|T|^2) time
• Proof hint:
  – Derive node numbers in T by a depth-first traversal of T
  – Compute the index using dynamic programming, based on the characterizations
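
One way to turn the two characterizations into a quadratic-time table computation is sketched below. It is an illustration under assumptions, not the authors' algorithm as published: the function name, the array encoding of the tree, and the extra recurrence for strong interconnection (the same two conditions plus the requirement that the two end nodes have different labels) are choices made for this example. Nodes are assumed to be numbered in depth-first preorder, as in the proof hint.

```python
def build_interconnection_index(labels, parent):
    """Return (strong, inter) tables, filled for every pair a <= b.

    Assumes preorder numbering: the root is 0, parent[0] == -1, and
    parent[v] < v for every other node v.
    """
    n = len(labels)
    # last[a] = largest preorder number inside the subtree rooted at a,
    # so a is an ancestor of b (a < b) exactly when last[a] >= b.
    last = list(range(n))
    for v in range(n - 1, 0, -1):
        last[parent[v]] = max(last[parent[v]], last[v])

    strong = [[False] * n for _ in range(n)]
    inter = [[False] * n for _ in range(n)]
    for v in range(n):
        strong[v][v] = inter[v][v] = True
        # Case 1: a is a proper ancestor of v. Walk up from v, keeping
        # track of the child of a that lies on the path to v.
        child, a = v, parent[v]
        while a != -1:
            ok = strong[a][parent[v]] and strong[child][v]
            inter[a][v] = ok
            strong[a][v] = ok and labels[a] != labels[v]
            child, a = a, parent[a]
        # Case 2: u < v and u is not an ancestor of v.
        for u in range(1, v):
            if last[u] >= v:
                continue  # ancestor pair, already handled in case 1
            pu, pv = parent[u], parent[v]
            ok = strong[pu][v] and strong[min(u, pv)][max(u, pv)]
            inter[u][v] = ok
            strong[u][v] = ok and labels[u] != labels[v]
    return strong, inter


# Element nodes of a small DBLP-like tree, in preorder.
dblp_labels = ["proceedings", "inproceedings", "author", "title",
               "inproceedings", "author", "title", "author"]
dblp_parent = [-1, 0, 1, 1, 0, 4, 4, 4]
strong_table, inter_table = build_interconnection_index(dblp_labels, dblp_parent)
```

Every table entry is computed from a constant number of earlier entries, so the total work is proportional to the number of pairs. In the example, `inter_table[5][7]` holds for the two authors of the same paper, while `inter_table[2][6]` does not hold for an author and a title of different papers.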

Page 32:

Query Processing

• Document fragments are extracted using the interconnection index and other indices

• Extracted fragments are returned ranked by the estimated relevance

Page 33:

Ranker

[Figure: the overall architecture diagram, shown again for the Ranker component.]

Page 34:

Ranking Factors

Several factors increase the rank of a result

• Similarity between query and result

• Weight of labels appearing in the result

• Characteristics of result tree

Page 35:

Query and Result Similarity

TFILF

– Extension of TFIDF, classical in IR
– Term Frequency: number of occurrences of a query term in a fragment
– Inverse Leaf Frequency: decreases with the fraction of leaves that contain the query term (number of leaves containing a query term divided by number of leaves in the corpus)

Page 36:

TFILF

• Term frequency of keyword k in a leaf node nl [formula not preserved in this transcript]
• Inverse leaf frequency [formula not preserved in this transcript]

TFILF is the product of tf and ilf
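
The exact tf and ilf formulas are not recoverable from this transcript, so the sketch below only illustrates the general shape of such a score. Two assumptions are borrowed from common TF-IDF variants and are not confirmed by the slides: tf is normalized by the count of the most frequent word in the leaf, and ilf is log(1 + N / N_k), where N is the number of leaves and N_k the number of leaves containing the keyword. All function names are hypothetical.

```python
import math
from collections import Counter


def tf(keyword, leaf_words):
    """Occurrences of keyword in a leaf, normalized by the most frequent word."""
    counts = Counter(leaf_words)
    if not counts:
        return 0.0
    return counts[keyword] / max(counts.values())


def ilf(keyword, leaves):
    """Assumed form: log(1 + total leaves / leaves containing the keyword)."""
    containing = sum(1 for words in leaves if keyword in words)
    if containing == 0:
        return 0.0  # convention chosen for this sketch
    return math.log(1 + len(leaves) / containing)


def tfilf(keyword, leaf_words, leaves):
    """Product of tf and ilf, as stated on the slide."""
    return tf(keyword, leaf_words) * ilf(keyword, leaves)


# Each leaf is represented by its list of lower-cased words.
example_leaves = [
    ["querying", "logical", "databases"],
    ["a", "web", "odyssey", "from", "codd", "to", "xml"],
]
```

A keyword that occurs in few leaves receives a larger ilf, so a match on it contributes more to the score than a match on a keyword that occurs in most leaves.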

Page 37:

Weight of Labels

• Some labels are considered more important than others
  – Text under an element labeled with title is more “important” than text under an element labeled with section
• Label weights can be
  – system generated
  – user defined

Page 38:

Relationship between Nodes

• Size of the relationship tree: a small fragment indicates that its nodes are closer, and thus, probably, “more related”

article: title:XML

[Figure: two candidate fragments.]

– article, title, XML (2 nodes in the relationship tree): this fragment will obtain a higher rank
– article, section, title, XML (3 nodes in the relationship tree)

Page 39:

Relationship between Nodes

• An ancestor-descendant relationship between a pair of nodes in a fragment indicates a “strong relation” between these nodes

section: title:XML

[Figure: two candidate fragments built from the nodes article, section, title and XML. In one of them the section node is an ancestor of the title node; this fragment will obtain a higher rank.]

Page 40:

Experimental Results

Page 41:

Hardware and Software Used

• Language: Java

• Processor: 1.6 GHz Pentium 4

• RAM: 2 GB (limited to 1.46 GB by JVM)

• OS: Windows XP

Page 42:

Choosing the Implementation for the Interconnection Index

• We experimented with the two implementations of the interconnection index
  1. IIH: the index is a hash table
  2. IIM: the index is a symmetric matrix
• We compare the two implementations
  – Cost of building the index
  – Cost of query processing, i.e., using the index

Page 43:

Time For Building Indices

XML corpus   Size (KB)   Number of nodes   IIM time (ms)   IIH time (ms)
Dream        146         3,360             29              36
Hamlet       281         6,635             114             185
Sigmod       704         21,246            1,552           1,729
Mondial      1,198       49,422            6,231           7,837

• IIM is better than IIH, because of the additional overhead of hashing
• Both implementations are reasonable

Page 44:

On the Fly Indexing (OFI)

• Fully building the indices as a preprocess of querying is expensive in memory for “huge” corpora!
  – Also expensive in time because of the additional overhead of using virtual memory
• Instead, compute the interconnection index incrementally, on the fly during query processing, for each pair that must be checked
  – By how much will query processing be slowed down?
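
The incremental idea can be sketched as a cache that is filled only for the pairs a query actually asks about. This is a hypothetical illustration: the slides do not say how each pair is computed on the fly, so the class name, the plain dictionary cache, and the pluggable `check` function are all assumptions.

```python
class OnTheFlyIndex:
    """Interconnection index that is filled lazily, one pair at a time."""

    def __init__(self, check):
        # check(a, b) -> bool is any direct interconnection test,
        # for example a linear-time check on the document tree.
        self._check = check
        self._cache = {}

    def interconnected(self, a, b):
        key = (a, b) if a <= b else (b, a)
        if key not in self._cache:
            self._cache[key] = self._check(*key)  # computed once, then reused
        return self._cache[key]

    def pairs_checked(self):
        return len(self._cache)
```

Memory grows with the number of distinct pairs that queries touch rather than with the number of all pairs in the corpus, at the price of a slower first lookup for each pair.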

Page 45:

Time For Building Indices: Comparing IIH, IIM, OFI

XML corpus   Size (KB)   Number of nodes   OFI time (ms)   IIH time (ms)   IIM time (ms)
Dream        146         3,360             0.6             36              29
Hamlet       281         6,635             1.1             185             114
Sigmod       704         21,246            2.2             1,729           1,552
Mondial      1,198       49,422            10.0            7,837           6,231

For these corpora, OFI time is at most 10 ms.

This is actually the time to build all the indices other than the interconnection index.

Page 46:

Query Execution Time

• We generated 1000 random queries for the Sigmod Record corpus
• Each query had:
  – At most 3 optional search terms
  – At most 3 required search terms
• We checked time with IIH, IIM and OFI

Page 47:

IIH/IIM: Query Processing Time

[Figure: number of queries versus query processing time in milliseconds, for IIH and IIM.]

• Note: Logarithmic scale
• Both approaches lead to similar results
• Average run time for queries: 35 ms

Page 48:

OFI: Query Processing Time

[Figure: histogram of number of queries (logarithmic scale, 1 to 1000) versus time (sec), with time ticks from 0.01 to 100000. Bar labels, in extraction order: 513, 144, 199, 96, 23, 17, 7, 1.]

• After processing the 1000 queries, 0.75% of all pairs of nodes were checked for interconnection
  – Space saved in main memory
• Slowdown in response time not too large!
  – Locality property: queries tend to be similar in the parts of the document that they may access
• More than 50% of the queries processed in under 10 ms

Page 49:

How Good are the Results?

• We measured recall & precision for the query:
  – Find papers written by Buneman that contain the keyword database in the title
• We tried two different queries that reflect different amounts of user knowledge
  – Kw: +Buneman +database (classical search engine query)
  – Tag-kw: +author:Buneman +title:database
• Corpus: Sigmod, DBLP

Page 50:

Precision and Recall

• We computed the "correct answers" using XQuery

• Recall = (correct returned answers) / (correct answers)

Perfect recall, i.e., XSEarch returns all the correct answers

• Precision at n = (correct answers in the first n returned answers) / n
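
The two measures can be computed as in the following sketch. The function names and the representation of answers as identifiers in a ranked list are assumptions made for illustration; the handling of an empty set of correct answers is a convention chosen here, not something the slides specify.

```python
def recall(returned, correct):
    """Fraction of the correct answers that were returned."""
    correct = set(correct)
    if not correct:
        return 0.0  # convention chosen for this sketch
    return len(correct & set(returned)) / len(correct)


def precision_at(returned, correct, n):
    """Fraction of the first n returned answers that are correct."""
    correct = set(correct)
    hits = sum(1 for answer in returned[:n] if answer in correct)
    return hits / n


# Hypothetical ranked result list and ground truth.
ranked = ["p1", "p2", "p3", "p4"]
truth = {"p1", "p3", "p4"}
```

Here `ranked` is ordered from the highest-ranked answer to the lowest, so precision at n only looks at the top of the list.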

Page 51:

Precision at 5, 10 and 20

[Figure: bar chart of precision (vertical axis from 0.5 to 1) for P@5:KW, P@10:KW, P@20:KW, P@5:KW+TAG, P@10:KW+TAG and P@20:KW+TAG, on the DBLP and Sigmod corpora.]

Sigmod: Perfect precision

DBLP: 0.8/0.9 for query containing only keywords

Combining tags and keywords leads to perfect precision

Page 52:

Conclusions

• Paradigm for querying XML combining IR and database techniques

• Returns semantically related fragments, ranked by estimated relevance

• Combining tags and keywords in the query leads to good results

Page 53:

Conclusions

• Efficient index structures
  – IIM/IIH for “small” documents
  – OFI for “big” documents
• Efficient evaluation algorithms
  – Dynamic programming algorithm for computing interconnection
• Extensible implementation
  – The system can easily accommodate different types of semantic relations between nodes, other than interconnection

Page 54:

Thank You.

Questions?