Efficient Keyword Search over Virtual XML Views Feng Shao and Lin Guo and Chavdar Botev and Anand Bhaskar and Muthiah Chettiar and Fan Yang Cornell University Jayavel Shanmugasundaram Yahoo! Research 2008. 02. 21. Summarized by Dongmin Shin, IDS Lab., Seoul National University Presented by Dongmin Shin, IDS Lab., Seoul National University


Dec 28, 2015


Efficient Keyword Search over Virtual XML Views

Feng Shao and Lin Guo and Chavdar Botev

and Anand Bhaskar and Muthiah Chettiar and Fan Yang

Cornell University

Jayavel Shanmugasundaram

Yahoo! Research

2008. 02. 21.

Summarized by Dongmin Shin, IDS Lab., Seoul National University

Presented by Dongmin Shin, IDS Lab., Seoul National University


Copyright 2007 by CEBT

Index

Introduction

Background

System Overview

QPT Generation Module

PDT Generation Module

Experiments

Conclusion and Future Work

Introduction

Fundamental assumption of traditional information retrieval systems

The set of documents being searched is materialized.

Introduction

But the view is often virtual (unmaterialized)

– The aggregator may not have the resources to materialize all the data

– If the view is materialized, the contents of the view may be out-of-date, or maintaining the view may be expensive

– The data sources may not wish to provide the entire data

Introduction

Examples

Personalized views: MyYahoo or Microsoft SharePoint

– There are many users, and their content often overlaps

– Materializing each user's view could lead to data duplication and the associated space overhead

Information integration

Introduction

Need: efficiently evaluating keyword search queries over virtual XML views

Background

XML Scoring: TF-IDF method

tf(e,k) : the number of distinct occurrences of the keyword k in element e and its descendants

idf(k) =

score(e,Q) =
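A minimal sketch of element-level TF-IDF scoring, under stated assumptions. Only the definition of tf(e,k) comes from the slide; the idf form (log of the number of view elements over the number containing the keyword) and the score as a sum of tf × idf are a common textbook variant assumed here, not necessarily the formulas used in the paper. The `Element` class and the toy data are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical in-memory XML element: a tag, its own text, and children."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def tf(e, k):
    """Occurrences of keyword k in element e and all of its descendants."""
    own = e.text.lower().split().count(k.lower())
    return own + sum(tf(c, k) for c in e.children)


def idf(k, elements):
    """Assumed idf: log(N / number of elements containing k); 0 if none do."""
    containing = sum(1 for e in elements if tf(e, k) > 0)
    return math.log(len(elements) / containing) if containing else 0.0


def score(e, query, elements):
    """Assumed score: sum over the query keywords of tf(e, k) * idf(k)."""
    return sum(tf(e, k) * idf(k, elements) for k in query)


# Toy view with two elements; only the first mentions "xml".
book1 = Element("book", children=[
    Element("title", "XML search"),
    Element("review", "great XML book"),
])
book2 = Element("book", children=[Element("title", "Databases")])
view = [book1, book2]

ranked = sorted(view, key=lambda e: score(e, ["xml"], view), reverse=True)
```

Because tf is aggregated over descendants, a keyword in a nested review still contributes to the score of the enclosing book element.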

System Overview

(1) Keyword queries are issued over virtual views

(2) The parser redirects the query to the Query Pattern Tree (QPT) Generation Module

(3) The QPT is sent to the Pruned Document Tree (PDT) Generation Module

(4) PDTs are generated using only the path indices and inverted list indices

(5) The rewritten query and the PDTs are sent to the Evaluator

(6) The Evaluator produces the view that contains all view elements with pruned content

(7) Elements are scored; only those with the highest scores are fully materialized using document storage
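Step (7) can be illustrated with a short sketch: score the pruned view elements, then touch document storage only for the top k. The function name `top_k_materialize`, the `fetch` callback, and the dictionary standing in for document storage are all hypothetical names chosen for illustration.

```python
import heapq


def top_k_materialize(scored, k, fetch):
    """Fully materialize only the k highest-scoring view elements.

    scored: iterable of (element_id, score) pairs for the pruned view elements
    fetch:  hypothetical callback that reads full content from document storage
    """
    best = heapq.nlargest(k, scored, key=lambda pair: pair[1])
    return [(element_id, s, fetch(element_id)) for element_id, s in best]


# Toy document storage keyed by element ID (an assumption for this sketch).
storage = {
    "1.1": "full content of book A",
    "1.2": "full content of book B",
    "1.3": "full content of book C",
}
fetched = []  # records which elements were actually read from storage


def fetch(element_id):
    fetched.append(element_id)
    return storage[element_id]


results = top_k_materialize([("1.1", 0.4), ("1.2", 2.1), ("1.3", 1.3)], 2, fetch)
```

Only the two best elements are read from storage; the lowest-scoring one is never fetched, which is the point of deferring materialization until after scoring.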

System Overview

XML Storage: Dewey IDs

– Popular id format

– Hierarchical numbering scheme

– ID of an element contains the ID of its parent
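The properties listed above can be sketched as follows, assuming Dewey IDs are written as dot-separated integers such as "1.2.3" (the textual encoding and the helper names are assumptions for illustration).

```python
def parse(dewey):
    """Split a Dewey ID such as '1.2.3' into a tuple of integers."""
    return tuple(int(part) for part in dewey.split("."))


def parent(dewey):
    """The parent's ID is the element's ID without its last component."""
    parts = parse(dewey)
    if len(parts) == 1:
        return None  # the root has no parent
    return ".".join(str(p) for p in parts[:-1])


def is_ancestor(a, d):
    """True if a is a proper ancestor of d, i.e. a is a component-wise prefix."""
    pa, pd = parse(a), parse(d)
    return len(pa) < len(pd) and pd[:len(pa)] == pa


# Sorting by the parsed tuple gives document order; plain string order would
# wrongly place "1.10" before "1.2".
in_document_order = sorted(["1.10", "1.2", "1", "1.2.1"], key=parse)
```

Comparing components as integers rather than comparing raw strings matters: "1.2" is not an ancestor of "1.20.3" even though it is a string prefix.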

System Overview

XML Indexing: Path indices

– Evaluate XML path and twig (i.e., branching path) queries

– Store XML paths with values in a relational table

– Use indices such as B+-tree

– One row for each unique (Path, Value) pair

– IDList : the list of ids of all elements on the path

– B+-tree index is built on the (Path, Value) pair
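A rough in-memory sketch of such a path index follows. It keeps one row per unique (Path, Value) pair with its IDList, and uses a sorted key list with `bisect` as a stand-in for the B+-tree; the class name, method names, and sample paths are assumptions for illustration.

```python
import bisect


class PathIndex:
    """One row per unique (path, value) pair, each with an IDList of Dewey IDs."""

    def __init__(self):
        self._keys = []  # sorted (path, value) keys; stands in for the B+-tree
        self._rows = {}  # (path, value) -> IDList

    def add(self, path, value, dewey_id):
        key = (path, value)
        if key not in self._rows:
            bisect.insort(self._keys, key)
            self._rows[key] = []
        self._rows[key].append(dewey_id)

    def lookup(self, path, value):
        """Exact-match lookup on the (path, value) pair."""
        return self._rows.get((path, value), [])

    def lookup_path(self, path):
        """All (value, IDList) rows for a path, via a range scan on sorted keys."""
        start = bisect.bisect_left(self._keys, (path,))
        rows = []
        for key in self._keys[start:]:
            if key[0] != path:
                break
            rows.append((key[1], self._rows[key]))
        return rows


idx = PathIndex()
idx.add("/books/book/isbn", "111", "1.1.1")
idx.add("/books/book/isbn", "222", "1.2.1")
idx.add("/books/book/year", "1996", "1.1.2")
idx.add("/books/book/year", "1996", "1.2.2")
```

Here `lookup` answers a path query with a value predicate, while `lookup_path` returns every value on a path together with the IDs of the elements that carry it.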

System Overview

Inverted list indices

– For each keyword in the document collection, store the list of XML elements that directly contain the keyword
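As a sketch, an inverted list index can be built from (Dewey ID, directly contained text) pairs. Storing a per-element tf next to each ID is an assumption made here so that scoring can later read tf values from the lists; the whitespace tokenization and the function name are also illustrative choices.

```python
from collections import Counter, defaultdict


def build_inverted_lists(elements):
    """Map each keyword to a list of (dewey_id, tf) for elements that directly contain it.

    elements: iterable of (dewey_id, direct_text) pairs in document order
    """
    lists = defaultdict(list)
    for dewey_id, text in elements:
        for word, count in Counter(text.lower().split()).items():
            lists[word].append((dewey_id, count))
    return lists


inverted = build_inverted_lists([
    ("1.1.1", "XML search"),
    ("1.1.2", "great XML book on XML"),
    ("1.2.1", "Databases"),
])
```

Because the input is read in document order, each keyword's list is also in document order, which is what allows several lists to be merged in a single pass.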

QPT(Query Pattern Tree) Generation Module

V : used for query evaluation

C : used for result materialization

PDT Generation Module

Output

– Only contains elements that correspond to nodes in the QPT

– Only contains element values that are required during query evaluation

Advantage

– Query evaluation is likely to be more efficient and scalable, since the PDT is much smaller than the underlying data

– Allows us to use the regular (unmodified) query evaluator, since the PDT is in regular XML format

PDT Generation Module

Key Idea

An element e in the document corresponding to a node n in the QPT is selected for inclusion only if it satisfies three types of constraints

(1) Ancestor constraint – an ancestor element of e that corresponds to the parent of n in the QPT should also be selected

(2) Descendant constraint – for each mandatory edge from n to a child of n in the QPT, at least one child/descendant element of e corresponding to that child of n should also be selected

(3) Predicate constraint – if e is a leaf node, it satisfies all predicates associated with n
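The three constraints can be illustrated with a naive recursive walk over an in-memory document. This is only a sketch of the selection rule: it handles parent/child edges only, matches nodes by tag, and does not use indices the way the PDT Generation Module does. The `QNode` and `Elem` classes and the sample data are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class QNode:
    """Hypothetical QPT node; children are (QNode, mandatory) pairs."""
    tag: str
    children: list = field(default_factory=list)
    predicate: Optional[Callable[[str], bool]] = None


@dataclass
class Elem:
    """Hypothetical document element with a tag, a value, and child elements."""
    tag: str
    value: str = ""
    children: list = field(default_factory=list)


def prune(e, n):
    """Return a pruned copy of e for QPT node n, or None if e is not selected."""
    if e.tag != n.tag:
        return None
    if not n.children:
        # Predicate constraint: a leaf must satisfy the predicate on n, if any.
        if n.predicate is not None and not n.predicate(e.value):
            return None
        return Elem(e.tag, e.value)
    kept = []
    for qchild, mandatory in n.children:
        matches = [p for p in (prune(c, qchild) for c in e.children) if p is not None]
        # Descendant constraint: every mandatory edge needs at least one match.
        if mandatory and not matches:
            return None
        kept.extend(matches)
    # Ancestor constraint: children are attached only when e itself is selected.
    return Elem(e.tag, e.value, kept)


qpt = QNode("books", [
    (QNode("book", [
        (QNode("year", predicate=lambda v: int(v) > 1995), True),
        (QNode("title"), False),
    ]), True),
])
doc = Elem("books", children=[
    Elem("book", children=[Elem("year", "1996"), Elem("title", "A")]),
    Elem("book", children=[Elem("year", "1990"), Elem("title", "B")]),
])
pdt = prune(doc, qpt)
```

In this toy run the second book is dropped because its year fails the predicate, which leaves the mandatory year edge of that book unsatisfied.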

PDT Generation Module

PrepareList

(1) Issues a lookup on path indices for each QPT node that has no mandatory child edges

(2) Identifies nodes that have a ‘v’ annotation to obtain values and ids

(3) Looks up the inverted list indices and retrieves the list of Dewey IDs containing the keywords, along with their tf values

PDT Generation Module

Candidate Tree(CT)

PDT Generation Module

Step 1 : adding new IDs

– Adds the current minimum IDs in pathLists
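The idea of repeatedly taking the current minimum ID across the pathLists can be sketched as a multi-way merge. This shows only the ordering in which IDs would be consumed, not the construction of the Candidate Tree itself; it assumes each list is already sorted in document order, and the function names and sample paths are hypothetical.

```python
import heapq


def dewey_key(dewey):
    """Sort key that orders Dewey IDs in document order."""
    return tuple(int(part) for part in dewey.split("."))


def ids_in_document_order(path_lists):
    """Yield (path, dewey_id) pairs, always taking the current minimum ID next.

    path_lists: dict mapping a QPT path to its Dewey IDs, sorted in document order
    """
    streams = [
        [(dewey_key(d), path, d) for d in ids]
        for path, ids in path_lists.items()
    ]
    for _, path, dewey_id in heapq.merge(*streams):
        yield path, dewey_id


path_lists = {
    "/books/book/isbn": ["1.1.1", "1.2.1"],
    "/books/book/year": ["1.1.2", "1.10.2"],
}
order = list(ids_in_document_order(path_lists))
```

Each list is read once from front to back, so the merge makes a single pass over the index results.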

PDT Generation Module

Step 2 : creating PDT nodes

– Create PDT nodes from CT nodes

– Top-down

– Check the DM value of each CT node: if it is “1”, create the node in the pdt cache

– If not, check the children of that node: if the DM value of a child node is “1”, create it in the pdt cache of the parent node

PDT Generation Module

Step 3 : removing CT nodes

– Bottom-up

– Check whether each node satisfies the ancestor constraints: if not, remove it; if so, propagate to the pdt cache of the ancestor

– If a node has no children and does not satisfy the descendant constraints, remove it

PDT Generation Module

– When we remove the root node “books”, all IDs in its pdt cache will be propagated to the result PDT

Experiments

500MB INEX dataset

Varying parameters

– Size of data, # of keywords, selectivity of keywords

– # of joins, join selectivity, level of nesting

– # of results, avg. size of view element

Four alternative approaches

– Baseline

– GTP : general solution to integrate structure and keyword search queries

– Efficient : proposed architecture

– Proj : techniques of projecting XML documents

Experiments

EFFICIENT is a scalable and efficient solution

– The cost of generating PDTs scales gracefully

– The overhead of post-processing (scoring and materializing) is negligible

– The cost of the query evaluator dominates the entire cost

Experiments

Run time for EFFICIENT increases slightly, because it accesses more inverted lists to retrieve tf values

Run time for EFFICIENT increases, because the cost of the query evaluation increases

Conclusion and Future Work

Conclusion

– A general technique for evaluating keyword search queries over views

– Efficient over a wide range of parameters

Future Work

– Instead of using the regular query evaluator, we could use the techniques proposed for ranked query evaluation

– Views may contain non-monotonic operators such as group-by
