XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November 22 2010 Presentation in TeleCom ParisTech

Mar 27, 2015

Transcript
Page 1: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

XML data management and approximate string matching

Jiaheng Lu

Key Lab of Data Engineering and Knowledge Engineering

Renmin University of China

November 22 2010

Presentation in TeleCom ParisTech

Page 2: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Research experience

Associate Professor, Renmin University of China: XML data management, cloud data management, approximate search

Post-doc, University of California, Irvine: data integration, approximate string matching

PhD, National University of Singapore: XML data management

Page 3: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Outline

XML data management XML twig query processing XML keyword search Graphical and interactive XML query processing

Approximate string matching Approximate string search Approximate member extraction

Page 4: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

XML twig query processing

XPath: Section[Title]/Paragraph//Figure

Twig pattern (tree figure): Section is the root, with child nodes Title and Paragraph; Figure is a descendant of Paragraph.

Page 5: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

XML twig query processing (Cont.)

Problem Statement

Given a query twig pattern Q and an XML database D, we need to compute ALL the answers to Q in D.

E.g. consider the following query and document:

Document (tree figure): elements s1, s2, t1, t2, p1, f1

Query (twig figure): Section, with branches title and figure

Query solutions: (s1, t1, f1), (s2, t2, f1), (s1, t2, f1)

Page 6: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Previous work: TwigStack

TwigStack [1] is a holistic algorithm for XML twig matching based on the containment labeling scheme. TwigStack has two steps:

(1) intermediate path solutions are output to match each query root-to-leaf path; and

(2) these intermediate path solutions are merged to get the final results.

[1] N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal xml pattern matching. In Proceedings of ACM SIGMOD, 2002.

Page 7: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Running example: TwigStack algorithm

Query (twig figure): s with two descendant branches, t and f (paths s//t and s//f)

Data streams, with containment labels (start, end, level):
s: (1,12,1), (4,11,2)
t: (2,3,2), (5,6,3)
f: (8,9,4)

State of stacks (figure):
s: (1,12,1), (4,11,2)
t: (2,3,2), (5,6,3)
f: (8,9,4)

Output path intermediate solutions:
s//t: (1,12,1) (2,3,2); (1,12,1) (5,6,3); (4,11,2) (5,6,3)
s//f: (1,12,1) (8,9,4); (4,11,2) (8,9,4)

Final results:
(1,12,1) (2,3,2) (8,9,4)
(1,12,1) (5,6,3) (8,9,4)
(4,11,2) (5,6,3) (8,9,4)
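The two steps of the running example can be reproduced with a few lines of Python. This is only a minimal sketch: it uses naive nested loops rather than TwigStack's stack-based processing, and the function names are hypothetical. The labels are the (start, end, level) triples from the example.

```python
# Containment labels (start, end, level) copied from the running example.
S = [(1, 12, 1), (4, 11, 2)]
T = [(2, 3, 2), (5, 6, 3)]
F = [(8, 9, 4)]

def is_ancestor(a, d):
    """a is an ancestor of d iff a's interval strictly contains d's interval."""
    return a[0] < d[0] and d[1] < a[1]

def path_solutions(ancestors, descendants):
    """Step 1 (naive version): all ancestor-descendant pairs for one query path."""
    return [(a, d) for a in ancestors for d in descendants if is_ancestor(a, d)]

def merge(st, sf):
    """Step 2: join the two lists of path solutions on the shared s element."""
    return [(s, t, f) for (s, t) in st for (s2, f) in sf if s == s2]

st = path_solutions(S, T)   # solutions for s//t
sf = path_solutions(S, F)   # solutions for s//f
final = merge(st, sf)
for row in final:
    print(row)
```

The sketch yields the same three final results as the slide; the real algorithm obtains them in a single pass over the sorted streams.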

Page 8: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Limitations of TwigStack

(1) TwigStack may output many useless intermediate results for queries with parent-child relationships.

(2) TwigStack cannot process XML twig queries with ordered predicates, such as the “preceding” and “following” axes in XPath.

(3) TwigStack cannot answer queries with wildcards in branching nodes.

E.g. for a twig whose branching node is the wildcard *, with branches B and C: the parent of B should be an ancestor of C.

Page 9: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

XML twig query processing (Cont.)

Several efficient pattern matching algorithms: TJFast (VLDB 05), iTwigJoin (SIGMOD 05), TwigStackList (CIKM 04), TreeMatch (TKDE 10)

Page 10: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Motivation: new labeling scheme

TwigStackList and iTwigJoin are both based on the containment labeling scheme.

Why not try the Dewey labeling scheme for XML twig pattern queries?

Oh, it is really a novel idea!

Page 11: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Original Dewey Labeling Scheme

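In the Dewey labeling scheme, each element is represented by an integer sequence:
(i) the root is labeled by the empty string ε;
(ii) for a non-root element u, label(u) = label(s).x, where u is the x-th child of s.

For example (tree figure): the root s1 is labeled ε; its children are labeled 1, 2 and 3; the children of the element labeled 2 are labeled 2.1 and 2.2. (Elements in the figure: s1, s2, t1, t2, f1, f2.)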
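A minimal sketch of how original Dewey labels could be assigned by a recursive traversal. The tree shape below is an assumption (the slide's figure does not survive extraction), chosen only so that it produces the labels ε, 1, 2, 3, 2.1 and 2.2; the function name is hypothetical.

```python
def dewey_labels(tree, label=""):
    """tree = (name, [children]). Returns {name: label}; the root gets "" (ε)."""
    name, children = tree
    labels = {name: label}
    for x, child in enumerate(children, start=1):
        # label(u) = label(s).x, where u is the x-th child of s
        child_label = f"{label}.{x}" if label else str(x)
        labels.update(dewey_labels(child, child_label))
    return labels

# Hypothetical placement of the six elements named on the slide.
doc = ("s1", [("t1", []), ("s2", [("t2", []), ("f1", [])]), ("f2", [])])
labels = dewey_labels(doc)
print(labels)
```

With this scheme a label identifies the position of an element, but not the tag names along its path, which is the limitation the next slides address.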

Page 12: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Main problem of the original Dewey

If we use the original Dewey labeling scheme to answer the twig query, we need to read the labels for all query nodes. Thus, this is not a better solution than previous algorithms.

Extend the original Dewey labeling scheme so that given the label of any element e, we can know the path of e from this label alone

Page 13: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Modular function

We need to know some schema information: a DTD (Document Type Definition) or an XML schema.

Given DTD information: book → author, title, chapter*

Our solution: using a modular function, we create a match between an element tag and an integer number. We define

X_author mod 3 = 0, X_title mod 3 = 1, X_chapter mod 3 = 2,

where X_t is the last integer of the label of an element with tag t.

Example (tree figure): book is labeled ε; its children author, title, chapter, chapter are labeled 0, 1, 2, 5.

Why is the second chapter labeled 5 and not 3, as in the original Dewey scheme? (The modulus 3 is the number of distinct tags under book.)

Page 14: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Derive element tag

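From a label, we can derive its tag name.

book → author, title, chapter*

Recall that we define: X_author mod 3 = 0, X_title mod 3 = 1, X_chapter mod 3 = 2.

Example (tree figure): book is labeled ε and its four children are labeled 0, 1, 2 and 5. Their tags, shown as “?” in the figure, can be derived from the labels: author, title, chapter, chapter.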

Page 15: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

More examples for assigning labels

Let us consider a more complicated DTD: a → (b | c)*, d?, c+

We define: X_b mod 3 = 0, X_c mod 3 = 1, X_d mod 3 = 2

(Why do we use mod 3 instead of 4?)

Example (tree figure): the children b, d, c, c of an a element are labeled 0, 2, 4, 7.
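The label assignment and the tag derivation of the last three slides can be sketched as follows. The rule used here (take the smallest integer larger than the left sibling's number that has the right remainder) is an assumption inferred from the two examples, not a quotation of the published algorithm; the function names are hypothetical.

```python
def assign_labels(child_tags, tag_order):
    """Assign the last label component to each child, left to right.

    tag_order lists the distinct child tags of the parent; a tag at
    position k gets numbers x with x mod len(tag_order) == k.
    """
    n = len(tag_order)
    numbers, prev = [], -1
    for tag in child_tags:
        k = tag_order.index(tag)
        x = prev + 1
        while x % n != k:      # smallest x > prev with the right remainder
            x += 1
        numbers.append(x)
        prev = x
    return numbers

def derive_tag(x, tag_order):
    """Recover the tag name from the last label component alone."""
    return tag_order[x % len(tag_order)]

book_children = assign_labels(["author", "title", "chapter", "chapter"],
                              ["author", "title", "chapter"])
a_children = assign_labels(["b", "d", "c", "c"], ["b", "c", "d"])
print(book_children, a_children)
print([derive_tag(x, ["author", "title", "chapter"]) for x in book_children])
```

Under this rule the modulus is the number of distinct child tags (3 for both DTDs), not the number of positions in the content model.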

Page 16: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Derive the path from a label

By following a finite state transducer (FST), we may recursively derive the whole path from any extended Dewey label.

For example:

DTD:
book → author, title, chapter*
chapter → (paragraph | section)*
section → (paragraph | section)*

Document (tree figure): a book with author, title, chapter, section and paragraph elements.

FST (figure), transitions listed by parent state:
book: mod 3 = 0 → author; mod 3 = 1 → title; mod 3 = 2 → chapter
chapter: mod 2 = 0 → paragraph; mod 2 = 1 → section
section: mod 2 = 0 → paragraph; mod 2 = 1 → section

Question: Given a label 5.1.0, what is the corresponding path?

Page 17: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Derive the path from a label

By following a finite state transducer (FST), we may recursively derive the whole path from any extended Dewey label.

For example:

DTD:
book → author, title, chapter*
chapter → (paragraph | section)*
section → (paragraph | section)*

Document (tree figure): a book with author, title, chapter, section and paragraph elements.

FST (figure), transitions listed by parent state:
book: mod 3 = 0 → author; mod 3 = 1 → title; mod 3 = 2 → chapter
chapter: mod 2 = 0 → paragraph; mod 2 = 1 → section
section: mod 2 = 0 → paragraph; mod 2 = 1 → section

Following the red path in the FST figure, we get that 5.1.0 denotes:

book/chapter/section/paragraph
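The path derivation can be sketched in Python as a table-driven transducer. The dictionary below encodes the FST of this example (child tags in DTD order for each parent tag); the function name is hypothetical.

```python
# FST for the example DTD: for each parent tag, the child tags in DTD order.
# A label component x selects the child tag at position x mod (number of tags).
FST = {
    "book": ["author", "title", "chapter"],
    "chapter": ["paragraph", "section"],
    "section": ["paragraph", "section"],
}

def derive_path(label, root="book"):
    """Derive the root-to-element tag path from an extended Dewey label."""
    path = [root]
    if label:                       # the empty label denotes the root itself
        for part in label.split("."):
            children = FST[path[-1]]
            path.append(children[int(part) % len(children)])
    return "/".join(path)

print(derive_path("5.1.0"))
```

For 5.1.0 the steps are 5 mod 3 = 2 (chapter), 1 mod 2 = 1 (section) and 0 mod 2 = 0 (paragraph).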

Page 18: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Two properties of extended Dewey

Find Ancestor Label: from the label of any element, we can derive the labels of all its ancestors.

Find Ancestor Name: from the label of any element, we can derive the tag names of all its ancestors.

These two properties enable us to design a new and efficient algorithm for XML twig pattern matching.
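The first property follows from the fact that every prefix of a Dewey label is the label of an ancestor. A minimal sketch (hypothetical function name; the empty string stands for the root label ε):

```python
def ancestor_labels(label):
    """Labels of all ancestors, nearest first; "" is the root label ε."""
    if not label:
        return []                   # the root has no ancestors
    parts = label.split(".")
    prefixes = [".".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]
    return prefixes + [""]

print(ancestor_labels("5.1.0"))
```

Combining these prefixes with the tag derivation of the previous slides gives the second property, the tag names of all ancestors.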

Page 19: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

A new algorithm: TJFast

For each leaf node n in the query, there exists a corresponding input stream Tn.

Tn contains the extended Dewey labels of the elements with tag n, arranged in document order.

For each branching node b of the twig pattern, there is a corresponding set Sb, which contains elements possibly involved in query answers. (What is the difference compared with TwigStackList?)

At any point during the computation, the size of the set Sb is bounded by the depth of the XML document.

Page 20: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

An example for TJFast algorithm

Query (twig figure): A with branches D and B, and C below B (paths A//D and A/B//C)

DTD:
a -> a*, d*, b*
b -> d*, c*
d -> c*

Document (tree figure), with extended Dewey labels:
a1: 0 (root)
a2: 0.0
d1: 0.0.1
a3: 0.3
d2: 0.3.1
b1: 0.3.2
c1: 0.3.2.1
b2: 0.5
d3: 0.5.0
c2: 0.5.0.0

Streams:
TD: 0.0.1, 0.3.1, 0.5.0
TC: 0.3.2.1, 0.5.0.0

Why are there only two streams?

A set for the branching node A: { }

Page 21: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

An example for TJFast algorithm

(Document, query, labels and streams TD and TC as on Page 20.)

By the finite state transducer of the extended Dewey labeling scheme, we derive:
0.0.1 → a1/a2/d1
0.3.2.1 → a1/a3/b1/c1

Set for the branching node A: { }

Page 22: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

An example for TJFast algorithm

(Document, query, labels and streams TD and TC as on Page 20.)

Both a1 and a3 are possibly involved in query answers. (Why not a2?)

Set for the branching node A: { }

Page 23: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

An example for TJFast algorithm

(Document, query, labels and streams TD and TC as on Page 20.)

Then we insert a1 and a3 into the set.

Set for the branching node A: {a1, a3}

Output path solutions:
A//D: (a1, d1)
A/B//C: (a3, b1, c1)

Page 24: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

An example for TJFast algorithm

(Document, query, labels and streams TD and TC as on Page 20.)

Move the cursor of TD from d1 to d2.

Set for the branching node A: {a1, a3}

Output path solutions:
A//D: (a1, d1), (a1, d2), (a3, d2)
A/B//C: (a3, b1, c1)

Page 25: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

An example for TJFast algorithm

(Document, query, labels and streams TD and TC as on Page 20.)

Move the cursor of stream TD from d2 to d3.

Set for the branching node A: {a1, a3}

Output path solutions:
A//D: (a1, d1), (a1, d2), (a3, d2), (a1, d3)
A/B//C: (a3, b1, c1)

Page 26: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

An example for TJFast algorithm

(Document, query, labels and streams TD and TC as on Page 20.)

Move the cursor of stream TC from c1 to c2.

Set for the branching node A: {a1, a3}

Output path solutions:
A//D: (a1, d1), (a1, d2), (a3, d2), (a1, d3)
A/B//C: (a3, b1, c1), (a1, b2, c2)

Page 27: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Sort and merge-join in TJFast

(Document and query as on Page 20.)

Phase 1. Intermediate paths
A//D: <a1, d1>, <a1, d2>, <a1, d3>, <a3, d2>
A/B//C: <a1, b2, c2>, <a3, b1, c1>

Join

Phase 2. Final solutions <A, D, B, C>:
<a1, d1, b2, c2>, <a1, d2, b2, c2>, <a1, d3, b2, c2>, <a3, d2, b1, c1>
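The whole example can be replayed with a simplified sketch that reads only the two leaf streams TD and TC, derives each leaf's path from its label, matches the root-to-leaf query paths, and joins on the branching node A. It is not the TJFast algorithm itself: it omits the set for the branching node, so phase 1 may emit extra intermediate paths (such as (a2, d1)) that TJFast would prune. All function names are hypothetical; elements are identified by their labels (0 is a1, 0.3 is a3, and so on).

```python
# DTD of the example: child tags in order, per parent tag.
DTD = {"a": ["a", "d", "b"], "b": ["d", "c"], "d": ["c"]}
TD = ["0.0.1", "0.3.1", "0.5.0"]       # labels of d1, d2, d3
TC = ["0.3.2.1", "0.5.0.0"]            # labels of c1, c2

def derive_path(label, root_tag="a"):
    """[(tag, label)] for every element from the root down to this one."""
    parts = label.split(".")
    tag, path = root_tag, [(root_tag, parts[0])]
    for i in range(1, len(parts)):
        children = DTD[tag]
        tag = children[int(parts[i]) % len(children)]
        path.append((tag, ".".join(parts[: i + 1])))
    return path

def match_path(query, path):
    """Embeddings of a root-to-leaf query [(axis, tag)] with the leaf pinned
    to the last element of path. axis is "/" or "//" (edge from the parent)."""
    results = []

    def extend(qi, pi, acc):
        if qi == 0:
            results.append(tuple(reversed(acc)))
            return
        axis, parent_tag = query[qi][0], query[qi - 1][1]
        positions = [pi - 1] if axis == "/" else range(pi - 1, -1, -1)
        for pj in positions:
            if pj >= 0 and path[pj][0] == parent_tag:
                extend(qi - 1, pj, acc + [path[pj][1]])

    if path[-1][0] == query[-1][1]:
        extend(len(query) - 1, len(path) - 1, [path[-1][1]])
    return results

A_D = [("", "a"), ("//", "d")]
A_B_C = [("", "a"), ("/", "b"), ("//", "c")]

# Phase 1: intermediate path solutions from the leaf streams only.
paths_d = [m for lab in TD for m in match_path(A_D, derive_path(lab))]
paths_c = [m for lab in TC for m in match_path(A_B_C, derive_path(lab))]

# Phase 2: join on the branching node A, giving <A, D, B, C> tuples.
final = sorted((a, d, b, c) for (a, d) in paths_d
               for (a2, b, c) in paths_c if a == a2)
for row in final:
    print(row)
```

The four tuples printed correspond to <a1, d1, b2, c2>, <a1, d2, b2, c2>, <a1, d3, b2, c2> and <a3, d2, b1, c1>.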

Page 28: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

TJFast+L

By applying the extended Dewey labeling scheme to the tag+level streaming scheme, we propose the TJFast+L algorithm, an extension of TJFast.

Two benefits of TJFast+L over TJFast:
– reduces I/O cost by reading fewer elements
– enlarges the optimal query classes

Page 29: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Optimal query classes

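Figure: two example twig queries (over nodes A, B, C, D) and the query classes for which each algorithm is optimal. Labels in the figure: “Only P-C in all edges”, “Only A-D in branching edges”, “Optimal Class of TJFast”, “Optimal Class of TJFast+L”.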

Page 30: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

XML twig query processing

Jiaheng Lu, Ting Chen, Tok Wang Ling: Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. CIKM 2004:533-542

Jiaheng Lu, Tok Wang Ling, Chee Yong Chan, Ting Chen: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching. VLDB 2005:193-204

Jiaheng Lu, Tok Wang Ling: Labeling and Querying Dynamic XML Trees. APWeb 2004:180-189

Jiaheng Lu, Ting Chen, Tok Wang Ling: TJFast: effective processing of XML twig pattern matching. WWW (Special interest tracks and posters) 2005:1118-1119

Jiaheng Lu, Tok Wang Ling, Tian Yu, Changqing Li, Wei Ni: Efficient Processing of Ordered XML Twig Pattern. DEXA 2005:300-309

Jiaheng Lu: Benchmarking Holistic Approaches to XML Tree Pattern Query Processing - (Extended Abstract of Invited Talk). DASFAA Workshops 2010:170-178

Tian Yu, Tok Wang Ling, Jiaheng Lu: TwigStackList-: A Holistic Twig Join Algorithm for Twig Query with Not-Predicates on XML Data. DASFAA 2006:249-263

Zhifeng Bao, Tok Wang Ling, Jiaheng Lu, Bo Chen: SemanticTwig: A Semantic Approach to Optimize XML Query Processing. DASFAA 2008:282-298

Ting Chen, Jiaheng Lu, Tok Wang Ling: On Boosting Holism in XML Twig Pattern Matching using Structural Indexing Techniques. SIGMOD 2005:455-466

……

Page 31: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Outline

XML data management XML twig query processing XML keyword search Graphical and interactive XML query processing

Page 32: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Background: XQuery vs. keyword search

Query: papers by “Mike”

XQuery (complicated):
for $a in doc("bib.xml")//author, $n in $a/name where $n = "Mike" return $a//inproceedings

Keyword search:
Mike, inproceedings

Page 33: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

The proposed keyword search returns the set of smallest trees containing all keywords.

Figure (bib tree): bib has two author elements, each with name, publications and hobby children; publications contain inproceedings, article and book entries, each with a title and a year.

Example keyword queries shown in the figure: “Mike hobby”, “article 2009”, “Paper”.

Page 34: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

XML keyword search

– Search intention identification
– Query result retrieval
– Result ranking
– Extend the original TF*IDF from text databases to XML databases, while capturing the hierarchical structure of XML data
– Detailed paper: Effective XML Keyword Search with Relevance Oriented Ranking. ICDE 2009:517-528 (one of the best papers, invited to the TKDE journal)

Page 35: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

XML keyword search

XML keyword search
– Inspired by IR-style keyword search on the web
– Enables users to access information in XML databases
– XML data is modeled as a rooted, labeled tree

Recent research efforts
– Efficiency
– Effectiveness

Page 36: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Effectiveness

Capture the user’s search intention
– Identify the target that the user intends to search for
– Infer the predicate constraint that the user intends to search via

Result ranking
– Rank the query results according to their objective relevance to the user’s search intention

Page 37: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

State of the Art Search semantics design

LCA (Lowest Common Ancestor): node v is an LCA of keyword set K = {w1, w2, …, wk} if the sub-tree rooted at v contains at least one occurrence of all keywords in K, after excluding the sub-elements that already contain all keywords in K.

SLCA (Smallest LCA): node v is an SLCA of keyword set K = {w1, w2, …, wk} if (1) v is an LCA of K, and (2) no proper descendant of v is an LCA of K.

XSeek: infers the search intention based on the concept of objects and an analysis of the matching between keywords and data nodes.
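A brute-force sketch of SLCA on a toy tree may help fix the definition. It uses the common formulation "the subtree contains all keywords and no proper descendant's subtree does", which is assumed here to coincide with conditions (1) and (2) above. The tree, its Dewey-style labels and the function names are all hypothetical; efficient algorithms such as those cited on the next slide avoid this quadratic scan.

```python
# Toy XML tree: Dewey-style label -> words at that node (tag name or text).
NODES = {
    "0": {"bib"},
    "0.0": {"author"},
    "0.0.0": {"name", "mike"},
    "0.0.1": {"hobby", "paperfolding"},
    "0.1": {"author"},
    "0.1.0": {"name", "john"},
    "0.1.1": {"hobby", "reading"},
}

def is_ancestor_or_self(a, b):
    """True if label a is a prefix (component-wise) of label b."""
    a_parts, b_parts = a.split("."), b.split(".")
    return b_parts[: len(a_parts)] == a_parts

def covers(v, keywords, nodes):
    """True if the subtree rooted at v contains every keyword."""
    return all(
        any(is_ancestor_or_self(v, u) and k in words for u, words in nodes.items())
        for k in keywords
    )

def slca(keywords, nodes=NODES):
    covering = [v for v in nodes if covers(v, keywords, nodes)]
    # keep only nodes with no proper descendant that also covers all keywords
    return sorted(
        v for v in covering
        if not any(u != v and is_ancestor_or_self(v, u) for u in covering)
    )

print(slca({"mike", "hobby"}))   # the first author subtree
print(slca({"mike", "john"}))    # only the root covers both names
```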

Page 38: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

State of the Art (cont)

Efficient result retrieval
– Designed based on a certain search semantics
– XKSearch, Multiway SLCA, etc.

Result ranking
– XRANK, XKSearch, EASE
– They only consider: structural compactness of matching results, keyword proximity, similarity at node level

Page 39: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Problems Unaddressed

User search intention is not adequately addressed!

Meaningfulness of query results
– SLCA is less meaningful in many cases

Keyword ambiguity problems
1. A keyword can appear both as an XML node type and as the text value of some other nodes
2. A keyword can appear in the text values of different XML node types and carry different meanings

Neither SLCA nor XSeek can address keyword ambiguity well

Page 40: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Problems: Keyword Ambiguity

Q = “customer, interest, art”

Ambiguity 1: customer, interest; Ambiguity 2: art
Intention: find customers whose interest is art
Less relevant or irrelevant results are also returned: C1, C3, B1’s title

Figure (storeDB tree, with customers and books subtrees):
customer C1: name “Mary Smith”, address with street “Art Street”, interest “fashion”
customer C2: name “John Martin”, interest “street art”
customer C3: name “Art Smith”, interest “rock music”
customer C4: name “Rock Davis”, interest “art”
book B1: title “Art of Customer Interest Care”
book B2: ...
(The figure also shows book authors, a publisher and customer purchases.)

Page 41: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Problems: Keyword Ambiguity (cont)

Q = “customer, interest, art”

“art” can be the value of an interest node (C2, C4), a name node (C3) or a street node (C1) of a customer, or of the title node of a book (B1).

“customer” can be the tag name of a customer node, or (part of) the value of the title of B1.

How to rank C1 to C4 and B1?

(Same storeDB figure as on Page 40.)

Page 42: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Objectives & Challenges

Challenges
I. How to decide which sub-tree(s) with appropriate node types can capture the information the user desires
II. How to return sub-trees of an appropriate size (i.e. containing enough but not overwhelming information)
III. How to rank those sub-trees by their relevance

• Address the following as a single problem
– Search intention identification
– Query result retrieval
– Result ranking
– Extend the original TF*IDF from text databases to XML databases, while capturing the hierarchical structure of XML data

Page 43: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Challenges

Difficulty in applying TF*IDF to XML
– An XML DB carries semantic information while a text DB contains pure text information. XML TF*IDF must be aware of the underlying semantics.

All contents of XML data are stored in leaf nodes only
– What is the analogy of a “flat document” in XML?
o Sub-trees are classified according to their prefix paths
– The normalization factor is not simply the size of the sub-tree
o The structure of sub-trees may also affect the ranks

Page 44: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Our Approach

Extend IR-style keyword search techniques (like TF*IDF) from text databases to XML databases, in order to capture the hierarchical structure of XML documents by analyzing statistics of the underlying XML data.

Major Contributions

1. Identify the user’s desired search-for node and search-via node(s) in a heuristic way
– Define XML TF (term frequency) and XML DF (document frequency)
– Confidence formulas for search-for/search-via candidates

2. Define XML TF*IDF similarity
– Propose 3 guidelines specifically for XML keyword search
– Take keyword ambiguity problems into account

3. Design a keyword search engine, XReal

Page 45: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Zhifeng Bao, Jiaheng Lu, Tok Wang Ling: XReal: an interactive XML keyword searching. CIKM 2010:1933-1934

Zhifeng Bao, Jiaheng Lu, Tok Wang Ling, Liang Xu, Huayu Wu: An Effective Object-Level XML Keyword Search. DASFAA 2010:93-109

Zhifeng Bao, Jiaheng Lu, Tok Wang Ling, Bo Chen: Towards an Effective XML Keyword Search. TKDE, 22(8):1077-1092 (2010)

Zhifeng Bao, Bo Chen, Tok Wang Ling, Jiaheng Lu: Demonstrating Effective Ranked XML Keyword Search with Meaningful Result Display. DASFAA 2009:750-754

Zhifeng Bao, Tok Wang Ling, Bo Chen, Jiaheng Lu: Effective XML Keyword Search with Relevance Oriented Ranking. ICDE 2009:517-528

Bo Chen, Jiaheng Lu, Tok Wang Ling: Exploiting ID References for Effective Keyword Search in XML Documents. DASFAA 2008:529-537

Jianjun Xu, Jiaheng Lu, Wei Wang, Baile Shi: Effective Keyword Search in XML Documents Based on MIU. DASFAA 2006:702-716

……

XML keyword search

Page 46: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Outline

XML data management XML twig query processing XML keyword search Graphical and interactive XML query processing

Page 47: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Graphical and interactive XML search

Auto-completion XML search Order-sensitive XML twig query XML query suggestion

Demo online: http://datasearch.ruc.edu.cn:8080/LotusX/

Page 48: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Outline

XML data management XML twig query processing XML keyword search XML Keyword refinement Graphical and interactive XML query processing

Approximate string matching Approximate string search Approximate member extraction

Page 49: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Motivation: Data Cleaning

Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope, Jan 2008

Real-world data is dirty

Typos

Inconsistent representations

(PO Box vs. P.O. Box)

Approximately check against

clean dictionary

Should clearly be “Niels Bohr”

Page 50: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Motivation: Record Linkage

Table 1:
Name | Hobbies | Address
Brad Pitt | … | …
Forest Whittacker | … | …
George Bush | … | …
Angelina Jolie | … | …
Arnold Schwarzenegger | … | …

Table 2:
Phone | Age | Name
… | … | Brad Pitt
… | … | Arnold Schwarzeneger
… | … | George Bush
… | … | Angelina Jolie
… | … | Forrest Whittaker

We want to link records belonging to the same entity

No exact match!

The same entity may have similar representations

Arnold Schwarzeneger versus Arnold Schwarzenegger

Forrest Whittaker versus Forest Whittacker

Page 51: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Motivation: Query Relaxation

http://www.google.com/jobs/britney.html

Errors in queries

Errors in data

Bring query and meaningful

results closer together

Actual queries gathered by Google

Page 52: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

What is Approximate String Search?

String collection (People):
Brad Pitt
Forest Whittacker
George Bush
Angelina Jolie
Arnold Schwarzeneger
…

Queries against the collection:
Find all entries similar to “Forrest Whitaker”
Find all entries similar to “Arnold Schwarzenegger”
Find all entries similar to “Brittany Spears”

What do we mean by “similar to”?
- Edit Distance
- Jaccard Similarity
- Cosine Similarity
- Dice
- Etc.

The “similar to” predicate can help our described applications!

How can we support these types of queries efficiently?

Page 53: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Approximate Query Answering

Main Idea: Use q-grams as signatures for a string

Example: sliding a window of size 2 over “irvine” gives the 2-grams {ir, rv, vi, in, ne}

Intuition: Similar strings share a certain number of grams

An inverted index on grams supports finding all data strings sharing enough grams with a query

Page 54: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Approximate Query Example

Query: “irvine”, Edit Distance 1
2-grams: {ir, rv, vi, in, ne}

Lookup grams (figure): an inverted index maps each 2-gram (tf, vi, ir, ef, rv, ne, un, in, …) to an inverted list of string IDs.

Each edit operation can “destroy” at most q grams.
Answers must share at least T = 5 - 1 * 2 = 3 grams.

T-Occurrence problem: find the elements occurring at least T = 3 times among the inverted lists. This is called list-merging. T is called the merging threshold.

Candidates = {1, 5, 9}
May have false positives
Need to compute real similarity
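The filter-and-verify pipeline can be sketched as follows. The string collection is made up for illustration (the slide's inverted lists are not recoverable), the function names are hypothetical, and the list merging is a plain counter rather than one of the optimized merging algorithms. The threshold follows the slide: T = (number of query grams) - k * q.

```python
from collections import Counter, defaultdict

def qgrams(s, q=2):
    """All q-grams of s obtained with a sliding window."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_index(strings, q=2):
    """Inverted index: gram -> list of ids of the strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in set(qgrams(s, q)):
            index[g].append(sid)
    return index

def edit_distance(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def search(query, strings, index, k, q=2):
    grams = qgrams(query, q)
    T = len(grams) - k * q                       # merging threshold
    if T <= 0:                                   # filter is useless: scan everything
        candidates = list(range(len(strings)))
    else:
        counts = Counter(sid for g in grams for sid in index.get(g, []))
        candidates = sorted(sid for sid, c in counts.items() if c >= T)
    # verification step removes false positives
    answers = [strings[sid] for sid in candidates
               if edit_distance(query, strings[sid]) <= k]
    return T, candidates, answers

# Hypothetical collection, not the one behind the slide's figure.
STRINGS = ["irvine", "irving", "ervine", "marvin", "online"]
INDEX = build_index(STRINGS)
T, candidates, answers = search("irvine", STRINGS, INDEX, k=1)
print(T, candidates, answers)
```

In this toy collection “marvin” shares three grams with the query and survives the filter, but is rejected by verification, which illustrates why candidates may contain false positives.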

Page 55: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Outline

XML data management XML twig query processing XML keyword search XML Keyword refinement Graphical and interactive XML query processing

Approximate string matching Approximate string search Approximate member extraction

Page 56: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Introduction: An Example

A dictionary of strings we are interested in E.g. product names, postal addresses…

We are going to locate their “approximate occurrence” in documents. See the meaning of “approximate occurrence” in

the following example:

Page 57: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Problem Definition

Given a dictionary R and a threshold δ, extract all proper substrings m from input documents S such that there exists r ∈ R with Similarity(r, m) ≥ δ (or Distance(r, m) ≤ k). Here we call r a piece of evidence for m.

Similarity() is a function measuring the similarity of two strings.
Strings are viewed as sets of tokens (words).
An example for Sim(): Jaccard similarity:

$$J(r, m) = \frac{wt(r \cap m)}{wt(r \cup m)}$$
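A minimal sketch of the problem definition, using weighted Jaccard similarity and a naive scan over all token substrings (no pruning). The dictionary and token weights are the ones that appear in the later camera example of this talk; the default weight for unknown tokens, the sample document and the function names are assumptions.

```python
# Token weights and dictionary from the camera example later in the talk.
WT = {"digital": 1, "camera": 1, "canon": 2, "nikon": 2, "slr": 2, "eos": 7, "5d": 9}
DEFAULT_WT = 1   # assumption: weight of tokens that are not in WT
R = ["canon eos 5d digital camera", "nikon digital slr camera", "canon slr camera"]

def wt(tokens):
    """Total weight of a set of tokens."""
    return sum(WT.get(t, DEFAULT_WT) for t in tokens)

def jaccard(r, m):
    """Weighted Jaccard: wt(r ∩ m) / wt(r ∪ m) over token sets."""
    r_set, m_set = set(r.split()), set(m.split())
    return wt(r_set & m_set) / wt(r_set | m_set)

def extract(document, dictionary, delta):
    """Naive extraction: test every token substring against every r."""
    tokens = document.split()
    found = []
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens) + 1):
            m = " ".join(tokens[i:j])
            for r in dictionary:
                if jaccard(r, m) >= delta:
                    found.append((m, r))     # r is the evidence for m
    return found

members = extract("new canon eos 5d camera on sale", R, delta=0.92)
print(members)
```

Here “canon eos 5d camera” is extracted with evidence r1, since J = 19/20 = 0.95. The quadratic number of substrings times the dictionary size is what makes pre-pruning necessary.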

Page 58: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Why pre-pruning is needed

We need evidence to decide whether a substring m should be extracted.
Simple verification against all dictionary strings may be inefficient.
Pre-pruning followed by post-verification is beneficial.
But should it be running-time-specific or filtering-power-specific? Less time or fewer survivors?

Page 59: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

The issue of compromise comes again

A balance between the two stages should be reached:

More (less) filtration time → stronger (weaker) filtration power → fewer (more) candidates → less (more) verification time

Overall performance = Tf + Tv ?

Page 60: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

State-of-the-art techniques: K-signature scheme

K-signature scheme
– Proposed by Chakrabarti et al. (SIGMOD 2008)
– Choose several top-weighted tokens in a string as signatures to represent it: s => Sig(s)
– Observation: if r cannot match m, r is likely to have insufficient signature overlap with m
– K is a parameter for tuning the filtration power

Potential evidence loss
– A counter-example was found for k = 3
– We tried and could only prove that it works for k = 1 and k = ∞

Page 61: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

State-of-the-art techniques: Inverted Signature-based Hashtable

– Proposed by Chakrabarti et al. (SIGMOD 2008)
– Each dictionary string is encoded into a solid 0-1 matrix
– A ‘1’ for each occurrence of a <token, sig-token> tuple (‘1’-rectangle)
– Bitwise-or all solid matrices to get the matrix of R

Observation: if m is an approximate member of R, the matrix of m must have enough intersections with that of R.

– Formalized into an NP-complete problem
– The solution causes too weak filtering power

Page 62: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Our proposed theorem

If Sim(m, r) ≥ δ, what do we have?

wt(Sig(m) ∩ Sig(r)) ≥ τ(m)

wt(Sig(m) ∩ Sig(r)) ≥ min{τ(m), τ(r)}

So the threshold
– does not remain constant
– involves unknown evidence

Our solution: use inverted lists to count sig-token overlaps. Note that sig-tokens usually have low document frequency (e.g. with IDF as weights).

Page 63: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Our algorithms and evaluations: EvSCAN, filtration by SIL

Signature-based Inverted Lists (SIL)
– Lists are indexed by sig-tokens
– Each sig-token of a string creates a node (containing the string’s id) in the corresponding list.

E.g. R = { r1 = “canon eos 5d digital camera”, r2 = “nikon digital slr camera”, r3 = “canon slr camera” }, wt(digital, camera, canon, nikon, slr, eos, 5d) = (1, 1, 2, 2, 2, 7, 9).

Figure: inverted lists keyed by the tokens 5d (9.0), canon (2.0), camera (1.0), eos (7.0), nikon (2.0) and slr (2.0); each list holds nodes with the ids (1, 2, 3) of the strings that have that token as a sig-token.
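A rough sketch of such signature-based inverted lists over the dictionary R above. The signature function is an assumption: it simply takes the K top-weighted tokens (ties broken alphabetically), in the spirit of the K-signature scheme, so the resulting lists need not coincide with those in the slide's figure. The function names and K = 3 are hypothetical, and no threshold test from the theorem is applied.

```python
from collections import defaultdict

WT = {"digital": 1, "camera": 1, "canon": 2, "nikon": 2, "slr": 2, "eos": 7, "5d": 9}
R = {1: "canon eos 5d digital camera", 2: "nikon digital slr camera", 3: "canon slr camera"}
K = 3   # hypothetical signature size

def signature(s, k=K):
    """Assumed signature: the k top-weighted tokens, ties broken by token name."""
    tokens = sorted(set(s.split()), key=lambda t: (-WT.get(t, 1), t))
    return tokens[:k]

def build_sil(dictionary, k=K):
    """One list per sig-token; each sig-token of a string adds the string's id."""
    lists = defaultdict(list)
    for sid in sorted(dictionary):
        for token in signature(dictionary[sid], k):
            lists[token].append(sid)
    return dict(lists)

def overlap_weights(m, sil, k=K):
    """Total weight of sig-tokens shared between m and each dictionary string."""
    overlap = defaultdict(int)
    for token in signature(m, k):
        for sid in sil.get(token, []):
            overlap[sid] += WT.get(token, 1)
    return dict(overlap)

SIL = build_sil(R)
print(SIL)
print(overlap_weights("canon eos 5d camera", SIL))
```

A filter would then keep only the strings whose overlap weight reaches the required threshold and verify those with the real similarity function.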

Page 64: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Approximate string matching

Jiaheng Lu, Jialong Han, Xiaofeng Meng: Efficient algorithms for approximate member extraction using signature-based inverted lists. CIKM 2009:315-324

Alexander Behm, Shengyue Ji, Chen Li, Jiaheng Lu: Space-Constrained Gram-Based Indexing for Efficient Approximate String Search. ICDE 2009:604-615

Chen Li, Jiaheng Lu, Yiming Lu: Efficient Merging and Filtering Algorithms for Approximate String Searches. ICDE 2008:257-266

Yuanzhe Cai, Gao Cong, Xu Jia, Hongyan Liu, Jun He, Jiaheng Lu, Xiaoyong Du: Efficient Algorithm for Computing Link-Based Similarity in Real World Networks. ICDM 2009:734-739

……

Page 65: XML data management and approximate string matching Jiaheng Lu Key Lab of Data Engineering and Knowledge Engineering Renmin University of China November.

Thank you

Q&A