Reverse Spatial and Textual k Nearest Neighbor Search Jiaheng Lu Renmin University of China Sep 6 2011 Presentation in HP Labs China.

Reverse Spatial and Textual k Nearest Neighbor Search

Jiaheng Lu

Renmin University of China

Sep 6 2011

Presentation in HP Labs China

Research experience

Associate Professor: Renmin University of China
XML data management, spatial data management, cloud data management

Post-doc: University of California, Irvine
Data integration, approximate string matching

PhD: National University of Singapore
XML data management

Outline

XML data management: XML twig query processing, XML keyword search

Approximate string matching

Reverse Spatial and Textual k Nearest Neighbor Search (SIGMOD 2011)

XML twig query processing

XPath: Section[Title]/Paragraph//Figure

Twig pattern: Section has children Title and Paragraph; Paragraph has descendant Figure.

XML twig query processing (Cont.)

Problem Statement

Given a query twig pattern Q and an XML database D, we need to compute ALL the answers to Q in D.

E.g., consider the following query and document:

Query: Section with title and figure

Document: [Figure: XML tree with nodes s1, s2, t1, t2, p1, f1]

Query solutions: (s1, t1, f1), (s2, t2, f1), (s1, t2, f1)
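The problem statement can be illustrated with a naive matcher. This is only a sketch: the tree shape below is an assumption (the slide's figure is not preserved), chosen so that it is consistent with the three listed solutions, and both query edges are assumed to be ancestor-descendant. It enumerates all combinations rather than using a holistic join such as TJFast.

```python
from itertools import product

# Hypothetical document tree: node id -> (tag, parent id).
# The shape is assumed; only the node names come from the slide.
DOC = {
    "s1": ("section", None),
    "t1": ("title", "s1"),
    "s2": ("section", "s1"),
    "t2": ("title", "s2"),
    "p1": ("paragraph", "s2"),
    "f1": ("figure", "p1"),
}

def is_ancestor(doc, anc, node):
    """Return True if `anc` is a proper ancestor of `node`."""
    parent = doc[node][1]
    while parent is not None:
        if parent == anc:
            return True
        parent = doc[parent][1]
    return False

def nodes_with_tag(doc, tag):
    return [n for n, (t, _) in doc.items() if t == tag]

def match_twig(doc):
    """Naively match the twig section//title, section//figure."""
    combos = product(
        nodes_with_tag(doc, "section"),
        nodes_with_tag(doc, "title"),
        nodes_with_tag(doc, "figure"),
    )
    return sorted(
        (s, t, f)
        for s, t, f in combos
        if is_ancestor(doc, s, t) and is_ancestor(doc, s, f)
    )

solutions = match_twig(DOC)
```

On this assumed tree, `solutions` contains the same three tuples as the slide. The cost of enumerating every combination is what the twig join algorithms in this part of the talk aim to avoid.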

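An example for TJFast algorithm

[Figure: a query twig over nodes A, B, C, D and a document tree with elements a1 to a3, b1 to b2, c1 to c2, d1 to d3, each labeled with an extended Dewey label (e.g. 0.0, 0.0.1, 0.3, 0.3.1, 0.3.2, 0.3.2.1, 0.5, 0.5.0, 0.5.0.0). The figure also shows the label streams TC and TD and the set kept for the branching node A.]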

XML twig query processing (Cont.)

Several efficient pattern matching algorithms:
• TJFast (VLDB 05)
• iTwigJoin (SIGMOD 05)
• TwigStackList (CIKM 04)
• TreeMatch (TKDE 10)

Current work: distributed XML twig pattern processing

XML twig query processing

Jiaheng Lu, Ting Chen, Tok Wang Ling: Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. CIKM 2004:533-542

Jiaheng Lu, Tok Wang Ling, Chee Yong Chan, Ting Chen: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching. VLDB 2005:193-204

Jiaheng Lu, Tok Wang Ling: Labeling and Querying Dynamic XML Trees. APWeb 2004:180-189

Jiaheng Lu, Ting Chen, Tok Wang Ling: TJFast: effective processing of XML twig pattern matching. WWW (Special interest tracks and posters) 2005:1118-1119

Jiaheng Lu, Tok Wang Ling, Tian Yu, Changqing Li, Wei Ni: Efficient Processing of Ordered XML Twig Pattern. DEXA 2005:300-309

Jiaheng Lu: Benchmarking Holistic Approaches to XML Tree Pattern Query Processing - (Extended Abstract of Invited Talk). DASFAA Workshops 2010:170-178

Tian Yu, Tok Wang Ling, Jiaheng Lu: TwigStackList-: A Holistic Twig Join Algorithm for Twig Query with Not-Predicates on XML Data. DASFAA 2006:249-263

Zhifeng Bao, Tok Wang Ling, Jiaheng Lu, Bo Chen: SemanticTwig: A Semantic Approach to Optimize XML Query Processing. DASFAA 2008:282-298

Ting Chen, Jiaheng Lu, Tok Wang Ling: On Boosting Holism in XML Twig Pattern Matching using Structural Indexing Techniques. SIGMOD 2005:455-466

……

Background: XQuery vs. keyword search

Query: papers by “Mike”

XQuery (complicated): for $a in doc("bib.xml")//author, $n in $a/name where $n = "Mike" return $a//inproceedings

Keyword search: Mike, inproceedings

XML keyword search

The proposed keyword search returns the set of smallest trees containing all keywords.

[Figure: a bib.xml tree. The root bib has author subtrees, each with name, publications (inproceedings, article, and book entries with title and year), and hobby. Example keywords shown on the slide: Mike, hobby, article, 2009, Paper.]
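As a rough illustration of "smallest trees containing all keywords", the sketch below finds the nodes whose subtree contains every keyword while no child subtree does. This is one common reading (often called SLCA) and is an assumption here, not necessarily the semantics used in the talk. The tree, its labels, and the whitespace-token matching are all hypothetical simplifications.

```python
def smallest_subtrees(tree, keywords):
    """Return Dewey-style paths of the smallest subtrees containing all keywords.

    tree: (label, [children]); a label holds the tag and any text, space separated.
    Matching is exact and case-sensitive on whitespace tokens (an assumption).
    """
    wanted = set(keywords)
    results = []

    def visit(node, path):
        label, children = node
        found = wanted & set(label.split())
        child_has_all = False
        for i, child in enumerate(children):
            sub_found, sub_all = visit(child, path + (i,))
            found |= sub_found
            child_has_all = child_has_all or sub_all
        has_all = found == wanted
        # Report this node only if no child subtree already covers all keywords.
        if has_all and not child_has_all:
            results.append(path)
        return found, has_all

    visit(tree, (0,))
    return results

# Hypothetical document, loosely modeled on the bib example.
TREE = ("bib", [
    ("author", [
        ("name Mike", []),
        ("publications", [
            ("inproceedings", [("title XML", []), ("year 2002", [])]),
        ]),
        ("hobby Paperfolding", []),
    ]),
    ("author", [
        ("name John", []),
        ("publications", [
            ("article", [("title Keyword Search in XML", []), ("year 2009", [])]),
        ]),
        ("hobby Reading", []),
    ]),
])

mike_hobby = smallest_subtrees(TREE, ["Mike", "hobby"])
xml_2009 = smallest_subtrees(TREE, ["XML", "2009"])
```

Here `mike_hobby` points at the first author subtree and `xml_2009` at the article element, rather than at the whole document, which is the point of returning the smallest trees.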

XML keyword search

Effectiveness

Capture user's search intention
• Identify the target that users intend to search for
• Infer the predicate constraint that users intend to search via

Result ranking
• Rank the query results according to their objective relevance to user search intention

Zhifeng Bao, Jiaheng Lu, Tok Wang Ling: XReal: an interactive XML keyword searching. CIKM 2010:1933-1934

Zhifeng Bao, Jiaheng Lu, Tok Wang Ling, Liang Xu, Huayu Wu: An Effective Object-Level XML Keyword Search. DASFAA 2010:93-109

Zhifeng Bao, Jiaheng Lu, Tok Wang Ling, Bo Chen: Towards an Effective XML Keyword Search. TKDE, 22(8):1077-1092 (2010)

Zhifeng Bao, Bo Chen, Tok Wang Ling, Jiaheng Lu: Demonstrating Effective Ranked XML Keyword Search with Meaningful Result Display. DASFAA 2009:750-754

Zhifeng Bao, Tok Wang Ling, Bo Chen, Jiaheng Lu: Effective XML Keyword Search with Relevance Oriented Ranking. ICDE 2009:517-528

Bo Chen, Jiaheng Lu, Tok Wang Ling: Exploiting ID References for Effective Keyword Search in XML Documents. DASFAA 2008:529-537

Jianjun Xu, Jiaheng Lu, Wei Wang, Baile Shi: Effective Keyword Search in XML Documents Based on MIU. DASFAA 2006:702-716

……

XML keyword search

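Outline

XML data management: XML twig query processing, XML keyword search

Approximate string matching

Reverse Spatial and Textual k Nearest Neighbor Search (SIGMOD 2011)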

Motivation: Data Cleaning

Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope, Jan 2008

Real-world data is dirty:
• Typos
• Inconsistent representations (PO Box vs. P.O. Box)

Approximately check against a clean dictionary

Should clearly be “Niels Bohr”

Motivation: Record Linkage

Name | Hobbies | Address
Brad Pitt | … | …
Forest Whittacker | … | …
George Bush | … | …
Angelina Jolie | … | …
Arnold Schwarzenegger | … | …

Phone | Age | Name
… | … | Brad Pitt
… | … | Arnold Schwarzeneger
… | … | George Bush
… | … | Angelina Jolie
… | … | Forrest Whittaker

We want to link records belonging to the same entity.

No exact match! The same entity may have similar representations:
• Arnold Schwarzeneger versus Arnold Schwarzenegger
• Forrest Whittaker versus Forest Whittacker

Motivation: Query Relaxation

http://www.google.com/jobs/britney.html

Actual queries gathered by Google

Errors in queries

Errors in data

Bring query and meaningful results closer together

What is Approximate String Search?

String Collection (People):
• Brad Pitt
• Forest Whittacker
• George Bush
• Angelina Jolie
• Arnold Schwarzeneger
• ………

Queries against collection:
• Find all entries similar to “Forrest Whitaker”
• Find all entries similar to “Arnold Schwarzenegger”
• Find all entries similar to “Brittany Spears”

What do we mean by similar to?
- Edit Distance
- Jaccard Similarity
- Cosine Similarity
- Dice
- Etc.

The similar to predicate can help our described applications!

How can we support these types of queries efficiently?

Approximate Query Answering

Main Idea: Use q-grams as signatures for a string

irvine → 2-grams {ir, rv, vi, in, ne} (obtained with a sliding window)

Intuition: Similar strings share a certain number of grams

Inverted index on grams supports finding all data strings sharing enough grams with a query
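The sliding-window extraction can be sketched in a few lines. The function name `qgrams` and the comparison string "irving" are illustrative choices, not taken from the slides.

```python
from collections import Counter

def qgrams(s, q=2):
    """Slide a window of width q over s and collect every substring."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def shared_grams(a, b, q=2):
    """Number of grams two strings have in common (counted as multisets)."""
    common = Counter(qgrams(a, q)) & Counter(qgrams(b, q))
    return sum(common.values())

grams = qgrams("irvine")
overlap = shared_grams("irvine", "irving")
```

`grams` reproduces the five 2-grams listed above. "irvine" and "irving" differ by a single substitution and still share four of them, which is the intuition behind using grams as signatures.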

Approximate Query Example

Query: “irvine”, Edit Distance 1
2-grams {ir, rv, vi, in, ne}

[Figure: an inverted index on 2-grams (tf, vi, ir, ef, rv, ne, un, in, ……). The query grams are looked up, and each gram points to an inverted list of string IDs.]

Each edit operation can “destroy” at most q grams, so answers must share at least T = 5 - 1 * 2 = 3 grams with the query.

T-Occurrence problem: Find elements occurring at least T=3 times among inverted lists. This is called list-merging. T is called merging-threshold.

Candidates = {1, 5, 9}. These may include false positives, so the real similarity still needs to be computed.
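The filter-then-verify flow above can be sketched as follows. The string collection is hypothetical (the slide's inverted lists are not recoverable), and the merge is a plain counting scan rather than one of the optimized list-merging algorithms from the cited papers.

```python
from collections import Counter, defaultdict

def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_index(strings, q=2):
    """Inverted index: gram -> {string id: occurrences of the gram in that string}."""
    index = defaultdict(dict)
    for sid, s in enumerate(strings):
        for gram, count in Counter(qgrams(s, q)).items():
            index[gram][sid] = count
    return index

def edit_distance(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def search(strings, index, query, k, q=2):
    """Return (candidates, answers) for edit-distance threshold k."""
    query_counts = Counter(qgrams(query, q))
    threshold = sum(query_counts.values()) - k * q  # merging threshold T
    if threshold <= 0:
        candidates = list(range(len(strings)))  # the count filter cannot prune
    else:
        shared = Counter()
        for gram, qc in query_counts.items():
            for sid, sc in index.get(gram, {}).items():
                shared[sid] += min(qc, sc)
        candidates = sorted(sid for sid, c in shared.items() if c >= threshold)
    answers = [sid for sid in candidates if edit_distance(strings[sid], query) <= k]
    return candidates, answers

# Hypothetical collection, not the one on the slide.
STRINGS = ["irvine", "irving", "marvin", "alpine", "ervine"]
INDEX = build_index(STRINGS)
candidates, answers = search(STRINGS, INDEX, "irvine", k=1)
```

In this toy collection "marvin" shares three grams with the query and so passes the filter, but its edit distance exceeds 1. It is a false positive that only the verification step removes.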

Approximate string matching

Jiaheng Lu, Jialong Han, Xiaofeng Meng: Efficient algorithms for approximate member extraction using signature-based inverted lists. CIKM 2009:315-324

Alexander Behm, Shengyue Ji, Chen Li, Jiaheng Lu: Space-Constrained Gram-Based Indexing for Efficient Approximate String Search. ICDE 2009:604-615

Chen Li, Jiaheng Lu, Yiming Lu: Efficient Merging and Filtering Algorithms for Approximate String Searches. ICDE 2008:257-266

Yuanzhe Cai, Gao Cong, Xu Jia, Hongyan Liu, Jun He, Jiaheng Lu, Xiaoyong Du: Efficient Algorithm for Computing Link-Based Similarity in Real World Networks. ICDM 2009:734-739

……

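Outline

XML data management: XML twig query processing, XML keyword search

Approximate string matching

Reverse Spatial and Textual k Nearest Neighbor Search (SIGMOD 2011)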

Motivation

If we add a new shop at Q, which shops will be influenced?

Influence factors:
• Spatial distance. Results: D, F
• Textual similarity (services/products...). Results: F, C

[Figure: map of existing shops labeled by category (food, clothes, sports) around the new shop Q]

Problems of finding Influential Sets

Traditional query: Reverse k nearest neighbor query (RkNN)

Our new query: Reverse spatial and textual k nearest neighbor query (RSTkNN)

Problem Statement

Spatial-Textual Similarity
• Describes the similarity between objects based on both spatial proximity and textual similarity.

Spatial-Textual Similarity Function
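The formula itself did not survive on this slide. One plausible form, given that the talk uses a weight α that varies per query, is a weighted sum of a spatial and a textual term. The normalization by a maximum distance d_max and the use of extended Jaccard for the text part are assumptions made here for illustration, not the authors' confirmed definition.

```latex
\[
\mathrm{SimST}(o_1, o_2) = \alpha \cdot \mathrm{SimS}(o_1, o_2)
  + (1 - \alpha) \cdot \mathrm{SimT}(o_1, o_2), \qquad 0 \le \alpha \le 1
\]
\[
\mathrm{SimS}(o_1, o_2) = 1 - \frac{\mathrm{dist}(o_1, o_2)}{d_{\max}}, \qquad
\mathrm{SimT}(o_1, o_2) = \frac{v_1 \cdot v_2}
  {\lVert v_1 \rVert^2 + \lVert v_2 \rVert^2 - v_1 \cdot v_2}
\]
```

For instance, with α = 0.6, an assumed d_max = 5, two objects at distance 2.5, and identical text vectors (so SimT = 1), this gives 0.6 · 0.5 + 0.4 · 1 = 0.7.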

Problem Statement (cont.)

An RSTkNN query finds the objects that have the query object as one of their k most spatial-textual similar objects.
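The definition can be made concrete with a brute-force sketch that checks, for every object, whether the query ranks among its k most similar objects. The similarity used here (a weighted sum of normalized Euclidean proximity and extended Jaccard) is a simplified stand-in and an assumption, so the output need not coincide with the result of the walkthrough later in the talk. The data are the example objects from the slides; text vectors are assumed to be non-zero.

```python
import math

def extended_jaccard(u, v):
    """Extended Jaccard similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sum(a * a for a in u) + sum(b * b for b in v) - dot)

def sim_st(o1, o2, alpha, max_dist):
    """Assumed spatial-textual similarity: weighted sum of proximity and text similarity."""
    (loc1, vec1), (loc2, vec2) = o1, o2
    spatial = 1.0 - math.dist(loc1, loc2) / max_dist
    return alpha * spatial + (1.0 - alpha) * extended_jaccard(vec1, vec2)

def rstknn_brute_force(objects, q, k, alpha):
    """Return ids of objects that have q among their k most similar objects."""
    ids = list(objects)
    # Normalize distances by the largest pairwise distance in the dataset.
    max_dist = max(math.dist(objects[a][0], objects[b][0]) for a in ids for b in ids)
    result = []
    for o in ids:
        sim_to_q = sim_st(objects[o], q, alpha, max_dist)
        others = sorted(
            (sim_st(objects[o], objects[p], alpha, max_dist) for p in ids if p != o),
            reverse=True,
        )
        kth_best = others[k - 1] if len(others) >= k else float("-inf")
        if sim_to_q >= kth_best:
            result.append(o)
    return result

# id -> ((x, y), text vector)
OBJECTS = {
    "p1": ((4.0, 4.0), (1, 1)),
    "p2": ((0.0, 1.0), (1, 1)),
    "p3": ((1.0, 0.0), (5, 5)),
    "p4": ((3.0, 2.5), (8, 8)),
    "p5": ((3.5, 1.5), (1, 1)),
}
Q = ((0.5, 2.5), (8, 8))

influenced = rstknn_brute_force(OBJECTS, Q, k=2, alpha=0.6)
```

This scan computes every pairwise similarity, so it is quadratic in the number of objects. Avoiding that cost is the motivation for the index and pruning strategy described in the rest of the talk.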

Related Work

• Pre-computing the kNN for each object (Korn et al., SIGMOD 2000; Yang et al., ICDE 2001)

• (Hyper) Voronoi cell/planes pruning strategy (Tao et al., VLDB 2004; Wu et al., PVLDB 2008; Kriegel et al., ICDE 2009)

• 60-degree-pruning method (Stanoi et al., SIGMOD 2000)

• Branch and Bound (based on Lp-norm metric space) (Achtert et al., SIGMOD 2006; Achtert et al., EDBT 2009)


Challenging Features:

• Lose Euclidean geometric properties.

• High dimension in text space.

• k and α are different from query to query.


Intersection and Union R-tree (IUR-tree)

[Figure: objects p1 to p5 and the query q in the x-y plane, grouped into leaf nodes N1 = {p1}, N2 = {p2, p3}, N3 = {p4, p5} under the root N4. Each entry stores an MBR, an object count, and a pair of intersection and union text vectors (IntUniVct).]

Object | Location (x, y) | Text vector
p1 | (4, 4) | ObjVct1 = (1, 1)
p2 | (0, 1) | ObjVct2 = (1, 1)
p3 | (1, 0) | ObjVct3 = (5, 5)
p4 | (3, 2.5) | ObjVct4 = (8, 8)
p5 | (3.5, 1.5) | ObjVct5 = (1, 1)
q | (0.5, 2.5) | ObjVctQ = (8, 8)

Node | MBR | Objects | Intersection vector | Union vector
N1 | [4,4][4,4] | 1 | IntVct1 = (1, 1) | UniVct1 = (1, 1)
N2 | [0,0][1,1] | 2 | IntVct2 = (1, 1) | UniVct2 = (5, 5)
N3 | [3,1.5][3.5,2.5] | 2 | IntVct3 = (1, 1) | UniVct3 = (8, 8)
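The intersection and union vectors of a node can be read as per-dimension minimum and maximum over the text vectors beneath it. That reading is an assumption, although it agrees with the IntVct and UniVct values shown for the example. A minimal sketch, with illustrative function names:

```python
def int_uni_vectors(vectors):
    """Per-dimension minimum (intersection) and maximum (union) of text vectors."""
    intersection = tuple(min(column) for column in zip(*vectors))
    union = tuple(max(column) for column in zip(*vectors))
    return intersection, union

def merge_entries(entries):
    """Combine child (intersection, union) pairs into the parent's pair."""
    intersections, unions = zip(*entries)
    return int_uni_vectors(intersections)[0], int_uni_vectors(unions)[1]

# Text vectors of the example objects, grouped by leaf node.
n1 = int_uni_vectors([(1, 1)])          # p1
n2 = int_uni_vectors([(1, 1), (5, 5)])  # p2, p3
n3 = int_uni_vectors([(8, 8), (1, 1)])  # p4, p5
n4 = merge_entries([n1, n2, n3])        # root
```

Because every object's vector lies between its node's intersection and union vectors, the pair can be used to bound the text similarity between a query and anything stored under that node without visiting the objects.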

Overview of Search Algorithm

RSTkNN Algorithm:
• Travel from the IUR-tree root
• Progressively update lower and upper bounds
• Apply search strategy: prune unrelated entries into Pruned; report entries that are results into Ans; add candidate objects to Cnd.

FinalVerification:
• For objects in Cnd, check whether they are results or not by updating the bounds for candidates using expanding entries in Pruned.
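The three-way decision in the search strategy can be sketched as a comparison between an entry's similarity to the query and the bounds on the similarity of its k-th nearest neighbor. The exact rule below is an assumption derived from the RSTkNN definition, not a transcription of the authors' algorithm, and the numbers in the usage lines are hypothetical.

```python
def classify_entry(min_sim_to_q, max_sim_to_q, knn_lower, knn_upper):
    """Decide what to do with an entry given its similarity bounds.

    min_sim_to_q, max_sim_to_q: bounds on the entry's similarity to the query.
    knn_lower, knn_upper: bounds on the similarity of the entry's k-th
    most similar object.
    """
    if max_sim_to_q < knn_lower:
        # Even in the best case, k other objects are more similar than q.
        return "pruned"
    if min_sim_to_q >= knn_upper:
        # Even in the worst case, q is among the k most similar objects.
        return "answer"
    # The bounds overlap: keep as a candidate (or expand an index node).
    return "candidate"

# Hypothetical bound values for illustration.
decisions = [
    classify_entry(0.10, 0.20, knn_lower=0.37, knn_upper=0.432),
    classify_entry(0.70, 0.70, knn_lower=0.21, knn_upper=0.619),
    classify_entry(0.30, 0.50, knn_lower=0.374, knn_upper=0.45),
]
```

Tighter bounds leave fewer entries in the undecided "candidate" state, which is why the later optimizations focus on tightening them.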

Example: Execution of the RSTkNN Algorithm on IUR-tree, given k=2, alpha=0.6

EnQueue(U, N4); initialize N4.CLs.

U: N4(0, 0)

DeQueue(U, N4)

Mutual-effect: N1 with N2; N1 with N3; N2 with N3

EnQueue(U, N2); EnQueue(U, N3); Pruned.add(N1)

U: N3(0.323, 0.619), N2(0.21, 0.619)
Pruned: N1(0.37, 0.432)

DeQueue(U, N3)

Mutual-effect: p4 with N2; p5 with p4, N2

Answer.add(p4); Candidate.add(p5)

U: N2(0.21, 0.619)
Pruned: N1(0.37, 0.432)
Answer: p4(0.21, 0.619)
Candidate: p5(0.374, 0.374)

DeQueue(U, N2)

Mutual-effect: p2 with p4, p5; p3 with p2, p4, p5

Answer.add(p2, p3); Pruned.add(p5)

Pruned: N1(0.37, 0.432), p5(0.374, 0.374)
Answer: p4, p2, p3

Since U and Cand are now both empty, the algorithm ends.

Results: p2, p3, p4.

Cluster IUR-tree: CIUR-tree

IUR-tree: Texts in an index node could be very different.

CIUR-tree: An enhanced IUR-tree by incorporating textual clusters.

[Figure: the IUR-tree of the running example extended with text clusters. Each object is assigned to a cluster (p1: C1, p2: C2, p3: C2, p4: C3, p5: C1), and each node entry additionally records the number of objects per cluster (N1: C1:1; N2: C2:2; N3: C1:1, C3:1).]

Optimizations

Motivation
• To give a tighter bound during CIUR-tree traversal
• To purify the textual description in the index node

Outlier Detection and Extraction (ODE-CIUR)
• Extract subtrees with outlier clusters
• Take the outliers into special account and calculate their bounds separately

Text-entropy based optimization (TE-CIUR)
• Define TextEntropy to depict the distribution of text clusters in an entry of CIUR-tree
• Travel first for the entries with higher TextEntropy, i.e., more diverse in texts
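The slide does not give the TextEntropy formula. A natural candidate, assumed here, is the Shannon entropy of the per-cluster object counts stored in a CIUR-tree entry. The sketch below computes it and orders entries so that the most textually diverse one is visited first, using the cluster counts from the running example.

```python
import math

def text_entropy(cluster_counts):
    """Shannon entropy (in bits) of the cluster distribution within an entry."""
    total = sum(cluster_counts.values())
    return sum(
        (count / total) * math.log2(total / count)
        for count in cluster_counts.values()
        if count > 0
    )

# Per-cluster object counts of the example entries.
ENTRIES = {
    "N1": {"C1": 1},
    "N2": {"C2": 2},
    "N3": {"C1": 1, "C3": 1},
}

# Visit entries with higher TextEntropy (more diverse texts) first.
visit_order = sorted(ENTRIES, key=lambda name: text_entropy(ENTRIES[name]), reverse=True)
```

An entry whose objects all fall in one cluster has entropy 0, while N3, whose two objects lie in different clusters, has entropy 1 bit and is therefore visited first.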

Experimental Study

Experimental Setup
• OS: Windows XP; CPU: 2.0GHz; Memory: 4GB
• Page size: 4KB; Language: C/C++

Compared Methods
• baseline, IUR-tree, ODE-CIUR, TE-CIUR, and ODE-TE

Datasets
• ShopBranches (Shop), extended from a small real dataset
• GeographicNames (GN), real data
• CaliforniaDBpedia (CD), generated by combining locations in California with documents from DBpedia

Metrics
• Total query time
• Page access number

Statistics | Shop | CD | GN
Total # of objects | 304,008 | 1,555,209 | 1,868,821
Total unique words in dataset | 3933 | 21,578 | 222,409
Average # words per object | 45 | 47 | 4

Scalability

[Figure: query time (sec) vs. dataset size (50K to 1050K) for baseline, IUR-Tree, ODE-CIUR, TE-CIUR, and ODE-TE. Panel (1): log-scale version. Panel (2): linear-scale version.]

Effect of k

[Figure: query time (sec) vs. k (1, 3, 5, 7, 9) for IUR-Tree, ODE-CIUR, TE-CIUR, and ODE-TE.]

Conclusion

• Propose a new query problem, RSTkNN.
• Present a hybrid index, the IUR-Tree.
• Show the enhanced variant CIUR-Tree and two optimizations, ODE-CIUR and TE-CIUR, to further improve search processing.

Current and future works

Distributed XML query processing

Cloud-based SQL Processing

Spatial and Temporal Keyword search

Thank you

Q&A
