Mar 27, 2015
Reverse Spatial and Textual k Nearest Neighbor Search
Jiaheng Lu
Renmin University of China
Sep 6 2011
Presentation in HP Labs China
Research experience
• Associate Professor: Renmin University of China. XML data management, spatial data management, cloud data management
• Post-doc: University of California, Irvine. Data integration, approximate string matching
• PhD: National University of Singapore. XML data management
Outline
• XML data management: XML twig query processing, XML keyword search
• Approximate string matching
• Reverse Spatial and Textual k Nearest Neighbor Search (SIGMOD 2011)
XML twig query processing
XPath: Section[Title]/Paragraph//Figure
Twig pattern: Section with child Title and child Paragraph; Paragraph with descendant Figure
XML twig query processing (Cont.)
Problem Statement
Given a query twig pattern Q and an XML database D, we need to compute ALL the answers to Q in D.
E.g. consider the following query and document:
Query: twig pattern with root Section and branches title and figure
Document: [Figure: XML tree with nodes s1, s2, t1, t2, p1, f1]
Query solutions: (s1, t1, f1), (s2, t2, f1), (s1, t2, f1)
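The problem statement can be illustrated with a naive matcher. This is only a sketch: the document tree shape below is an assumption (the slide's figure is not recoverable) chosen so that it yields the three listed solutions, and both query edges are assumed to be ancestor-descendant. It enumerates answers by brute force and is not one of the holistic algorithms discussed in this talk.

```python
from itertools import product

# Hypothetical document tree (parent -> children); an assumption, not the slide's figure.
CHILDREN = {
    "s1": ["t1", "s2"],
    "s2": ["t2", "p1"],
    "p1": ["f1"],
    "t1": [], "t2": [], "f1": [],
}
# Assumed element tag of every node.
TAG = {"s1": "section", "s2": "section", "t1": "title",
       "t2": "title", "p1": "paragraph", "f1": "figure"}


def descendants(node):
    """Yield every proper descendant of node, depth first."""
    for child in CHILDREN[node]:
        yield child
        yield from descendants(child)


def match_twig(root_tag, branch_tags):
    """Return all tuples (root, b1, b2, ...) where each b_i is a descendant of root."""
    answers = []
    for node, tag in TAG.items():
        if tag != root_tag:
            continue
        per_branch = [[d for d in descendants(node) if TAG[d] == b]
                      for b in branch_tags]
        answers.extend((node, *combo) for combo in product(*per_branch))
    return answers


solutions = match_twig("section", ["title", "figure"])
print(sorted(solutions))
```

Under the assumed tree, the matcher returns the same three tuples as the slide. Its cost grows with the product of the branch matches, which is the kind of intermediate blow-up that holistic twig joins aim to avoid.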
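An example for TJFast algorithm
[Figure: query twig over A, B, C, D and a document tree with nodes a1-a3, b1-b2, c1-c2, d1-d3. Nodes carry extended Dewey labels such as 0.0, 0.0.1, 0.3, 0.3.1, 0.3.2, 0.3.2.1, 0.5, 0.5.0, 0.5.0.0. The label streams TD and TC hold the labels 0.3.2.1, 0.5.0.0 and 0.0.1, 0.3.1, 0.5.0. A set is kept for the branching node A, initially { }.]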
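The labels in this example are Dewey-style paths, so structural relationships can be tested on the labels alone. The sketch below shows only the prefix property of Dewey labels; it does not model the part of extended Dewey that lets TJFast derive element names from a label, and the function names are illustrative.

```python
def parse_label(label):
    """Turn a dotted label such as '0.3.2.1' into a tuple of ints."""
    return tuple(int(part) for part in label.split("."))


def is_ancestor(a, b):
    """True if label a is a proper prefix of label b, i.e. a is an ancestor of b."""
    pa, pb = parse_label(a), parse_label(b)
    return len(pa) < len(pb) and pb[:len(pa)] == pa


def is_parent(a, b):
    """True if a is an ancestor of b exactly one level above it."""
    return is_ancestor(a, b) and len(parse_label(b)) == len(parse_label(a)) + 1


print(is_ancestor("0.3", "0.3.2.1"), is_parent("0.3.2", "0.3.2.1"))
```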
XML twig query processing (Cont.)
Several efficient pattern matching algorithms:
• TJFast (VLDB 05)
• iTwigJoin (SIGMOD 05)
• TwigStackList (CIKM 04)
• TreeMatch (TKDE 10)
Current work: distributed XML twig pattern processing
XML twig query processing
Jiaheng Lu, Ting Chen, Tok Wang Ling: Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. CIKM 2004:533-542
Jiaheng Lu, Tok Wang Ling, Chee Yong Chan, Ting Chen: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching. VLDB 2005:193-204
Jiaheng Lu, Tok Wang Ling: Labeling and Querying Dynamic XML Trees. APWeb 2004:180-189
Jiaheng Lu, Ting Chen, Tok Wang Ling: TJFast: effective processing of XML twig pattern matching. WWW (Special interest tracks and posters) 2005:1118-1119
Jiaheng Lu, Tok Wang Ling, Tian Yu, Changqing Li, Wei Ni: Efficient Processing of Ordered XML Twig Pattern. DEXA 2005:300-309
Jiaheng Lu: Benchmarking Holistic Approaches to XML Tree Pattern Query Processing - (Extended Abstract of Invited Talk). DASFAA Workshops 2010:170-178
Tian Yu, Tok Wang Ling, Jiaheng Lu: TwigStackList-: A Holistic Twig Join Algorithm for Twig Query with Not-Predicates on XML Data. DASFAA 2006:249-263
Zhifeng Bao, Tok Wang Ling, Jiaheng Lu, Bo Chen: SemanticTwig: A Semantic Approach to Optimize XML Query Processing. DASFAA 2008:282-298
Ting Chen, Jiaheng Lu, Tok Wang Ling: On Boosting Holism in XML Twig Pattern Matching using Structural Indexing Techniques. SIGMOD 2005:455-466
……
Background: XQuery vs. keyword search
Query: papers by "Mike"
XQuery (complicated):
for $a in doc("bib.xml")//author, $n in $a/name
where $n = "Mike"
return $a//inproceedings
Keyword search: Mike, inproceedings
XML keyword search
The proposed keyword search returns the set of smallest trees containing all keywords.
[Figure: example bib.xml tree. The root bib has two author subtrees, each with name, publications, and hobby; publications contain inproceedings, article, and book entries with title and year. Sample keywords: Mike, hobby, article, 2009, Paper.]
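One common way to make "smallest trees containing all keywords" concrete is the smallest lowest common ancestor (SLCA) semantics. The sketch below assumes that reading, which may differ from the semantics used in the work presented here. It also assumes nodes are identified by Dewey labels stored as tuples, and the labels in the usage example are invented. It is a brute-force illustration, not an efficient algorithm.

```python
from itertools import product


def lca(a, b):
    """Longest common prefix of two Dewey labels (tuples)."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]


def smallest_lcas(keyword_lists):
    """Brute-force SLCA: take the LCA of one match per keyword,
    then keep only the LCAs that have no other LCA below them."""
    lcas = set()
    for combo in product(*keyword_lists):
        node = combo[0]
        for other in combo[1:]:
            node = lca(node, other)
        lcas.add(node)
    return sorted(n for n in lcas
                  if not any(m != n and m[:len(n)] == n for m in lcas))


# Hypothetical match lists: where keyword 1 and keyword 2 occur in the tree.
matches_kw1 = [(0, 0, 0), (0, 1, 0, 0)]
matches_kw2 = [(0, 0, 1), (0, 1, 0, 1)]
roots = smallest_lcas([matches_kw1, matches_kw2])
print(roots)
```

In the usage example the root (0,) also covers both keywords, but it is dropped because smaller subtrees rooted at (0, 0) and (0, 1, 0) already contain them.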
XML keyword search
Effectiveness
Capture user's search intention
• Identify the target that users intend to search for
• Infer the predicate constraint that users intend to search via
Result ranking
• Rank the query results according to their objective relevance to the user's search intention
Zhifeng Bao, Jiaheng Lu, Tok Wang Ling: XReal: an interactive XML keyword searching. CIKM 2010:1933-1934
Zhifeng Bao, Jiaheng Lu, Tok Wang Ling, Liang Xu, Huayu Wu: An Effective Object-Level XML Keyword Search. DASFAA 2010:93-109
Zhifeng Bao, Jiaheng Lu, Tok Wang Ling, Bo Chen: Towards an Effective XML Keyword Search. TKDE, 22(8):1077-1092 (2010)
Zhifeng Bao, Bo Chen, Tok Wang Ling, Jiaheng Lu: Demonstrating Effective Ranked XML Keyword Search with Meaningful Result Display. DASFAA 2009:750-754
Zhifeng Bao, Tok Wang Ling, Bo Chen, Jiaheng Lu: Effective XML Keyword Search with Relevance Oriented Ranking. ICDE 2009:517-528
Bo Chen, Jiaheng Lu, Tok Wang Ling: Exploiting ID References for Effective Keyword Search in XML Documents. DASFAA 2008:529-537
Jianjun Xu, Jiaheng Lu, Wei Wang, Baile Shi: Effective Keyword Search in XML Documents Based on MIU. DASFAA 2006:702-716
……
Outline
• XML data management: XML twig query processing, XML keyword search
• Approximate string matching
• Reverse Spatial and Textual k Nearest Neighbor Search
Motivation: Data Cleaning
Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope, Jan 2008
Real-world data is dirty
Typos
Inconsistent representations (PO Box vs. P.O. Box)
Approximately check against a clean dictionary
Should clearly be “Niels Bohr”
Motivation: Record Linkage
Table 1:
Name                    Hobbies   Address
Brad Pitt               …         …
Forest Whittacker       …         …
George Bush             …         …
Angelina Jolie          …         …
Arnold Schwarzenegger   …         …

Table 2:
Phone   Age   Name
…       …     Brad Pitt
…       …     Arnold Schwarzeneger
…       …     George Bush
…       …     Angelina Jolie
…       …     Forrest Whittaker
We want to link records belonging to the same entity
No exact match!
The same entity may have similar representations
• Arnold Schwarzeneger versus Arnold Schwarzenegger
• Forrest Whittaker versus Forest Whittacker
Motivation: Query Relaxation
http://www.google.com/jobs/britney.html
Errors in queries
Errors in data
Bring query and meaningful results closer together
Actual queries gathered by Google
What is Approximate String Search?
String Collection (People):
• Brad Pitt
• Forest Whittacker
• George Bush
• Angelina Jolie
• Arnold Schwarzeneger
• …
Queries against collection:
• Find all entries similar to "Forrest Whitaker"
• Find all entries similar to "Arnold Schwarzenegger"
• Find all entries similar to "Brittany Spears"
What do we mean by similar to?
• Edit Distance
• Jaccard Similarity
• Cosine Similarity
• Dice
• Etc.
The similar to predicate can help our described applications!
How can we support these types of queries efficiently?
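Two of the measures listed above can be sketched in a few lines. This is a textbook dynamic-programming edit distance and a set-based Jaccard similarity, written for illustration only; the talk does not prescribe a particular implementation, and tokenizing on whitespace for Jaccard is an assumption.

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute or match
        prev = cur
    return prev[-1]


def jaccard(a, b):
    """Jaccard similarity of two token collections, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


print(edit_distance("Arnold Schwarzeneger", "Arnold Schwarzenegger"))
print(edit_distance("Forrest Whittaker", "Forest Whittacker"))
print(jaccard("PO Box".split(), "P.O. Box".split()))
```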
Approximate Query Answering
Main Idea: Use q-grams as signatures for a string
Example (sliding window): irvine → 2-grams {ir, rv, vi, in, ne}
Intuition: Similar strings share a certain number of grams
Inverted index on grams supports finding all data strings sharing enough grams with a query
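The sliding-window decomposition and the gram-based inverted index can be sketched as follows. The function names and the two-string collection are illustrative assumptions.

```python
from collections import defaultdict


def qgrams(s, q=2):
    """Slide a window of width q over s and collect the substrings."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q=2):
    """Map each gram to the sorted list of IDs of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram in set(qgrams(s, q)):
            index[gram].append(sid)
    return index


index = build_index(["irvine", "irving"])
print(qgrams("irvine"))
print(index["ir"], index["ne"])
```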
Approximate Query Example
Query: "irvine", Edit Distance 1
2-grams of the query: {ir, rv, vi, in, ne}
[Figure: inverted index mapping 2-grams (tf, vi, ir, ef, rv, ne, un, in, …) to inverted lists of string IDs; the query's grams are looked up and their lists are merged.]
Each edit operation can "destroy" at most q grams.
Answers must share at least T = 5 – 1 * 2 = 3 grams (5 query grams, 1 edit, q = 2).
T-Occurrence problem: Find elements occurring at least T=3 times among inverted lists. This is called list-merging. T is called merging-threshold.
Candidates = {1, 5, 9}
• May have false positives
• Need to compute real similarity
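The count filter and the verification step can be sketched end to end. The string collection below is invented for illustration and is not the collection behind the slide's inverted lists. Grams are counted with multiplicity so that repeated grams do not cause false negatives, and a non-positive threshold falls back to checking every string.

```python
from collections import Counter, defaultdict


def qgrams(s, q=2):
    """Bag of q-grams of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def edit_distance(s, t):
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def build_index(strings, q=2):
    """gram -> {string ID: number of occurrences of the gram in that string}."""
    index = defaultdict(dict)
    for sid, s in enumerate(strings):
        for gram, n in qgrams(s, q).items():
            index[gram][sid] = n
    return index


def approximate_search(query, strings, index, k=1, q=2):
    """Return (candidate IDs after the count filter, verified answers)."""
    threshold = (len(query) - q + 1) - k * q   # merging threshold T
    if threshold <= 0:                          # filter is useless, scan everything
        candidates = list(range(len(strings)))
    else:
        shared = Counter()
        for gram, n in qgrams(query, q).items():
            for sid, m in index.get(gram, {}).items():
                shared[sid] += min(n, m)
        candidates = sorted(sid for sid, c in shared.items() if c >= threshold)
    answers = sorted(strings[sid] for sid in candidates
                     if edit_distance(query, strings[sid]) <= k)
    return candidates, answers


# Hypothetical collection.
strings = ["irvine", "irving", "ervine", "marine", "divine", "vine"]
candidates, answers = approximate_search("irvine", strings, build_index(strings))
print(candidates, answers)
```

In this run "divine" and "vine" pass the count filter with exactly three shared grams but are at edit distance 2, so they are false positives removed by verification.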
Approximate string matching
Jiaheng Lu, Jialong Han, Xiaofeng Meng: Efficient algorithms for approximate member extraction using signature-based inverted lists. CIKM 2009:315-324
Alexander Behm, Shengyue Ji, Chen Li, Jiaheng Lu: Space-Constrained Gram-Based Indexing for Efficient Approximate String Search. ICDE 2009:604-615
Chen Li, Jiaheng Lu, Yiming Lu: Efficient Merging and Filtering Algorithms for Approximate String Searches. ICDE 2008:257-266
Yuanzhe Cai, Gao Cong, Xu Jia, Hongyan Liu, Jun He, Jiaheng Lu, Xiaoyong Du: Efficient Algorithm for Computing Link-Based Similarity in Real World Networks. ICDM 2009:734-739
……
Outline
• XML data management: XML twig query processing, XML keyword search
• Approximate string matching
• Reverse Spatial and Textual k Nearest Neighbor Search (SIGMOD 2011)
Motivation
If we add a new shop at Q, which shops will be influenced?
Influence factors:
• Spatial distance. Results: D, F
• Textual similarity (services/products...). Results: F, C
[Figure: map of existing shops labeled by category (food, clothes, sports) around the new shop Q]
Problems of finding Influential Sets
• Traditional query: Reverse k nearest neighbor query (RkNN)
• Our new query: Reverse spatial and textual k nearest neighbor query (RSTkNN)
Problem Statement
Spatial-Textual Similarity
• Describes the similarity between such objects based on both spatial proximity and textual similarity.
Spatial-Textual Similarity Function
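The function itself did not survive on this slide. As a rough sketch of what such a function can look like, the code below assumes a weighted sum with weight alpha (the α mentioned later in the talk), a spatial term normalized by a maximum distance, and extended Jaccard as the textual term. All three choices, and every name in the code, are assumptions for illustration and not necessarily the definition used in the paper.

```python
import math


def spatial_proximity(p, q, max_dist):
    """1 for identical locations, 0 at the assumed maximum distance max_dist."""
    return 1.0 - math.dist(p, q) / max_dist


def extended_jaccard(u, v):
    """Extended Jaccard (Tanimoto) similarity of two weighted term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    denom = sum(a * a for a in u) + sum(b * b for b in v) - dot
    return dot / denom if denom else 1.0


def sim_st(o1, o2, alpha, max_dist):
    """Assumed spatial-textual similarity; each object is (location, text_vector)."""
    (loc1, vec1), (loc2, vec2) = o1, o2
    return (alpha * spatial_proximity(loc1, loc2, max_dist)
            + (1 - alpha) * extended_jaccard(vec1, vec2))


a = ((0.0, 0.0), (1, 0))
b = ((3.0, 4.0), (0, 1))
print(sim_st(a, b, alpha=1.0, max_dist=10.0))   # purely spatial
print(sim_st(a, a, alpha=0.6, max_dist=10.0))   # identical objects
```

With alpha = 1 the score depends only on distance, and with alpha = 0 only on text, which is why the choice of alpha changes which objects count as neighbors.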
Problem Statement (cont.)
An RSTkNN query finds the objects that have the query object as one of their k most spatial-textual similar objects.
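The definition can be checked against a brute-force baseline that needs no index. The sketch below takes the similarity function as a parameter; the one-dimensional shop positions and the purely spatial similarity in the usage example are invented for illustration.

```python
def rstknn_bruteforce(query, objects, k, sim):
    """Return the names of objects that would have `query` among their k most similar
    objects. `objects` maps a name to an object; `sim(a, b)` is any similarity function."""
    result = []
    for name, obj in objects.items():
        # Count existing objects that are strictly more similar to obj than the query is.
        better = sum(1 for other_name, other in objects.items()
                     if other_name != name and sim(obj, other) > sim(obj, query))
        if better < k:
            result.append(name)
    return result


# Hypothetical shops on a line; similarity is negative distance (text ignored).
shops = {"a": 0.0, "b": 1.0, "c": 10.0, "d": 11.0}


def closeness(x, y):
    return -abs(x - y)


influenced = rstknn_bruteforce(0.5, shops, k=1, sim=closeness)
print(influenced)
```

Each object is compared with every other object, so the baseline costs a quadratic number of similarity evaluations per query, which motivates the index-based search that follows.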
Related Work
• Pre-computing the kNN for each object (Korn et al., SIGMOD 2000; Yang et al., ICDE 2001)
• (Hyper) Voronoi cell/planes pruning strategy (Tao et al., VLDB 2004; Wu et al., PVLDB 2008; Kriegel et al., ICDE 2009)
• 60-degree-pruning method (Stanoi et al., SIGMOD 2000)
• Branch and Bound (based on Lp-norm metric space) (Achtert et al., SIGMOD 2006; Achtert et al., EDBT 2009)
Challenging Features:
• Lose Euclidean geometric properties.
• High dimension in text space.
• k and α are different from query to query.
Intersection and Union R-tree (IUR-tree)
[Figure: example IUR-tree over objects p1-p5 and query q(0.5, 2.5). The root N4 has entries N1, N2, N3; each entry stores an MBR, an object count, and a pointer to its intersection and union text vectors (IntUniVct).]

Objects (text vectors over word1, word2):
Object   x     y     Text vector
p1       4     4     ObjVct1 (1, 1)
p2       0     1     ObjVct2 (1, 1)
p3       1     0     ObjVct3 (5, 5)
p4       3     2.5   ObjVct4 (8, 8)
p5       3.5   1.5   ObjVct5 (1, 1)
q        0.5   2.5   ObjVctQ (8, 8)

Nodes:
Node   Objects   MBR                  Count   IntVct   UniVct
N1     p1        [4,4][4,4]           1       (1, 1)   (1, 1)
N2     p2, p3    [0,0][1,1]           2       (1, 1)   (5, 5)
N3     p4, p5    [3,1.5][3.5,2.5]     2       (1, 1)   (8, 8)
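The per-node summaries in the example can be recomputed from the objects. The sketch below assumes that the intersection vector is the element-wise minimum and the union vector the element-wise maximum of the text vectors under a node, which is consistent with the values in the figure. The function name is illustrative.

```python
def summarize_node(entries):
    """entries: list of (x, y, text_vector) tuples for the objects under one node.
    Returns (MBR, object count, intersection vector, union vector)."""
    xs = [x for x, _, _ in entries]
    ys = [y for _, y, _ in entries]
    vectors = [vec for _, _, vec in entries]
    mbr = ((min(xs), min(ys)), (max(xs), max(ys)))
    int_vct = tuple(min(ws) for ws in zip(*vectors))   # weight every object has at least
    uni_vct = tuple(max(ws) for ws in zip(*vectors))   # weight no object exceeds
    return mbr, len(entries), int_vct, uni_vct


n2 = summarize_node([(0, 1, (1, 1)), (1, 0, (5, 5))])        # p2, p3
n3 = summarize_node([(3, 2.5, (8, 8)), (3.5, 1.5, (1, 1))])  # p4, p5
print(n2)
print(n3)
```

The two vectors bracket every text vector in the subtree, which is what allows similarity bounds to be derived for a whole node without visiting its objects.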
Overview of Search Algorithm
RSTkNN Algorithm:
• Traverse from the IUR-tree root
• Progressively update lower and upper bounds
• Apply search strategy: put unrelated entries into Pruned; report entries that are results into Ans; add candidate objects to Cnd.
FinalVerification
• For objects in Cnd, check whether they are results or not by updating the bounds for candidates, expanding entries in Pruned.
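The three-way decision in the search strategy can be sketched as a comparison of bounds. The test below is an assumed reading of the slide: each entry has a lower and an upper bound on its similarity to its k-th most similar object, and a minimum and maximum similarity to the query. The parameter names and the exact inequalities are illustrative and may differ from the paper.

```python
def classify(min_sim_to_q, max_sim_to_q, knn_lower, knn_upper):
    """Decide what to do with an entry, given similarity bounds (all assumed inputs).

    knn_lower / knn_upper bound the similarity between the entry and its
    k-th most similar object; min/max_sim_to_q bound its similarity to the query."""
    if max_sim_to_q < knn_lower:
        return "pruned"      # the query can never be among the entry's k most similar objects
    if min_sim_to_q >= knn_upper:
        return "answer"      # the query is always among them
    return "candidate"       # bounds overlap, so the entry needs refinement


print(classify(0.10, 0.30, knn_lower=0.37, knn_upper=0.43))
print(classify(0.70, 0.80, knn_lower=0.21, knn_upper=0.62))
print(classify(0.30, 0.50, knn_lower=0.37, knn_upper=0.43))
```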
Example: Execution of the RSTkNN Algorithm on IUR-tree, given k=2, alpha=0.6
[Figure: IUR-tree with root N4 over N1 = {p1}, N2 = {p2, p3}, N3 = {p4, p5}, shown with the object table and query q(0.5, 2.5)]
Step 1: EnQueue(U, N4); Initialize N4.CLs;
U: N4(0, 0)
Example (cont.), Step 2: DeQueue(U, N4)
U before the step: N4(0, 0)
Mutual-effect: N1 and N2; N1 and N3; N2 and N3
EnQueue(U, N2); EnQueue(U, N3); Pruned.add(N1)
Pruned: N1(0.37, 0.432)
U: N3(0.323, 0.619), N2(0.21, 0.619)
Example (cont.), Step 3: DeQueue(U, N3)
U before the step: N3(0.323, 0.619), N2(0.21, 0.619)
Mutual-effect: p4 with N2; p5 with p4, N2
Answer.add(p4); Candidate.add(p5)
Pruned: N1(0.37, 0.432)
Answer: p4(0.21, 0.619)
Candidate: p5(0.374, 0.374)
Example (cont.), Step 4: DeQueue(U, N2)
U before the step: N2(0.21, 0.619)
Mutual-effect: p2 with p4, p5; p3 with p2, p4, p5
Answer.add(p2, p3); Pruned.add(p5)
Pruned: N1(0.37, 0.432), p5(0.374, 0.374)
Answer: p4, p2, p3
Since U and Cand are now both empty, the algorithm ends.
Results: p2, p3, p4.
Cluster IUR-tree: CIUR-tree
IUR-tree: Texts in an index node could be very different.
CIUR-tree: An enhanced IUR-tree by incorporating textual clusters.
[Figure: the example IUR-tree extended with cluster information. In this example ObjVct2 is (5, 5). Objects are assigned to text clusters: p1 in C1, p2 and p3 in C2, p4 in C3, p5 in C1. Each node entry records the number of objects per cluster: N1 has C1:1; N2 has C2:2; N3 has C1:1, C3:1.]
Optimizations
Motivation
• To give a tighter bound during CIUR-tree traversal
• To purify the textual description in the index node
Outlier Detection and Extraction (ODE-CIUR)
• Extract subtrees with outlier clusters
• Take the outliers into special account and calculate their bounds separately.
Text-entropy based optimization (TE-CIUR)
• Define TextEntropy to depict the distribution of text clusters in an entry of the CIUR-tree
• Traverse first the entries with higher TextEntropy, i.e. those more diverse in texts.
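The TextEntropy idea can be illustrated with Shannon entropy over the cluster distribution of an entry. This formula is an assumption, since the slide does not give the definition. The cluster assignments below follow the CIUR-tree example (N1 with C1:1, N2 with C2:2, N3 with C1:1 and C3:1).

```python
import math
from collections import Counter


def text_entropy(cluster_labels):
    """Shannon entropy (in bits) of the cluster distribution under one entry."""
    counts = Counter(cluster_labels)
    total = sum(counts.values())
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Cluster labels of the objects under each node, taken from the CIUR-tree example.
nodes = {"N1": ["C1"], "N2": ["C2", "C2"], "N3": ["C1", "C3"]}

# Visit the textually most diverse entries first.
order = sorted(nodes, key=lambda name: text_entropy(nodes[name]), reverse=True)
print(order)
```

N3 mixes two clusters and gets entropy 1, while N1 and N2 each hold a single cluster and get entropy 0, so N3 would be expanded first under this reading.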
Experimental Study
Experimental Setup
• OS: Windows XP; CPU: 2.0GHz; Memory: 4GB
• Page size: 4KB; Language: C/C++.
Compared Methods
• baseline, IUR-tree, ODE-CIUR, TE-CIUR, and ODE-TE.
Datasets
• ShopBranches (Shop), extended from a small real data set
• GeographicNames (GN), real data
• CaliforniaDBpedia (CD), generated by combining locations in California and documents from DBpedia.
Metric
• Total query time
• Page access number

Statistics                      Shop      CD          GN
Total # of objects              304,008   1,555,209   1,868,821
Total unique words in dataset   3,933     21,578      222,409
Average # words per object      45        47          4
Scalability
[Figure: query time (sec) vs. dataset size (50K, 300K, 550K, 800K, 1050K) for baseline, IUR-Tree, ODE-CIUR, TE-CIUR, and ODE-TE. (1) Log-scale version; (2) Linear-scale version.]
Effect of k
[Figure: query time (sec) vs. k (1, 3, 5, 7, 9) for IUR-Tree, ODE-CIUR, TE-CIUR, and ODE-TE.]
Conclusion
• Propose a new query problem RSTkNN.
• Present a hybrid index IUR-Tree.
• Show the enhanced variant CIUR-Tree and two optimizations ODE-CIUR and TE-CIUR to further improve search processing.
Current and future work
• Distributed XML query processing
• Cloud-based SQL Processing
• Spatial and Temporal Keyword search
Thank you
Q&A