A Pivotal Prefix Based Filtering Algorithm for String Similarity Search
Dong Deng, Guoliang Li, Jianhua Feng
Database Group, Tsinghua University
Presented by Dong Deng
Jan 29, 2016
Search is Important
Source: http://www.internetlivestats.com/google-search-statistics/
Google Searches per Year
Speed Matters
Source:
Data is Dirty
• Typos (source: DBLP Complete Search)
• Typo in “title”: relaxed vs. related
• Argyrios Zymnis vs. Argyris Zymnis
Similarity Search
Given a query and a string dataset, find all the strings similar to the query.
Edit Distance
• ED(r, s): the minimum number of edit operations (insertion, deletion, substitution) needed to transform r into s.
• For example: ED(sigcom, sigmod) = 2
sigcom → sigmom (substitute c with m) → sigmod (substitute m with d)
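The definition above can be checked with the standard dynamic program for edit distance. This is a minimal sketch; the function name `edit_distance` is illustrative and not taken from the slides.

```python
def edit_distance(r: str, s: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(s) + 1))  # distance from "" to each prefix of s
    for i, rc in enumerate(r, start=1):
        cur = [i]  # distance from r[:i] to ""
        for j, sc in enumerate(s, start=1):
            cur.append(min(
                prev[j] + 1,               # delete rc from r
                cur[j - 1] + 1,            # insert sc into r
                prev[j - 1] + (rc != sc),  # substitute, free when equal
            ))
        prev = cur
    return prev[-1]


print(edit_distance("sigcom", "sigmod"))  # 2
```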
Problem Definition
Query string s = “yotubecom” and τ = 2
string dataset R
ED(s, r4) ≤ 2, so output r4 as a result
Application
• Spell Checking
• Copy Detection
• Entity Linking
• Bioinformatics
• …
Challenge
Naïve Method
Time complexity: for each query
Filter-and-Verification Framework
Input: dataset R (indexed), threshold τ, query string s
Filter: Signature(s) ∩ Signature(r) = ϕ? If yes, prune r.
Verify: if no, ED(r, s) ≤ τ? If yes, add r to the results.
Preliminary: q-gram
• q-gram: a substring of length q
• Example: the 2-grams of youtbecom are yo, ou, ut, tb, be, ec, co, om
Preliminary: q-gram
• One edit operation destroys at most q q-grams.
• τ edit operations destroy at most qτ q-grams.
• If r and s have more than qτ mismatched q-grams, ED(r, s) > τ.
Example (q = 2): editing the b in youtbecom destroys only tb and be; yo, ou, ut, ec, co, om survive.
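The counting argument above can be tried directly. In this sketch the helper names (`qgrams`, `mismatched_grams`, `count_filter_prunes`) are illustrative, and q-grams are compared as multisets, which is an assumption about how repeated q-grams are handled.

```python
from collections import Counter


def qgrams(s: str, q: int) -> list[str]:
    """All substrings of length q, in order of start position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def mismatched_grams(r: str, s: str, q: int) -> int:
    """Number of q-grams of r (counted as a multiset) that have no counterpart in s."""
    missing = Counter(qgrams(r, q)) - Counter(qgrams(s, q))
    return sum(missing.values())


def count_filter_prunes(r: str, s: str, q: int, tau: int) -> bool:
    """True when the pair can be discarded: more than q*tau mismatched q-grams."""
    return mismatched_grams(r, s, q) > q * tau


print(qgrams("youtbecom", 2))
print(mismatched_grams("youtbecom", "yotubecom", 2))        # 3 (ou, ut, tb)
print(count_filter_prunes("youtbecom", "yotubecom", 2, 2))  # False: 3 <= 4
```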
Preliminary: Prefix Filter
Sort all q-grams by a global ordering, such as idf.
q(r): the sorted q-gram set of string r
q(s): the sorted q-gram set of string s
Pre(•) is the prefix of q(•), with |Pre(•)| = qτ+1; suffix(•) is the rest of q(•)
Prefix Filter: if pre(r) ∩ pre(s) = ϕ, then ED(r, s) > τ
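A minimal sketch of the prefix filter follows. Several things here are assumptions rather than content from the slides: the five data strings are reconstructed from the q-gram sets of the indexing example, the global order is ascending frequency with alphabetical tie-breaking (so the prefixes can differ from those drawn on the slides), and all helper names are illustrative.

```python
from collections import Counter

# Assumption: strings reconstructed from the q-gram sets in the indexing example.
DATA = ["imyouteca", "ubuntucom", "utubbecou", "youtbecom", "yoytubeca"]


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def make_rank(strings, q):
    """Global order: ascending frequency, ties broken alphabetically (assumed)."""
    freq = Counter(g for s in strings for g in qgrams(s, q))
    return lambda g: (freq.get(g, 0), g)  # unseen q-grams sort first


def prefix(s, q, tau, rank):
    """First q*tau+1 q-grams of s under the global order."""
    return sorted(qgrams(s, q), key=rank)[: q * tau + 1]


def prefix_filter_prunes(r, s, q, tau, rank):
    pre_r, pre_s = prefix(r, q, tau, rank), prefix(s, q, tau, rank)
    if min(len(pre_r), len(pre_s)) < q * tau + 1:
        return False  # too few q-grams: the filter gives no guarantee
    return not (set(pre_r) & set(pre_s))  # disjoint prefixes imply ED > tau


rank = make_rank(DATA, 2)
print(prefix("youtbecom", 2, 2, rank))
print(prefix_filter_prunes("imyouteca", "yotubecom", 2, 2, rank))  # True
print(prefix_filter_prunes("youtbecom", "yotubecom", 2, 2, rank))  # False
```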
Preliminary: Prefix Filter (example)
Prefix Filter: if pre(r) ∩ pre(s) = ϕ, then ED(r, s) > τ
Figure: example rows of sorted q-grams (g1 g2 g5 g6 g11 g12 g13 and g3 g4 g7 g8 g9 g10 g12) with Pre(r), Pre(s), and suffix(r) marked; the q-grams drawn as >g10 all rank after g10.
Preliminary: disjoint q-gram
• One edit operation destroys at most 1 disjoint q-gram.
• τ edit operations destroy at most τ disjoint q-grams.
• If r and s have more than τ mismatched disjoint q-grams, ED(r, s) > τ.
Pivotal Prefix Filter
Sort all q-grams by a global ordering, such as idf.
q(r): the sorted q-gram set of string r
q(s): the sorted q-gram set of string s
Piv(•) is the pivotal prefix of q(•): |Piv(•)| = τ+1 and the q-grams in Piv(•) are disjoint
If piv(s) ∩ pre(r) = ϕ and piv(r) ∩ pre(s) = ϕ, then ED(r, s) > τ
Pivotal Prefix Filter (case last(s) > last(r))
Pivotal Prefix Filter: if last(s) > last(r) and piv(r) ∩ pre(s) = ϕ, then ED(r, s) > τ
Figure: example rows of sorted q-grams (g5 g8 g10 and g1 g3 g6 g9 g11 g13) with Pre(r), Pre(s), Piv(r), Piv(s), last(r), last(s), and suffix(r) marked; the q-grams drawn as >g10 all rank after g10.
Pivotal Prefix Filter (case last(r) > last(s))
Pivotal Prefix Filter: if last(r) > last(s) and piv(s) ∩ pre(r) = ϕ, then ED(r, s) > τ
Figure: example rows of sorted q-grams (g1 g4 g6 g9 g12 g13 and g3 g7 g10 g11) with Pre(r), Pre(s), Piv(r), Piv(s), last(r), last(s), and suffix(r) marked; the q-grams drawn as >g10 all rank after g10.
Pivotal Prefix Filter
• If last(r) > last(s) and piv(s) ∩ pre(r) = ϕ, then ED(r, s) > τ
• If last(s) > last(r) and piv(r) ∩ pre(s) = ϕ, then ED(r, s) > τ
• Existence: there must exist τ+1 disjoint q-grams in the prefix
• The pivotal prefix is a subset of the prefix
– The pivotal prefix filter dominates the prefix filter
– Signature sizes are O(τ) and O(qτ), respectively
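The two conditions above can be sketched as a single check. The signatures below are read off the running example in the indexing and querying slides (τ = 2, q = 2); the pivotal prefixes of the data strings are inferred from the pivotal prefix index entries, so treat them as an assumption. When the two last q-grams are equal, the sketch falls back to the symmetric condition (both intersections empty). The function and variable names are illustrative.

```python
# Global gram order copied from the indexing example (ascending frequency).
GLOBAL_ORDER = ("im my te bu un nt uc bb tb oy yt ca om "
                "yo ou ut ub co tu be ec").split()
RANK = {g: i for i, g in enumerate(GLOBAL_ORDER)}


def rank(g):
    return RANK.get(g, -1)  # q-grams unseen in the dataset sort first


def pivotal_filter_prunes(pre_r, piv_r, pre_s, piv_s):
    """pre_* are prefixes sorted by the global order; piv_* are their pivotal prefixes."""
    last_r, last_s = rank(pre_r[-1]), rank(pre_s[-1])
    no_s_pivot_in_r = not (set(piv_s) & set(pre_r))
    no_r_pivot_in_s = not (set(piv_r) & set(pre_s))
    if last_r > last_s:
        return no_s_pivot_in_r
    if last_s > last_r:
        return no_r_pivot_in_s
    return no_s_pivot_in_r and no_r_pivot_in_s  # equal last q-grams


# (prefix, pivotal prefix) per data string; pivots inferred from the index figure.
SIGNATURES = {
    "r1": (["im", "my", "te", "ca", "yo"], ["im", "te", "ca"]),
    "r2": (["bu", "un", "nt", "uc", "om"], ["bu", "nt", "uc"]),
    "r3": (["bb", "ou", "ut", "ub", "co"], ["ou", "ut", "ub"]),
    "r4": (["tb", "om", "yo", "ou", "ut"], ["tb", "om", "yo"]),
    "r5": (["oy", "yt", "ca", "yo", "ub"], ["yt", "ca", "yo"]),
}
PRE_S, PIV_S = ["ot", "om", "yo", "ub", "co"], ["ot", "om", "ub"]

candidates = [rid for rid, (pre, piv) in SIGNATURES.items()
              if not pivotal_filter_prunes(pre, piv, PRE_S, PIV_S)]
print(candidates)  # ['r3', 'r4', 'r5']
```

Under these assumed signatures the surviving strings agree with the candidate list given on the final querying slide.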
Related Work
Method                  |Sig(r)|   |Sig(s)|
Prefix Filter           O(qτ)      O(qτ)
Mismatch Filter         O(qτ)      O(qτ)
Qchunk Filter           O(τ)       O(l)
Pivotal Prefix Filter   O(τ)       O(qτ)
• Mismatch Filter [Xiao VLDB08]: shortens the prefix length, but it is still O(qτ)
• Qchunk Filter [Qin SIGMOD11]: shortens one signature to O(τ) but increases the other to O(l)
• Adaptive Prefix [Wang SIGMOD12]
– Increases the prefix length to reduce the number of candidates
– Orthogonal and can be integrated into our method
• Flamingo [Li ICDE08]
– Based on the count filter; accelerates the counting process
– Orthogonal and can be integrated into our method
Pivotal Search Algorithm
• Indexing
– Build inverted indexes for both the prefix and the pivotal prefix of the data strings
• Querying
– Generate the prefix and pivotal prefix for the query string
– Probe the prefix index with the pivotal prefix of the query
– Probe the pivotal prefix index with the prefix of the query
– Verify the candidates and output results
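The indexing and querying steps above can be sketched end to end. This is an illustration under stated assumptions, not the authors' implementation: the class name `PivotalIndex` and its methods are hypothetical, the global order is ascending frequency with alphabetical tie-breaking, pivotal q-grams are chosen by taking every q-th prefix q-gram in position order (a simple way to get τ+1 disjoint q-grams, not the optimal selection), positions are ignored while probing, and the data strings are reconstructed from the q-gram sets of the indexing example.

```python
from collections import Counter, defaultdict


def edit_distance(r, s):
    prev = list(range(len(s) + 1))
    for i, rc in enumerate(r, start=1):
        cur = [i]
        for j, sc in enumerate(s, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (rc != sc)))
        prev = cur
    return prev[-1]


def positional_qgrams(s, q):
    return [(s[i:i + q], i) for i in range(len(s) - q + 1)]


class PivotalIndex:
    """Hypothetical sketch: two inverted indexes, one per signature type."""

    def __init__(self, strings, q, tau):
        self.strings, self.q, self.tau = list(strings), q, tau
        self.freq = Counter(g for r in self.strings
                            for g, _ in positional_qgrams(r, q))
        self.prefix_index = defaultdict(set)   # q-gram -> ids with it in pre(r)
        self.pivotal_index = defaultdict(set)  # q-gram -> ids with it in piv(r)
        self.last_rank = []                    # rank of last(pre(r)) per string
        for rid, r in enumerate(self.strings):
            pre, piv = self.signatures(r)
            self.last_rank.append(self.rank(pre[-1][0]))
            for g, _ in pre:
                self.prefix_index[g].add(rid)
            for g, _ in piv:
                self.pivotal_index[g].add(rid)

    def rank(self, g):
        return (self.freq.get(g, 0), g)  # assumed global order

    def signatures(self, s):
        size = self.q * self.tau + 1
        grams = sorted(positional_qgrams(s, self.q),
                       key=lambda gp: (self.rank(gp[0]), gp[1]))
        if len(grams) < size:
            raise ValueError("string too short for a prefix of q*tau+1 q-grams")
        pre = grams[:size]
        # Every q-th q-gram by start position: tau+1 pairwise disjoint q-grams.
        piv = sorted(pre, key=lambda gp: gp[1])[::self.q]
        return pre, piv

    def candidates(self, s):
        pre, piv = self.signatures(s)
        last_s = self.rank(pre[-1][0])
        found = set()
        for g, _ in piv:  # probe the prefix index with piv(s)
            found |= {rid for rid in self.prefix_index.get(g, ())
                      if self.last_rank[rid] >= last_s}
        for g, _ in pre:  # probe the pivotal prefix index with pre(s)
            found |= {rid for rid in self.pivotal_index.get(g, ())
                      if self.last_rank[rid] <= last_s}
        return found

    def search(self, s):
        return sorted(self.strings[rid] for rid in self.candidates(s)
                      if edit_distance(self.strings[rid], s) <= self.tau)


DATA = ["imyouteca", "ubuntucom", "utubbecou", "youtbecom", "yoytubeca"]
index = PivotalIndex(DATA, q=2, tau=2)
print(index.search("yotubecom"))  # ['youtbecom']
```

Because the tie-breaking and the pivot choice differ from the slides, the intermediate signatures and candidate set need not match the figures; the verified result does not depend on those choices.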
Pivotal Prefix Selection
Evaluating different pivotal prefixes: the longer the inverted lists we probe, the more candidates we may have.
For query string: $\min_{piv(s)} \sum_{g \in piv(s)} (\text{length of the inverted list of } g)$
For data string: $\min_{piv(r)} \sum_{g \in piv(r)} (\text{frequency of } g)$
Optimal Pivotal Prefix Selection
Dynamic programming.
Objective: select m = τ+1 optimal pivotal q-grams from the first n = qτ+1 q-grams in the prefix.
Cases, by which q-gram is selected as the last pivotal q-gram:
• Select the n-th q-gram as the last pivotal q-gram, and select m-1 optimal pivotal q-grams from the first n-1 q-grams in the prefix
• Select the (n-1)-th q-gram as the last pivotal q-gram, and select m-1 optimal pivotal q-grams from the first n-2 q-grams
• …
• Select the m-th q-gram as the last pivotal q-gram, and select m-1 optimal pivotal q-grams from the first m-1 q-grams
Recursive formula: $f(m, n) = \min_{m \le k \le n} \{ f(m-1, k-1) + w(g_k) \}$
The weight w is the length of the inverted list for the query string and the frequency for a data string.
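One way to realize such a dynamic program is sketched below. It is a variant under stated assumptions rather than the slide's exact recursion: the prefix q-grams are ordered by start position, and when a q-gram is taken as the last pivot the remaining pivots are drawn only from q-grams that end before it starts, which enforces disjointness. The function name `select_pivotal` and the example weights (global frequencies, with 0 for the unseen q-gram ot) are illustrative.

```python
import bisect
import math


def select_pivotal(prefix, weight, q, tau):
    """prefix: list of (gram, start position); weight: gram -> cost.
    Returns tau+1 pairwise disjoint q-grams of minimum total weight."""
    m = tau + 1
    items = sorted(prefix, key=lambda gp: gp[1])  # order by start position
    starts = [p for _, p in items]
    n = len(items)
    # f[i][j]: minimum weight of i disjoint q-grams among the first j items.
    f = [[0.0] * (n + 1)] + [[math.inf] * (n + 1) for _ in range(m)]
    take = [[False] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            g, p = items[j - 1]
            k = bisect.bisect_right(starts, p - q)  # items ending before p
            with_j = f[i - 1][k] + weight(g)
            if with_j < f[i][j - 1]:
                f[i][j], take[i][j] = with_j, True
            else:
                f[i][j] = f[i][j - 1]
    if math.isinf(f[m][n]):
        raise ValueError("no tau+1 disjoint q-grams in this prefix")
    chosen, i, j = [], m, n
    while i > 0:  # walk the table backwards to recover the selection
        if take[i][j]:
            g, p = items[j - 1]
            chosen.append((g, p))
            j = bisect.bisect_right(starts, p - q)
            i -= 1
        else:
            j -= 1
    return chosen[::-1]


# pre(s) for s = yotubecom (tau = 2, q = 2), with 1-based start positions.
pre_s = [("ot", 2), ("om", 8), ("yo", 1), ("ub", 4), ("co", 7)]
freq = {"om": 2, "yo": 3, "ub": 3, "co": 3}  # ot is unseen, weight 0
print(select_pivotal(pre_s, lambda g: freq.get(g, 0), q=2, tau=2))
# [('ot', 2), ('ub', 4), ('om', 8)]
```

With these weights the selection is {ot, ub, om}, the same set as piv(s) in the querying example.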
Filter-and-Verification Framework
Input: dataset R (indexed), threshold τ, query string s
Filter: Signature(s) ∩ Signature(r) = ϕ? If yes, prune r.
Verify: if no, does r pass the alignment filter? If yes, ED(r, s) ≤ τ? If yes, add r to the results.
Complexity Improvement: Improved from to
Alignment Filter
Intuition of Alignment Filter: suppose in the best case we need erri edit operations to transform to a substring of r, then
If
Alignment Filter
is the minimum edit distance between and any substring of r.
Substring edit distance (sed)
Alignment filter: If
Alignment Filter
Accelerating calculation:
• The computation complexity of sed(, r) is O().
• By position filter, can only align to a substring xi of r where |xi| < .
• Thus if , ED(r, s)
• The complexity is reduced to
Complexity Improvement: Improved from to
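The substring edit distance (sed) defined above, the minimum edit distance between a pattern and any substring of r, can be computed with a small change to the edit-distance dynamic program: the first row is all zeros, so a match may start anywhere in r, and the answer is the minimum of the last row, so it may end anywhere. The function name and the pattern argument `g` are illustrative.

```python
def substring_edit_distance(g: str, r: str) -> int:
    """min over all substrings x of r of ED(g, x)."""
    prev = [0] * (len(r) + 1)  # the empty pattern matches anywhere at no cost
    for i, gc in enumerate(g, start=1):
        cur = [i]  # aligning g[:i] with the empty substring costs i
        for j, rc in enumerate(r, start=1):
            cur.append(min(
                prev[j] + 1,               # drop gc from the pattern
                cur[j - 1] + 1,            # absorb rc from r
                prev[j - 1] + (gc != rc),  # match or substitute
            ))
        prev = cur
    return min(prev)  # the match may end at any position of r


print(substring_edit_distance("be", "youtbecom"))  # 0
print(substring_edit_distance("ub", "youtbecom"))  # 1
```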
Experiments
Settings:
• C++, g++ 4.8.2 with -O3 flags
• 64-bit Ubuntu Server 12.04 LTS
• Intel Xeon E5-2650 2.00GHz processor and 16GB memory
Evaluating Pivotal Prefix Filter
Figures: Average Search Time; Candidate Number
• Mismatch: from EDJoin
• CrossFilter: Cross Filter
• PivotalFilter: Pivotal Filter
• CrossSelect: CrossFilter + Pivotal Prefix Selection
• PivotalSearch: PivotalFilter + Pivotal Prefix Selection
Evaluating Alignment Filter
Figures: Average Search Time; Candidate Number
• NoFilter: without any filter
• ContentFilter: from EDJoin
• AlignFilter: Alignment Filter
• Real: number of results (Candidate Number figure only)
Comparison with the State of the Art
• PivotalSearch: our method
• Adaptive: [Wang2012]
• Flamingo: [Li2008]
• Qchunk: [Qin 2011]
Scalability
Conclusion
• Pivotal prefix filter
• Pivotal search algorithm
• Optimal pivotal prefix selection
• Alignment filter
THANK YOU
Q & A
Project homepage: http://dbgroup.cs.tsinghua.edu.cn/dd/pivotal.html
Outline
• Problem Definition
• Pivotal Prefix Filter
• The Similarity Search Algorithm
• Alignment Filter
• Experiment
• Conclusion
Pivotal Prefix Selection
Evaluating different pivotal prefixes: the longer the inverted lists we scan, the larger the filtering cost and the smaller the pruning power.
For query string: $\min_{piv(s)} \sum_{g \in piv(s)} |I^{+}[g]|$
For data string: $\min_{piv(r)} \sum_{g \in piv(r)} frequency[g]$
Existence of Pivotal Prefix: there must exist at least τ+1 disjoint q-grams in the prefix pre(r) for any string r
Complexity
• Space Complexity:
– Prefix Inverted Index Size:
– Pivotal Prefix Inverted Index Size:
• Query Time Complexity:
– Preprocess Query s:
– Probing Inverted Indexes: where is the average length of probed prefix inverted lists
• Verification Complexity: where c is the number of candidates and l is the average string length
Preliminary: Prefix Filter (example)
Prefix Filter: if pre(r) ∩ pre(s) = ϕ, then ED(r, s) > τ
Figure: example rows of sorted q-grams (g1 g2 g5 g6 g9 g10 g11 and g3 g4 g7 g8 g11 g12 g13) with Pre(r) and Pre(s) marked; the q-grams drawn as >g10 all rank after g10.
Alignment Filter
Non-consecutive errors: youtubecom vs. yoytupecxm. With q=3, the 3 non-consecutive errors destroy 8 q-grams.
Consecutive errors: youtubecom vs. youtzpxcom. With q=3, the 3 consecutive errors destroy only 5 q-grams.
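The two counts above can be reproduced by counting the q-grams of the original string that cover at least one changed position. This sketch assumes substitution-only errors (equal-length strings); the function name is illustrative.

```python
def destroyed_qgrams(original: str, edited: str, q: int) -> int:
    """Count q-grams of `original` that cover at least one substituted position."""
    if len(original) != len(edited):
        raise ValueError("this sketch only handles substitutions")
    errors = [i for i, (a, b) in enumerate(zip(original, edited)) if a != b]
    return sum(
        any(start <= e < start + q for e in errors)
        for start in range(len(original) - q + 1)
    )


print(destroyed_qgrams("youtubecom", "yoytupecxm", 3))  # 8 (non-consecutive errors)
print(destroyed_qgrams("youtubecom", "youtzpxcom", 3))  # 5 (consecutive errors)
```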
Indexing
• Fix a global gram order
We use gram frequency in ascending order (τ=2, q=2).
Global gram order:
gram       im my te bu un nt uc bb tb oy yt ca om yo ou ut ub co tu be ec
frequency  1  1  1  1  1  1  1  1  1  1  1  2  2  3  3  3  3  3  3  3  4
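A frequency-ascending gram order like the one above can be computed in a few lines. The five strings are an assumption (reconstructed from the q-gram sets in the indexing example), and ties are broken alphabetically here, so the order within each frequency group may differ from the slide.

```python
from collections import Counter

# Assumption: data strings reconstructed from the slide's q-gram sets.
DATA = ["imyouteca", "ubuntucom", "utubbecou", "youtbecom", "yoytubeca"]


def global_gram_order(strings, q):
    """Return (q-grams sorted by ascending frequency, frequency table)."""
    freq = Counter(s[i:i + q] for s in strings for i in range(len(s) - q + 1))
    return sorted(freq, key=lambda g: (freq[g], g)), freq


order, freq = global_gram_order(DATA, q=2)
for g in order:
    print(g, freq[g])
```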
Indexing
• Build inverted indexes for prefix and pivotal prefix
Sort the q-grams of each string and split them (τ=2, q=2). The first qτ+1 = 5 q-grams, before the bar, form pre(ri); last(pre(ri)) is the q-gram just before the bar.
q(r1): {im, my, te, ca, yo | ou, ut, ec}
q(r2): {bu, un, nt, uc, om | ub, co, tu}
q(r3): {bb, ou, ut, ub, co | tu, be, ec}
q(r4): {tb, om, yo, ou, ut | co, be, ec}
q(r5): {oy, yt, ca, yo, ub | tu, be, ec}
Figure labels: pre(ri), slt(ri), Piv(ri), last(pre(ri))
Indexing
• Build inverted indexes for prefix and pivotal prefix
Each entry <ri, p> records string ri and the start position p of the q-gram.
Pivotal prefix index I- (built from Piv(ri)):
im: <r1,1>
te: <r1,6>
bu: <r2,2>
nt: <r2,4>
uc: <r2,6>
tb: <r4,4>
yt: <r5,3>
ca: <r1,8> <r5,8>
om: <r4,8>
yo: <r4,1> <r5,1>
ou: <r3,8>
ut: <r3,1>
ub: <r3,3>
Prefix index I+ (built from pre(ri)):
im: <r1,1>
my: <r1,2>
te: <r1,6>
bu: <r2,2>
un: <r2,3>
nt: <r2,4>
uc: <r2,6>
bb: <r3,4>
tb: <r4,4>
oy: <r5,2>
yt: <r5,3>
co: <r3,7>
ca: <r1,8> <r5,8>
om: <r2,8> <r4,8>
yo: <r1,3> <r4,1> <r5,1>
ou: <r3,8> <r4,2>
ut: <r3,1> <r4,3>
ub: <r3,3> <r5,5>
Querying
• Generate prefix and pivotal prefix for the query string
s: yotubecom
pre(s): {ot, om, yo, ub, co}
piv(s): {ot, om, ub}
last(pre(s)): co
Querying
• Probe the prefix index with the pivotal prefix of the query
• Probe the pivotal prefix index with the prefix of the query
s: yotubecom, pre(s): {ot, om, yo, ub, co}, piv(s): {ot, om, ub}, last(pre(s)): co
Probing the prefix index I+ with piv(s): ot has no entries; om gives <r2,8> <r4,8>; ub gives <r3,3> <r5,5>.
Probing the pivotal prefix index I- with pre(s): ot and co have no entries; om gives <r4,8>; yo gives <r4,1> <r5,1>; ub gives <r3,3>.
Querying
• Verify the candidates and output results
s: yotubecom, pre(s): {ot, om, yo, ub, co}, piv(s): {ot, om, ub}
Candidates: r3, r4, r5
Result: r4