A Pivotal Prefix Based Filtering Algorithm for String Similarity Search

Dong Deng, Guoliang Li, Jianhua Feng

Database Group, Tsinghua University

Presented by Dong Deng


Search is Important

Source: http://www.internetlivestats.com/google-search-statistics/

Google Searches per Year

Speed Matters

Source:

Data is Dirty

• Typos

• Typo in “title”: relaxed / related

Argyrios Zymnis

Argyris Zymnis

DBLP Complete Search

Similarity Search

Query

String Dataset

All the strings similar to the query

• ED(r, s): The min number of edit operations (insertion/deletion/substitution) needed to transform r to s.

• For example: ED(sigcom, sigmod) = 2

Edit Distance

sigcom → sigmom (substitute c with m) → sigmod (substitute m with d)
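The definition above can be checked with a short sketch. This is the textbook two-row dynamic program for Levenshtein distance, not code from the talk; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(r: str, s: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(s) + 1))            # distance from "" to s[:j]
    for i, cr in enumerate(r, 1):
        cur = [i]                             # distance from r[:i] to ""
        for j, cs in enumerate(s, 1):
            cur.append(min(prev[j] + 1,       # delete cr
                           cur[j - 1] + 1,    # insert cs
                           prev[j - 1] + (cr != cs)))  # substitute or match
        prev = cur
    return prev[-1]


print(edit_distance("sigcom", "sigmod"))  # 2
```

The cost is O(|r|·|s|) per pair, which is why the rest of the talk focuses on avoiding this computation for most data strings.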

Problem Definition

Query string s = “yotubecom” and τ = 2

string dataset R

ED(s, r4) ≤ 2: output r4 as a result

Application

• Spell Checking
• Copy Detection
• Entity Linking
• Bioinformatics
• ….

Challenge

Naïve Method

Time complexity: for each query

Filter-and-Verification Framework

Inputs: dataset R (indexed), threshold τ, query string s

Filter: Signature(s) ∩ Signature(r) = ϕ? If yes, r is pruned; if no, r goes to verification.

Verify: ED(r, s) ≤ τ? If yes, r is output as a result.

Preliminary: q-gram

• q-gram: a substring of length q

Example (2-grams of youtbecom): yo, ou, ut, tb, be, ec, co, om
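A minimal sketch of q-gram extraction, assuming 0-based start positions (the index slides later in the deck use 1-based positions). The helper name `qgrams` is illustrative.

```python
def qgrams(s: str, q: int) -> list[tuple[str, int]]:
    """Return every (q-gram, start position) pair of s, positions 0-based."""
    return [(s[i:i + q], i) for i in range(len(s) - q + 1)]


print([g for g, _ in qgrams("youtbecom", 2)])
# ['yo', 'ou', 'ut', 'tb', 'be', 'ec', 'co', 'om']
```

A string of length l has l − q + 1 q-grams, so strings shorter than q have none.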

Preliminary: q-gram

• One edit operation destroys at most q grams.
• τ edit operations destroy at most qτ grams.
• If r and s have more than qτ mismatched grams, ED(r, s) > τ.

Example (q = 2): one edit at the character b of youtbecom destroys the two grams tb and be.

Preliminary: Prefix Filter

Sort all q-grams by a global ordering, such as idf.

q(r): The sorted q-gram set of string r
q(s): The sorted q-gram set of string s

Pre(•) is the prefix of q(•), with |Pre(•)| = qτ+1

Prefix Filter: If Pre(r) ∩ Pre(s) = ϕ, then ED(r, s) > τ

[Figure: two sorted q-gram sets split into Pre(•) and suffix(•), with |Pre(•)| = qτ+1]
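The prefix filter can be sketched in a few lines. The five data strings below are reconstructed from the q-gram sets shown on the indexing slides later in the deck (an inference, they are not spelled out there), and ties in the frequency order are broken alphabetically, which is an assumption of this sketch; the slides use a different tie order, so the prefixes here differ from the ones pictured. The sketch also assumes every string has at least qτ+1 q-grams.

```python
from collections import Counter

DATA = ["imyouteca", "ubuntucom", "utubbecou", "youtbecom", "yoytubeca"]
Q, TAU = 2, 2


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


# Global order: ascending gram frequency in the dataset, ties broken by gram.
FREQ = Counter(g for r in DATA for g in qgrams(r, Q))


def prefix(s, q=Q, tau=TAU):
    """The first q*tau+1 q-grams of s under the global order."""
    ordered = sorted(qgrams(s, q), key=lambda g: (FREQ[g], g))
    return set(ordered[:q * tau + 1])


def prefix_filter_prunes(r, s):
    """True when Pre(r) and Pre(s) are disjoint, which implies ED(r, s) > tau."""
    return prefix(r).isdisjoint(prefix(s))


query = "yotubecom"
survivors = [r for r in DATA if not prefix_filter_prunes(r, query)]
print(survivors)  # ['ubuntucom', 'utubbecou', 'youtbecom', 'yoytubeca']
```

Only one of the five strings is pruned here, which illustrates why a tighter signature is worth looking for.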

Preliminary: disjoint q-gram

• One edit operation destroys at most 1 disjoint gram.
• τ edit operations destroy at most τ disjoint grams.
• If r and s have more than τ mismatched disjoint grams, ED(r, s) > τ.

[Figure: example on youtbecom with one edited character and its disjoint q-grams]


Pivotal Prefix Filter

Sort all q-grams by a global ordering, such as idf.

Piv(•) is the pivotal prefix of q(•): |Piv(•)| = τ+1 and the q-grams in Piv(•) are disjoint

If Piv(s) ∩ Pre(r) = ϕ and Piv(r) ∩ Pre(s) = ϕ, then ED(r, s) > τ

[Figure: Pre(s), Piv(s), Piv(r) and suffix(r) marked on the sorted q-gram sets]



Pivotal Prefix Filter: If last(s) > last(r) and Piv(r) ∩ Pre(s) = ϕ, then ED(r, s) > τ

[Figure: sorted q-gram sets q(r) and q(s) with Pre(•), Piv(•), last(•) and suffix(r) marked, for the case last(s) > last(r)]



Pivotal Prefix Filter: If last(r) > last(s) and Piv(s) ∩ Pre(r) = ϕ, then ED(r, s) > τ

[Figure: sorted q-gram sets q(r) and q(s) with Pre(•), Piv(•), last(•) and suffix(r) marked, for the case last(r) > last(s)]

Pivotal Prefix Filter

If last(r) > last(s) and Piv(s) ∩ Pre(r) = ϕ, then ED(r, s) > τ
If last(s) > last(r) and Piv(r) ∩ Pre(s) = ϕ, then ED(r, s) > τ

• Existence: There must exist τ+1 disjoint grams in the prefix
• The Pivotal Prefix is a subset of the Prefix
  – The pivotal prefix filter dominates the prefix filter
  – Signature sizes are O(τ) and O(qτ) respectively
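The two rules above can be sketched as a pairwise test. Several choices here are assumptions of this sketch rather than details from the talk: ties in the global order are broken alphabetically, pivotal q-grams are picked by a left-to-right scan over positions (which always finds τ+1 disjoint grams among qτ+1 distinct positions), the case last(r) = last(s) is handled by the second rule, and every string is assumed to have at least qτ+1 q-grams. The data strings are reconstructed from the q-gram sets on the indexing slides.

```python
from collections import Counter

DATA = ["imyouteca", "ubuntucom", "utubbecou", "youtbecom", "yoytubeca"]
Q, TAU = 2, 2
FREQ = Counter(s[i:i + Q] for s in DATA for i in range(len(s) - Q + 1))


def order(g):
    """Global order key: ascending frequency, ties broken by the gram itself."""
    return (FREQ[g], g)


def signature(s, q=Q, tau=TAU):
    """Return (prefix grams, pivotal grams) of s."""
    grams = [(s[i:i + q], i) for i in range(len(s) - q + 1)]
    pre = sorted(grams, key=lambda gp: (order(gp[0]), gp[1]))[:q * tau + 1]
    piv, next_free = [], 0
    for g, p in sorted(pre, key=lambda gp: gp[1]):   # scan prefix by position
        if p >= next_free and len(piv) < tau + 1:    # keep only disjoint grams
            piv.append(g)
            next_free = p + q
    return [g for g, _ in pre], piv


def pivotal_filter_prunes(r, s):
    """True if the pivotal prefix filter proves ED(r, s) > tau."""
    pre_r, piv_r = signature(r)
    pre_s, piv_s = signature(s)
    if order(pre_r[-1]) > order(pre_s[-1]):          # last(r) > last(s)
        return set(piv_s).isdisjoint(pre_r)
    return set(piv_r).isdisjoint(pre_s)              # last(r) <= last(s)


query = "yotubecom"
print([r for r in DATA if not pivotal_filter_prunes(r, query)])
# ['utubbecou', 'youtbecom', 'yoytubeca']
```

On this small example the pivotal prefix filter keeps three strings, one fewer than the plain prefix filter keeps under the same global order.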

Related Work

Method                 |Sig(r)|  |Sig(s)|
Prefix Filter          O(qτ)     O(qτ)
Mismatch Filter        O(qτ)     O(qτ)
Qchunk Filter          O(τ)      O(l)
Pivotal Prefix Filter  O(τ)      O(qτ)

• Mismatch Filter [Xiao VLDB08]: Shortens the prefix length, but it is still O(qτ)
• Qchunk Filter [Qin SIGMOD11]: Shortens one signature to O(τ) but increases the other one to O(l)
• Adaptive Prefix [Wang SIGMOD12]
  – Increases the prefix length to reduce the number of candidates
  – Orthogonal and can be integrated into our method
• Flamingo [Li ICDE08]
  – Based on the count filter; accelerates the counting process
  – Orthogonal and can be integrated into our method

Pivotal Search Algorithm

• Indexing
  – Build inverted indexes for both the prefix and the pivotal prefix of the data strings

• Querying
  – Generate prefix and pivotal prefix for the query string
  – Probe the prefix index with the pivotal prefix of the query
  – Probe the pivotal prefix index with the prefix of the query
  – Verify the candidates and output results
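The indexing and querying steps above can be sketched end to end. This is an illustrative sketch under stated assumptions, not the authors' implementation: the class name `PivotalIndex`, the alphabetical tie-break in the global order, the position-scan choice of pivotal q-grams, and the plain dynamic-programming verification are all choices made here. It also assumes every string has at least qτ+1 q-grams. The data strings are reconstructed from the q-gram sets on the indexing slides.

```python
from collections import Counter, defaultdict


def edit_distance(r, s):
    prev = list(range(len(s) + 1))
    for i, cr in enumerate(r, 1):
        cur = [i]
        for j, cs in enumerate(s, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cr != cs)))
        prev = cur
    return prev[-1]


class PivotalIndex:
    def __init__(self, data, q, tau):
        self.data, self.q, self.tau = data, q, tau
        self.freq = Counter(s[i:i + q] for s in data for i in range(len(s) - q + 1))
        self.pre_index = defaultdict(set)   # gram -> ids of r with gram in Pre(r)
        self.piv_index = defaultdict(set)   # gram -> ids of r with gram in Piv(r)
        self.last = {}                      # id -> order key of last(Pre(r))
        for rid, r in enumerate(data):
            pre, piv = self.signature(r)
            self.last[rid] = self.order(pre[-1])
            for g in pre:
                self.pre_index[g].add(rid)
            for g in piv:
                self.piv_index[g].add(rid)

    def order(self, g):
        return (self.freq[g], g)            # frequency, then alphabetical

    def signature(self, s):
        q, tau = self.q, self.tau
        grams = [(s[i:i + q], i) for i in range(len(s) - q + 1)]
        pre = sorted(grams, key=lambda gp: (self.order(gp[0]), gp[1]))[:q * tau + 1]
        piv, next_free = [], 0
        for g, p in sorted(pre, key=lambda gp: gp[1]):
            if p >= next_free and len(piv) < tau + 1:
                piv.append(g)
                next_free = p + q
        return [g for g, _ in pre], piv

    def search(self, s):
        pre_s, piv_s = self.signature(s)
        last_s = self.order(pre_s[-1])
        cands = set()
        for g in piv_s:   # probe the prefix index with Piv(s)
            cands |= {i for i in self.pre_index.get(g, ()) if self.last[i] > last_s}
        for g in pre_s:   # probe the pivotal prefix index with Pre(s)
            cands |= {i for i in self.piv_index.get(g, ()) if self.last[i] <= last_s}
        cands = sorted(cands)
        results = [self.data[i] for i in cands
                   if edit_distance(self.data[i], s) <= self.tau]
        return [self.data[i] for i in cands], results


DATA = ["imyouteca", "ubuntucom", "utubbecou", "youtbecom", "yoytubeca"]
index = PivotalIndex(DATA, q=2, tau=2)
candidates, results = index.search("yotubecom")
print(candidates)  # ['utubbecou', 'youtbecom', 'yoytubeca']
print(results)     # ['youtbecom']
```

A practical version would also need to handle strings with fewer than qτ+1 q-grams, which this sketch leaves out.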

Pivotal Prefix Selection

Evaluating Different Pivotal Prefixes: The longer the inverted lists we probe, the more candidates we may have.

For query string:

$$\min_{piv(s)} \sum_{g \in piv(s)} \text{length of inverted list of } g$$

For data string:

$$\min_{piv(r)} \sum_{g \in piv(r)} \text{frequency of } g$$

Optimal Pivotal Prefix Selection

Dynamic Programming:

Objective: Select m = τ+1 optimal pivotal q-grams from the first n = qτ+1 q-grams in the prefix

Select one q-gram as the last pivotal q-gram, then solve one of the sub-problems:
  – Select m-1 optimal pivotal q-grams from the first n-1 q-grams in the prefix
  – Select m-1 optimal pivotal q-grams from the first n-2 q-grams
  – …
  – Select m-1 optimal pivotal q-grams from the first m-1 q-grams

Recursive formula:

$$f(m, n) = \min_{1 \le k \le m} \{\,\cdots\,\}$$

where the weight is the length of the inverted list for the query string and the frequency for a data string.
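One way to realise this selection is a standard dynamic program over the prefix q-grams sorted by position. The sketch below is a generic formulation written for illustration and is not necessarily the recurrence used by the authors; the function name `select_pivots` and the weights in the usage example are made up.

```python
def select_pivots(prefix, q, m):
    """Pick m pairwise-disjoint q-grams of minimum total weight.

    prefix: list of (gram, start position, weight) with distinct positions.
    Returns (total weight, chosen grams in position order), or
    (inf, None) when no m disjoint grams exist.
    """
    items = sorted(prefix, key=lambda t: t[1])        # order by position
    n = len(items)
    inf = float("inf")
    # best[i][j]: cheapest way to pick i disjoint grams among the first j items
    best = [[(0, [])] * (n + 1)] + [[(inf, None)] * (n + 1) for _ in range(m)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            g, p, w = items[j - 1]
            # k = number of earlier items that end before item j starts
            k = sum(1 for _, p2, _ in items[:j - 1] if p2 + q <= p)
            skip = best[i][j - 1]                     # do not use item j
            prev_w, prev_sel = best[i - 1][k]
            take = (prev_w + w, prev_sel + [g]) if prev_sel is not None else (inf, None)
            best[i][j] = min(skip, take, key=lambda t: t[0])
    return best[m][n]


# Hypothetical weights for the prefix of the query "yotubecom" (q = 2, tau = 2).
prefix = [("yo", 0, 3), ("ot", 1, 0), ("ub", 3, 2), ("co", 6, 2), ("om", 7, 1)]
print(select_pivots(prefix, q=2, m=3))  # (3, ['ot', 'ub', 'om'])
```

Sorting by position makes the set of grams that are disjoint from, and to the left of, the current gram a contiguous run of the list, which is what keeps the recurrence simple.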

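Filter-and-Verification Framework

Inputs: dataset R (indexed), threshold τ, query string s

Filter: Signature(s) ∩ Signature(r) = ϕ? If yes, r is pruned; if no, r goes to verification.

Verify: does r pass the alignment filter? If yes, is ED(r, s) ≤ τ? If yes, r is output as a result.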

Complexity Improvement: Improved from to

Alignment Filter

Intuition of Alignment Filter: suppose in the best case we need erri edit operations to transform to a substring of r, then

If

Alignment Filter

is the minimum edit distance between and any substring of r.

Substring edit distance (sed)
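A minimal sketch of substring edit distance, assuming the standard approximate-matching variant of the edit-distance dynamic program in which a match may start and end anywhere in r. The function name `sed` follows the slide; the argument name `g` (for the string being aligned) is an assumption.

```python
def sed(g: str, r: str) -> int:
    """Minimum edit distance between g and any substring of r."""
    prev = [0] * (len(r) + 1)          # an alignment may start anywhere in r
    for i, cg in enumerate(g, 1):
        cur = [i] + [0] * len(r)
        for j, cr in enumerate(r, 1):
            cur[j] = min(prev[j] + 1,              # delete cg
                         cur[j - 1] + 1,           # insert cr
                         prev[j - 1] + (cg != cr)) # substitute or match
        prev = cur
    return min(prev)                   # an alignment may end anywhere in r


print(sed("ub", "youtubecom"))  # 0, "ub" occurs in r
print(sed("ot", "youtbecom"))   # 1
```

This plain version costs O(|g|·|r|) per call.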

Alignment filter: If

Alignment Filter

Accelerating Calculation:
• The computation complexity of sed(, r) is O().
• By position filter, can only align to a substring xi of r where |xi|<.
• Thus if , ED(r, s)
• The complexity reduced to

Complexity Improvement: Improved from to

Experiments

Settings:
• C++, g++ 4.8.2 with -O3 flags
• 64-bit Ubuntu Server 12.04 LTS
• Intel Xeon E5-2650 2.00GHz processor and 16GB memory

Evaluating Pivotal Prefix Filter: Average Search Time

Mismatch: From EDJoin
CrossFilter: Cross Filter
PivotalFilter: PivotalFilter
CrossSelect: CrossFilter + Pivotal Prefix Selection
PivotalSearch: PivotalFilter + Pivotal Prefix Selection

Evaluating Pivotal Prefix Filter: Candidate Number

Mismatch: From EDJoin
CrossFilter: Cross Filter
PivotalFilter: PivotalFilter
CrossSelect: CrossFilter + Pivotal Prefix Selection
PivotalSearch: PivotalFilter + Pivotal Prefix Selection

Evaluating Alignment Filter: Average Search Time

NoFilter: without any filter
ContentFilter: From EDJoin
AlignFilter: Alignment Filter

Evaluating Alignment Filter: Candidate Number

NoFilter: without any filter
ContentFilter: From EDJoin
AlignFilter: Alignment Filter
Real: Number of results

Comparison with the State of the Art

PivotalSearch: Our method
Adaptive: [Wang2012]
Flamingo: [Li2008]
Qchunk: [Qin 2011]

Scalability

Conclusion

• Pivotal prefix filter
• Pivotal search algorithm
• Optimal pivotal prefix selection
• Alignment filter

THANK YOU

Q & A

Project homepage: http://dbgroup.cs.tsinghua.edu.cn/dd/pivotal.html

Outline

• Problem Definition
• Pivotal Prefix Filter
• The Similarity Search Algorithm
• Alignment Filter
• Experiment
• Conclusion



Pivotal Prefix Selection

Evaluating Different Pivotal Prefixes: The longer the inverted lists we scan, the larger the filtering cost is and the smaller the pruning power is.

For query string: the weight of a q-gram g is $|I^{+}[g]|$

For data string: the weight of a q-gram g is $\mathit{frequency}[g]$

Existence of Pivotal Prefix: There must exist at least τ+1 disjoint q-grams in the prefix Pre(r) for any string r

Complexity

• Space Complexity:
  – Prefix Inverted Index Size:
  – Pivotal Prefix Inverted Index Size:
• Query Time Complexity:
  – Preprocess Query s:
  – Probing Inverted Indexes: where is the average length of probed prefix inverted lists
• Verification Complexity: where c is the number of candidates and l is average string length




Alignment Filter

Non-consecutive errors:

youtubecom
yoytupecxm

q = 3: the 3 non-consecutive errors destroy 8 q-grams

Consecutive errors:

youtubecom
youtzpxcom

q = 3: the 3 consecutive errors destroy only 5 q-grams
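The two counts above can be reproduced with a small sketch. It assumes substitution-only edits, so both strings have the same length and q-grams can be compared position by position; the helper name `destroyed_qgrams` is illustrative.

```python
def destroyed_qgrams(original: str, edited: str, q: int) -> int:
    """Count positional q-grams of `original` that differ in `edited`.

    Assumes substitutions only, so both strings have equal length.
    """
    assert len(original) == len(edited)
    return sum(original[i:i + q] != edited[i:i + q]
               for i in range(len(original) - q + 1))


print(destroyed_qgrams("youtubecom", "yoytupecxm", 3))  # 8
print(destroyed_qgrams("youtubecom", "youtzpxcom", 3))  # 5
```

Both pairs are 3 edits apart, yet the number of destroyed q-grams differs, so a filter that only counts mismatched q-grams treats them very differently.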

Indexing

• Fix a global gram order

We use ascending gram frequency order (τ = 2, q = 2)

Global gram order

gram: im my te bu un nt uc bb tb oy yt ca om yo ou ut ub co tu be ec
freq:  1  1  1  1  1  1  1  1  1  1  1  2  2  3  3  3  3  3  3  3  4
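A sketch of how such a global order could be computed. The five data strings are reconstructed from the q-gram sets q(r1) to q(r5) shown on the following slides (an inference, they are not spelled out in the deck), and breaking ties alphabetically is an assumption; the slide lists tied grams in a different order.

```python
from collections import Counter

DATA = ["imyouteca", "ubuntucom", "utubbecou", "youtbecom", "yoytubeca"]
Q = 2

# Frequency of every q-gram over the whole dataset.
freq = Counter(r[i:i + Q] for r in DATA for i in range(len(r) - Q + 1))

# Ascending frequency, ties broken alphabetically.
global_order = sorted(freq, key=lambda g: (freq[g], g))

print(len(global_order))              # 21
print(global_order[-1], freq["ec"])   # ec 4
```

Rare grams come first, so the prefix of each string is made of its rarest q-grams and the probed inverted lists tend to be short.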

Indexing

• Build inverted indexes for prefix and pivotal prefix

Sort the q-grams of each string by the global gram order (τ = 2, q = 2):

q(r1): {im, my, te, ca, yo, ou, ut, ec}
q(r2): {bu, un, nt, uc, om, ub, co, tu}
q(r3): {bb, ou, ut, ub, co, tu, be, ec}
q(r4): {tb, om, yo, ou, ut, co, be, ec}
q(r5): {oy, yt, ca, yo, ub, tu, be, ec}

Pre(ri) is the first qτ+1 = 5 q-grams of q(ri).

[Figure: Pre(ri), Piv(ri), last(Pre(ri)) and slt(ri) marked on each sorted q-gram set]

[Figure: the pivotal prefix index, built from Piv(ri), and the prefix index, built from Pre(ri). Each inverted-list entry is a pair <ri, position of the q-gram in ri>, for example <r1,1> for im.]

Querying

• Generate prefix and pivotal prefix for the query string

s: yotubecom
Pre(s): {ot, om, yo, ub, co}
Piv(s): {ot, om, ub}
last(Pre(s)): co

Querying

• Probe the prefix index with the pivotal prefix of the query
• Probe the pivotal prefix index with the prefix of the query

s: yotubecom
Pre(s): {ot, om, yo, ub, co}
Piv(s): {ot, om, ub}

[Figure: the pivotal prefix index and the prefix index being probed with Pre(s) and Piv(s)]

Querying

• Verify the candidates and output results

Candidates: r3, r4, r5

Result: r4
