Pattern Matching Part Two: k-mismatches

Advanced Algorithms – COMS31900

Pattern matching part four

Pattern matching with at most k mismatches

Benjamin Sach

Pattern matching with mismatches

Input: A text string T (length n) and a pattern string P (length m)

Goal: For every alignment i, output Ham(i), the Hamming distance between P and T[i . . . i+m−1]

The Hamming distance is the number of mismatches. . .
i.e. the number of distinct j such that P[j] ≠ T[i+j]

[Figure: the text T (positions 0 to 12, length n) with the pattern P (length m) aligned beneath it]
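The definition can be written down directly as a brute-force baseline. This is only a sketch of the definition, taking O(nm) time; it is not one of the algorithms from the lectures. The function name and the example strings are illustrative assumptions.

```python
def hamming_at_every_alignment(text, pattern):
    """Return [Ham(0), ..., Ham(n - m)] by comparing characters directly."""
    n, m = len(text), len(pattern)
    return [
        # Ham(i) = number of j with P[j] != T[i + j]
        sum(1 for j in range(m) if pattern[j] != text[i + j])
        for i in range(n - m + 1)
    ]


# Illustrative strings (not the ones drawn on the slides).
print(hamming_at_every_alignment("abcabd", "abd"))  # [1, 3, 3, 0]
```

The output has one entry per alignment i = 0, ..., n − m.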


[Figure: the pattern P slides along the text T (positions 0 to 12); the frames show the following values]

Ham(4) = 1
Ham(5) = 4
Ham(6) = 1
Ham(7) = 3 (this is alignment 7)
Ham(8) = 3 (this is alignment 8)

Last lecture we saw two algorithms for this problem:

One algorithm takes O(n|Σ| log m) time (where |Σ| is the alphabet size)

The other algorithm takes O(n√m log m) time (regardless of the alphabet size)

Pattern matching with few mismatches (k-mismatch)

Input: A text string T (length n), a pattern string P (length m) and a positive integer k

Goal: For all i, output Hamk(i), where

Hamk(i) = Ham(i) if Ham(i) ≤ k
Hamk(i) = X if Ham(i) > k

Output the number of mismatches. . . unless it’s more than k
(we interpret the output X to mean “too many mismatches”)

[Figure: the text T (positions 0 to 12, length n) with the pattern P (length m) aligned beneath it]
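As a small sketch of this output convention: the only change from plain Hamming distance is that counting can stop as soon as the (k+1)-th mismatch is seen. The marker value "X" follows the slides; the function name, the constant name and the example strings are assumptions made for illustration.

```python
TOO_MANY = "X"  # marker for "more than k mismatches", as on the slides


def hamming_capped(text, pattern, i, k):
    """Return Ham_k(i): the mismatch count at alignment i, or "X" if it exceeds k."""
    mismatches = 0
    for j in range(len(pattern)):
        if pattern[j] != text[i + j]:
            mismatches += 1
            if mismatches > k:
                # No need to look further: the answer is already "X".
                return TOO_MANY
    return mismatches


print([hamming_capped("abcabd", "abd", i, 1) for i in range(4)])  # [1, 'X', 'X', 0]
```

This still scans up to m characters per alignment; the rest of the lecture is about avoiding that.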


[Figure: the pattern P slides along the text T (positions 0 to 12) with k = 2; the frames show the following values]

Hamk(4) = 1
Hamk(5) = X
Hamk(6) = 1
Hamk(7) = 2
Hamk(8) = X

• We could use the O(n√m log m) time algorithm for Hamming distance. . .
but when k is small we can do much better

LCP - the Longest Common Prefix

For any pair of locations i in T and j in P, LCP(i, j) is the largest ℓ such that

T[i . . . i+ℓ−1] = P[j . . . j+ℓ−1]

it’s the furthest you can go before hitting a mismatch

[Figure: a text T and a pattern P with several choices of i and j marked; in the examples shown LCP(i, j) returns 3, 4, 0 and 4]

We can preprocess P and T for LCP queries in O(n) time and O(n) space

Each query then takes O(1) time

we’ll see how to do this later in this lecture

First let’s see how we can use LCP queries to solve the k-mismatch problem. . .
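To pin down what an LCP query returns, here is a direct scan that follows the definition. It takes time proportional to the answer rather than O(1), so it only stands in for the preprocessed structure described in the slides. The function name and the example strings are assumptions.

```python
def lcp_naive(text, pattern, i, j):
    """Largest l such that text[i:i+l] == pattern[j:j+l], found by scanning."""
    length = 0
    while (
        i + length < len(text)
        and j + length < len(pattern)
        and text[i + length] == pattern[j + length]
    ):
        length += 1
    return length


print(lcp_naive("abcabd", "cabx", 2, 0))  # 3, because "cab" matches and then d != x
```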

k-mismatch using LCP queries

Find the leftmost (at most) k + 1 mismatches between T[i . . . i+m−1] and P
(we do this for each i separately)

We can do this using (at most) k + 1 LCP queries
each query takes O(1) time and finds a new mismatch

[Figure: worked example of the queries for one alignment i]
perform LCP(i, j) which returns 3
mismatch!
skip over this!
perform LCP(i′, j) which returns 3
(or we reach the end of P)

We can therefore calculate Hamk(i) for a single alignment i in O(k) time

Overall this takes O(nk) time (including the ‘preprocessing’ for LCP queries)

this is pretty good but we can do better

but first. . . how do we answer those LCP queries?
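The jump-and-skip procedure can be sketched as follows. The LCP oracle is passed in as a function, so any implementation can be plugged in; the naive scan used here is an assumption made to keep the example self-contained, and it does not give the O(1) query time that the O(nk) bound relies on. All names are illustrative.

```python
def lcp_naive(text, pattern, i, j):
    """Stand-in LCP oracle (linear scan, not O(1))."""
    length = 0
    while (i + length < len(text) and j + length < len(pattern)
           and text[i + length] == pattern[j + length]):
        length += 1
    return length


def k_mismatch_with_lcp(text, pattern, k, lcp):
    """Return [Ham_k(0), ..., Ham_k(n - m)] using at most k + 1 LCP queries per alignment."""
    n, m = len(text), len(pattern)
    result = []
    for i in range(n - m + 1):
        j = 0           # current position in P
        mismatches = 0
        while j < m:
            j += lcp(i + j, j)    # jump over the run of matching characters
            if j >= m:            # reached the end of P
                break
            mismatches += 1       # P[j] != T[i + j]
            if mismatches > k:    # found k + 1 mismatches: stop early
                break
            j += 1                # skip over the mismatch
        result.append(mismatches if mismatches <= k else "X")
    return result


oracle = lambda i, j: lcp_naive("abcabd", "abd", i, j)
print(k_mismatch_with_lcp("abcabd", "abd", 1, oracle))  # [1, 'X', 'X', 0]
```

Each pass through the inner loop makes one LCP query and either finishes the alignment or records one new mismatch, which is where the bound of k + 1 queries comes from.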

LCPs in Suffix Trees

[Figure: the suffix tree of the text T = bananas$, with the suffixes numbered 0 to 7 at the leaves and the root-to-node length written at each internal node]

Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries
in O(n) prep. time and space

What is the LCA of the leaves representing suffixes i and j?
it’s the node representing the longest common prefix of T[i . . . n−1] and T[j . . . n−1]

Single string LCP:
For any pair of locations i, j in T, LCPT(i, j) is the largest ℓ such that

T[i . . . i+ℓ−1] = T[j . . . j+ℓ−1]

We store the root-to-node length at each internal node
so we can recover the length, LCPT(i, j), in O(1) time

So we have O(n) space, O(n) prep. time and O(1) query time for the LCP problem on a single string

We can extend this to two strings (P and T) by first concatenating them together. . .
(and proceeding as for a single string)

So we also have O(n) space, O(n) prep. time and O(1) query time
for the LCP problem on two strings

I.e. for any i in T and j in P, LCP(i, j) is the largest ℓ such that

T[i . . . i+ℓ−1] = P[j . . . j+ℓ−1] (as we originally defined it)
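A compact way to experiment with O(1)-time LCP queries is sketched below. It deliberately swaps in a different but related technique: a suffix array with Kasai's LCP array and a sparse-table range minimum, instead of the suffix tree with LCA described in the slides. The preprocessing here is not linear time (the suffix array is built by plain sorting), the separator character is assumed not to occur in either string, and all names are illustrative.

```python
def build_lcp_oracle(s):
    """Return lcp(i, j): the longest common prefix length of s[i:] and s[j:]."""
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])  # suffix array (simple, not O(n))
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    # Kasai: h[r] = LCP of the suffixes at sorted positions r - 1 and r.
    h = [0] * n
    length = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + length < n and j + length < n and s[i + length] == s[j + length]:
                length += 1
            h[rank[i]] = length
            if length > 0:
                length -= 1
        else:
            length = 0
    # Sparse table: table[l][r] = min(h[r : r + 2**l]).
    table = [h]
    width = 1
    while 2 * width <= n:
        prev = table[-1]
        table.append([min(prev[r], prev[r + width]) for r in range(n - 2 * width + 1)])
        width *= 2

    def lcp(i, j):
        if i == j:
            return n - i
        lo, hi = sorted((rank[i], rank[j]))
        lo += 1  # the answer is min(h[lo .. hi]) in sorted order
        level = (hi - lo + 1).bit_length() - 1
        return min(table[level][lo], table[level][hi - (1 << level) + 1])

    return lcp


def build_two_string_lcp(text, pattern, sep="\x00"):
    """LCP between text[i:] and pattern[j:] via concatenation (sep must not occur in either)."""
    single = build_lcp_oracle(text + sep + pattern)
    offset = len(text) + 1
    return lambda i, j: single(i, offset + j)


lcp = build_lcp_oracle("bananas$")
print(lcp(1, 3))  # 3: "ananas$" and "anas$" share "ana"
```

The range minimum over the LCP array plays the role that the root-to-node length at the lowest common ancestor plays in the suffix tree.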

k-mismatch using frequent/infrequent symbols

Definition: A symbol is frequent if it occurs at least √k times in P,
and infrequent otherwise

[Figure: an example pattern P with positions 0 to 8 (m = 9) and k = 4 (√k = 2)]
a is frequent, b is frequent, d is frequent
c is infrequent

How many frequent symbols can there be? Lots! there could be m/√k frequent symbols

Case 1: There are fewer than 2√k frequent symbols in P.

Algorithm summary

Stage 0: Classify each symbol as frequent or infrequent - O(m log m) time

Stage 1: Count all matches involving frequent symbols (using cross-correlations as in last lecture) - O(n√k log m) time

Stage 2: Count all matches involving infrequent symbols (as in last lecture) - O(n√k) time

- O(n√k log m) total time
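The three stages of Case 1 might be sketched like this. Stage 1 uses a plain loop per frequent symbol as a stand-in for the FFT-based cross-correlation from the previous lecture, so this sketch does not achieve the stated running time; it only shows how the two match counts combine. The function names and example strings are assumptions.

```python
from collections import Counter, defaultdict


def classify_symbols(pattern, k):
    """Stage 0: split the symbols of P into frequent (count >= sqrt(k)) and infrequent."""
    counts = Counter(pattern)
    # count >= sqrt(k) is the same as count * count >= k, which avoids floating point.
    frequent = {c for c, cnt in counts.items() if cnt * cnt >= k}
    return frequent, set(counts) - frequent


def count_matches_case1(text, pattern, k):
    """Return matches[i] = number of j with P[j] == T[i + j], so Ham(i) = m - matches[i]."""
    n, m = len(text), len(pattern)
    frequent, infrequent = classify_symbols(pattern, k)
    matches = [0] * (n - m + 1)
    # Stage 1: one pass per frequent symbol (naive stand-in for a cross-correlation).
    for c in frequent:
        for i in range(n - m + 1):
            matches[i] += sum(1 for j in range(m) if pattern[j] == c and text[i + j] == c)
    # Stage 2: each infrequent symbol has fewer than sqrt(k) positions in P,
    # so every text character triggers fewer than sqrt(k) updates.
    positions = defaultdict(list)
    for j, c in enumerate(pattern):
        if c in infrequent:
            positions[c].append(j)
    for t, c in enumerate(text):
        for j in positions.get(c, ()):
            if 0 <= t - j <= n - m:
                matches[t - j] += 1
    return matches


matches = count_matches_case1("abcabd", "abd", 4)
print([3 - x for x in matches])  # Hamming distances: [1, 3, 3, 0]
```

Turning `matches` into Hamk(i) is then a matter of computing m − matches[i] and replacing values above k by X.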

Case 2: There are at least 2√k frequent symbols

Pick any 2√k frequent symbols and for each symbol pick √k occurrences in P.

This gives us 2k interesting pattern locations, denoted J

[Figure: an example with k = 4 and a pattern P with positions 0 to 13]
a is frequent, b is frequent, c is frequent, d is frequent
e and f are infrequent

J = {0, 2, 3, 4, 5, 7, 9, 10}
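Choosing J could look like the sketch below. The slides allow any 2√k frequent symbols and any √k occurrences of each; this version simply takes the first ones it encounters. It also assumes that k is a perfect square so that √k is an integer, which is a simplification made here and not a requirement stated in the slides. The function name and the example pattern are illustrative.

```python
from collections import defaultdict
from math import isqrt


def choose_interesting_locations(pattern, k):
    """Return the sorted list J of 2k interesting locations, or None if P has
    fewer than 2*sqrt(k) frequent symbols (i.e. we are in Case 1)."""
    root = isqrt(k)
    assert root * root == k, "this sketch assumes k is a perfect square"
    occurrences = defaultdict(list)
    for j, c in enumerate(pattern):
        occurrences[c].append(j)
    frequent = [c for c, pos in occurrences.items() if len(pos) >= root]
    if len(frequent) < 2 * root:
        return None
    chosen = []
    for c in frequent[: 2 * root]:           # any 2*sqrt(k) frequent symbols
        chosen.extend(occurrences[c][:root])  # any sqrt(k) occurrences of each
    return sorted(chosen)


# Illustrative pattern (not the one on the slides): a, b, c, d occur twice each.
print(choose_interesting_locations("abcdabcdef", 4))  # [0, 1, 2, 3, 4, 5, 6, 7]
```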

Let dk(i) be the number of j ∈ J such that P[j] = T[i+j]

i.e. the number of (single character) matches involving interesting pattern locations

[Figure: the example pattern aligned with a text T at i = 4, where dk(i) = 3]

Fact: if dk(i) < k then there are more than k mismatches (i.e. Hamk(i) = X)
because there are 2k interesting positions. . . and fewer than k of them match

Fact: There are at most n/√k values of i with dk(i) ≥ k
this follows from a counting argument:

For any location i′, T[i′] = P[j] for either 0 or √k distinct j ∈ J

This implies that Σ_i dk(i) ≤ Σ_i′ Σ_{j∈J} Eq(T[i′], P[j]) ≤ n√k
(Eq = 1 if T[i′] = P[j] and 0 otherwise)

Assume that more than n/√k values of i have dk(i) ≥ k

So Σ_i dk(i) ≥ (n/√k + 1) · k > n√k

Contradiction!

We can filter the text, using the dk(i) values, leaving only n/√k alignments to check
every other alignment has more than k mismatches

Check each of the remaining alignments using LCP queries in O(k) time per alignment

This takes n/√k · O(k) = O(n√k) total time.

How do we compute all the dk(i) values?

Computing all the dk(i) values





Let dk(i) = 0 for all i.

For each text character T[i′],
    For each j ∈ J such that P[j] = T[i′]
        increase dk(i′ − j) by one

T[i′] = P[j] for either 0 or √k distinct j ∈ J
(store a list of j values for each symbol)

This takes O(n√k) total time.
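The loop above translates almost line for line into code. The sketch below also includes the filtering step, keeping the alignments with dk(i) ≥ k; alignments outside 0, ..., n − m are ignored, which the pseudocode leaves implicit. The function names and the example strings are assumptions.

```python
from collections import defaultdict


def interesting_match_counts(text, pattern, J):
    """Return d where d[i] = number of j in J with P[j] == T[i + j]."""
    n, m = len(text), len(pattern)
    # Store a list of j values for each symbol.
    by_symbol = defaultdict(list)
    for j in J:
        by_symbol[pattern[j]].append(j)
    d = [0] * (n - m + 1)           # let d(i) = 0 for all i
    for t, c in enumerate(text):    # for each text character T[i']
        for j in by_symbol.get(c, ()):
            i = t - j
            if 0 <= i <= n - m:     # only count valid alignments
                d[i] += 1           # increase d(i' - j) by one
    return d


def surviving_alignments(d, k):
    """Alignments that may still have at most k mismatches."""
    return [i for i, count in enumerate(d) if count >= k]


# Illustrative example: J covers two occurrences each of a, b, c and d (k = 4).
d = interesting_match_counts("abcdabcdabcd", "abcdabcdef", [0, 1, 2, 3, 4, 5, 6, 7])
print(d, surviving_alignments(d, 4))  # [8, 0, 0] [0]
```

Only the surviving alignments then need to be checked with LCP queries.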

Pattern matching with k-mismatches: putting it all together

Algorithm summary

Preprocess P, T for LCP queries - O(n) time

Count the number of frequent symbols in P - O(m log m) time

Case 1: P has fewer than 2√k frequent symbols

    Count matches with frequent symbols using cross-correlations - O(n√k log m) time

    Count matches with infrequent symbols directly - O(n√k) time

Case 2: P has at least 2√k frequent symbols

    Filter the text, leaving n/√k alignments - O(n√k) time

    Count mismatches at these alignments using LCP queries - O(n√k) time

Overall, we obtain a time complexity of O(n√k log m).

- this can be improved to O(n√k log k)

Conclusion

Output the number of mismatches. . . unless it’s more than k
(we interpret the output X to mean “too many mismatches”)

Hamk(i) = Ham(i) if Ham(i) ≤ k
Hamk(i) = X if Ham(i) > k

[Figure: the example with k = 2, where Hamk(6) = 1]

We saw two algorithms for this problem:

One algorithm takes O(nk) time

The other algorithm takes O(n√k log m) time (improvable to O(n√k log k) time)
