String Matching with Mismatches


Some slides are stolen from Moshe Lewenstein (Bar Ilan University)


Landau – Vishkin 1986

Galil – Giancarlo 1986

Abrahamson 1987

Amir – Lewenstein – Porat 2000

Approximate String Matching

Problem: Find all text locations where the distance from the pattern is sufficiently small.

Distance metric: Hamming distance

Let S = s1s2…sm

R = r1r2…rm

Ham(S,R) = the number of locations j where sj ≠ rj

Example: S = ABCABC R = ABBAAC

Ham(S,R) = 2
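The definition can be checked directly on the example. A minimal sketch; the function name `hamming` is an illustrative choice, not something from the slides:

```python
def hamming(s: str, r: str) -> int:
    """Number of positions j where s[j] != r[j]; both strings must have equal length."""
    if len(s) != len(r):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(s, r))


print(hamming("ABCABC", "ABBAAC"))  # 2, the example above
```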

Problem 1: Counting mismatches

Input: T = t1 … tn

P = p1 … pm

Output: For each location i in T, Ham(P, ti ti+1 … ti+m-1)

Example (Ti denotes ti ti+1 … ti+m-1):

P = A B B A A C

T = A B C A A B C A C…

Ham(P,T1) = 2

Ham(P,T2) = 4

Ham(P,T3) = 6

Ham(P,T4) = 2

Output: 2, 4, 6, 2, …

Problem 2: k-mismatches

Input: T = t1 … tn

P = p1 … pm

Output: Every i where Ham(P, ti ti+1 … ti+m-1) ≤ k

Example: k = 2

P = A B B A A C

T = A B C A A B C A C…

Hamming distances: 2, 4, 6, 2, …

Output (1 where the distance is ≤ k): 1, 0, 0, 1, …

Naïve Algorithm (for the counting mismatches or k-mismatches problem)

- Go to each location i of the text and compute the Hamming distance of P and Ti

Running Time: O(nm), where n = |T|, m = |P|
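The naïve algorithm is short enough to write out in full. A minimal sketch with 0-based locations; the function names are illustrative:

```python
def count_mismatches_naive(text: str, pattern: str) -> list[int]:
    """Hamming distance between the pattern and every length-m window of the text: O(nm)."""
    n, m = len(text), len(pattern)
    return [
        sum(pattern[j] != text[i + j] for j in range(m))
        for i in range(n - m + 1)
    ]


def k_mismatch_naive(text: str, pattern: str, k: int) -> list[int]:
    """All 0-based locations whose window is within Hamming distance k of the pattern."""
    return [i for i, d in enumerate(count_mismatches_naive(text, pattern)) if d <= k]


print(count_mismatches_naive("ABCAABCAC", "ABBAAC"))  # [2, 4, 6, 2]
print(k_mismatch_naive("ABCAABCAC", "ABBAAC", 2))     # [0, 3]
```

Both problems share the same inner loop; the k-mismatches variant could stop a window early once it has seen k+1 mismatches, which does not change the O(nm) worst case.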

The Kangaroo Method (for k-mismatches)

Landau – Vishkin 1986

Galil – Giancarlo 1986

- Create suffix tree (+ LCA) for: s = P#T

- Check P at each location i of T by kangarooing

- Do up to k LCP queries for every text location

Example:

P = A B A B A A B A C A B

T = A B B A C A B A B A B C A B B C A B C A …

Preprocess:

Build suffix tree of both P and T - O(n+m) time

LCA preprocessing - O(n+m) time

Check P at given text location:

Kangaroo jump from mismatch to mismatch, O(1) per jump - O(k) time

Overall time: O(nk)
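The jumping pattern can be sketched as follows. This is only an illustration of the control flow: the LCP query here is answered by direct character comparison (a stand-in, named `lce_naive` for this sketch), not by the suffix tree with constant-time LCA that gives the O(nk) bound. The separator `#` is assumed not to occur in P or T.

```python
def lce_naive(s: str, i: int, j: int) -> int:
    """Length of the longest common prefix of s[i:] and s[j:].

    Stand-in for the O(1) suffix tree + LCA query; this version compares characters.
    """
    length = 0
    while i + length < len(s) and j + length < len(s) and s[i + length] == s[j + length]:
        length += 1
    return length


def kangaroo_k_mismatch(text: str, pattern: str, k: int, sep: str = "#") -> list[int]:
    """0-based text locations where the pattern matches with at most k mismatches."""
    m, n = len(pattern), len(text)
    s = pattern + sep + text   # s = P#T, sep assumed absent from both strings
    base = m + 1               # index in s where the text begins
    hits = []
    for i in range(n - m + 1):
        pos = 0                # next pattern position to examine
        mismatches = 0
        while pos < m:
            pos += lce_naive(s, pos, base + i + pos)  # jump over the matching run
            if pos >= m:
                break
            mismatches += 1    # pattern[pos] differs from text[i + pos]
            if mismatches > k:
                break
            pos += 1           # step over the mismatching character
        if mismatches <= k:
            hits.append(i)
    return hits


print(kangaroo_k_mismatch("ABCAABCAC", "ABBAAC", 2))  # [0, 3]
```

Each location needs at most k+1 jumps, so replacing `lce_naive` with a constant-time query is what turns this loop into the O(k) per-location check.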

How do we do counting in less than O(nm)?

Let's start with binary strings

P = 0 1 0 1 1 0 1 1

T = 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1 ......

If we pad the pattern and the text with zeros, this is like a convolution of two vectors of length m+n

We can do this using FFT in O(n log n) time
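For binary strings the idea can be tried out with a small self-contained FFT. A minimal sketch under stated assumptions: the recursive radix-2 `fft`, the helper `correlate`, and the decomposition into two products (pattern 1 against text 0, and pattern 0 against text 1) are illustrative choices for this example, not taken from the slides.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two. The inverse is unscaled."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = -1 if invert else 1
    out = [0] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def correlate(p, t):
    """c[i] = sum_j p[j] * t[i + j] for every alignment i = 0 .. len(t) - len(p)."""
    m, n = len(p), len(t)
    size = 1
    while size < n + m:        # zero-pad to a power of two of length >= m + n
        size *= 2
    fa = fft(list(reversed(p)) + [0] * (size - m))
    fb = fft(list(t) + [0] * (size - n))
    prod = fft([x * y for x, y in zip(fa, fb)], invert=True)
    conv = [round(v.real / size) for v in prod]   # scale the inverse and round to integers
    return conv[m - 1:n]       # entries where the pattern lies fully inside the text


def binary_mismatches(p, t):
    """Hamming distance of binary pattern p against every window of binary text t."""
    ones_vs_zeros = correlate(p, [1 - x for x in t])   # pattern 1 over text 0
    zeros_vs_ones = correlate([1 - x for x in p], t)   # pattern 0 over text 1
    return [a + b for a, b in zip(ones_vs_zeros, zeros_vs_ones)]


P = [0, 1, 0, 1, 1, 0, 1, 1]
T = [0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
print(binary_mismatches(P, T))
```

Reversing the pattern turns the sliding inner product into an ordinary convolution, which is why a polynomial-multiplication routine is enough.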

And if the strings are not binary?

P = a b a c c a c b a c a b a c c

T = a b b b c c c a a a a b a c b ......

a-mask of P:

Pa = 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0

not-a mask of T:

Tnot a = 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1 ......

Multiply Pa and Tnot a to count mismatches using "a"

Boolean Convolutions (FFT) Method

Running Time: One boolean convolution - O(n log m) time

# of matches of all symbols - O(n |Σ| log m) time
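The per-symbol masks can be written out as below. A minimal sketch: to keep it short, each mask product is computed by a direct O(nm) sum (`correlate_direct`, a stand-in introduced here) where the method itself would use one FFT-based boolean convolution per symbol.

```python
def correlate_direct(p, t):
    """c[i] = sum_j p[j] * t[i + j]; direct stand-in for an FFT-based convolution."""
    m, n = len(p), len(t)
    return [sum(p[j] * t[i + j] for j in range(m)) for i in range(n - m + 1)]


def mismatches_by_symbol(pattern: str, text: str) -> list[int]:
    """Sum, over every pattern symbol a, of the product of the a-mask and the not-a mask."""
    total = [0] * max(0, len(text) - len(pattern) + 1)
    for a in set(pattern):
        p_a = [1 if ch == a else 0 for ch in pattern]     # a-mask of P
        t_not_a = [0 if ch == a else 1 for ch in text]    # not-a mask of T
        for i, c in enumerate(correlate_direct(p_a, t_not_a)):
            total[i] += c   # mismatches at alignment i where the pattern has a
    return total


print(mismatches_by_symbol("ABBAAC", "ABCAABCAC"))  # [2, 4, 6, 2]
```

Every pattern position carries exactly one symbol, so summing the per-symbol products counts each mismatch once.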


How do we do counting in less than O(nm)?

Let's count matches rather than mismatches

P = a b c d e b b d

T = b g d e f h d c c a b g h h ...

Keep one counter per text location.

For each character you have a list of offsets where it occurs in the pattern.

When you see the char in the text, you increment the appropriate counters.
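The counters scheme can be sketched as follows. The function name and the 0-based indexing are illustrative; the number of mismatches at a location is m minus its counter.

```python
from collections import defaultdict


def count_matches_with_counters(pattern: str, text: str) -> list[int]:
    """For every alignment i, the number of positions where pattern and text agree."""
    m, n = len(pattern), len(text)
    offsets = defaultdict(list)          # character -> offsets where it occurs in the pattern
    for j, ch in enumerate(pattern):
        offsets[ch].append(j)
    counters = [0] * max(0, n - m + 1)   # one counter per alignment
    for t, ch in enumerate(text):
        for j in offsets.get(ch, ()):
            i = t - j                    # alignment that puts pattern[j] on text[t]
            if 0 <= i <= n - m:
                counters[i] += 1
    return counters


print(count_matches_with_counters("ABBAAC", "ABCAABCAC"))  # [4, 2, 0, 4]
```

The work per text character is the length of that character's offset list, which is why the scheme is fast only when no character occurs often in the pattern.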


This is fast if all characters are “rare”

Partition the characters into rare and frequent

Rare: occurs ≤ c times in the pattern

For rare characters run this scheme with the counters

Takes O(nc) time

Frequent chars

You have at most m/c of them

Do a convolution for each

Total cost O((m/c) n log n).

Fix c: balance the two costs

nc = (m/c) n log n

c² = m log n

c = √(m log n)

Complexity: O(n √(m log n))

Back to the k-mismatch problem

Want to beat the O(nk) kangaroo bound

Frequent Symbol: a symbol that appears at least 2√k times in P.

Few (≤ √k) frequent symbols

Do the counters scheme for non-frequent symbols - O(n√k)

Convolve for each frequent symbol - O(n√k log n)

Many (≥ √k) frequent symbols

Intuition: There cannot be too many places where we match

- Consider √k frequent symbols.

- For each of them consider the first 2√k appearances.

Do the counters scheme just for these symbols and occurrences

Example of Masked Counting

k = 4, 2√k = 4

P = a b a c c a c b a c a b a c c

T = a b b b c c c a a a a b a c b ......

[Figure: the a-mask and the c-mask over P are applied along T, starting with the a-mask, and one counter per text location is incremented.]

Counting Stage:

Run through the text and count occurrences of all marks.

Time: O(n√k).

For location i of T, if the counter of location i is < k then there is no match at location i.

Why? The total # of elements in all masks is 2√k · √k = 2k.

Important Observations:

1) Sum of all counters ≤ 2√k · n

2) Every location whose counter is less than k already has more than k errors.

How many locations remain?

Sum of all counters: ≤ 2n√k

Value of potential matches: ≥ k

# of potential matches: ≤ 2n√k / k = 2n/√k

How do we check these locations?

Use the Kangaroo Method.

Kangaroo Method Time: O(k) per location

Overall Time: O((n/√k) · k) = O(n√k)
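The filter-then-verify structure for the case of many frequent symbols can be sketched as below. Several details are assumptions of this sketch rather than statements from the slides: √k is rounded up so that non-square k works, the filter threshold is written as (total marks − k), which equals k when k is a perfect square, and surviving locations are verified by direct comparison where the method would use the kangaroo check.

```python
import math
from collections import Counter, defaultdict


def k_mismatch_filter(pattern: str, text: str, k: int) -> list[int]:
    """k-mismatch locations via masked counting, then verification of the survivors."""
    m, n = len(pattern), len(text)
    r = max(1, math.ceil(math.sqrt(k)))      # sqrt(k), rounded up (assumption)
    freq = Counter(pattern)
    frequent = [a for a, count in freq.items() if count >= 2 * r]
    if len(frequent) < r:
        raise ValueError("fewer than sqrt(k) frequent symbols: use the other case")

    # Masks: the first 2*sqrt(k) occurrences of each of sqrt(k) frequent symbols.
    chosen = set(frequent[:r])
    marks = defaultdict(list)
    for j, ch in enumerate(pattern):
        if ch in chosen and len(marks[ch]) < 2 * r:
            marks[ch].append(j)
    total_marks = 2 * r * r                  # equals 2k when k is a perfect square

    # Counting stage: each text character increments at most 2*sqrt(k) counters.
    counters = [0] * max(0, n - m + 1)
    for t, ch in enumerate(text):
        for j in marks.get(ch, ()):
            if 0 <= t - j <= n - m:
                counters[t - j] += 1

    # A location with at most k mismatches must match at least total_marks - k marks.
    candidates = [i for i, count in enumerate(counters) if count >= total_marks - k]

    # Verification by direct comparison (stand-in for the O(k) kangaroo check).
    return [i for i in candidates
            if sum(pattern[j] != text[i + j] for j in range(m)) <= k]


P = "abaccacbacabacc"
T = "abbbcccaaaabacb" + P + "bb" + "abaccbcbacabbcc"
print(k_mismatch_filter(P, T, 4))
```

With k = 4 and this pattern, the frequent symbols are a and c, each contributing its first four occurrences, as in the masked counting example.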

Additional Points

Can reduce to O(n √(k log k))
