String Matching with Mismatches Some slides are stolen from Moshe Lewenstein (Bar Ilan University)
String Matching with Mismatches
Landau – Vishkin 1986
Galil – Giancarlo 1986
Abrahamson 1987
Amir – Lewenstein – Porat 2000
Approximate String Matching
problem: Find all text locations where distance from pattern is sufficiently small.
distance metric: HAMMING DISTANCE
Let S = s1s2…sm
R = r1r2…rm
Ham(S,R) = the number of locations j where sj ≠ rj
Example:
S = ABCABC
R = ABBAAC
Ham(S,R) = 2 (locations 3 and 5 differ)
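The definition can be sketched in a few lines of Python. The function name `hamming` is a label chosen here for illustration, not something from the slides.

```python
def hamming(s: str, r: str) -> int:
    """Number of locations j where s[j] != r[j]; both strings must have equal length."""
    if len(s) != len(r):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(s, r))


print(hamming("ABCABC", "ABBAAC"))  # 2, the value in the example above
```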
Problem 1: Counting mismatches
Input: T = t1 … tn
P = p1 … pm
Output: For each location i in T, Ham(P, Ti), where Ti = ti ti+1 … ti+m-1
Example:
P = A B B A A C
T = A B C A A B C A C…
Ham(P,T1) = 2
Ham(P,T2) = 4
Ham(P,T3) = 6
Ham(P,T4) = 2
Output: 2, 4, 6, 2, …
Problem 2: k-mismatches
Input: T = t1 … tn
P = p1 … pm
Output: Every location i where Ham(P, ti ti+1 … ti+m-1) ≤ k
Example: k = 2
P = A B B A A C
T = A B C A A B C A C…
Ham(P,Ti): 2, 4, 6, 2, …
Output: 1, 0, 0, 1, … (1 = location i is reported)
Naïve Algorithm (for the counting mismatches or k-mismatches problem)
- Go to each location i of the text and compute the Hamming distance of P and Ti
Running Time: O(nm), where n = |T|, m = |P|
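A minimal sketch of the naive algorithm for both problems. The function names are hypothetical, and locations are 0-based here while the slides count from 1.

```python
def count_mismatches_naive(text: str, pattern: str) -> list[int]:
    """Hamming distance between the pattern and every length-m window of the text."""
    n, m = len(text), len(pattern)
    return [
        sum(pattern[j] != text[i + j] for j in range(m))
        for i in range(n - m + 1)
    ]


def k_mismatch_naive(text: str, pattern: str, k: int) -> list[int]:
    """0-based text locations whose window is within k mismatches of the pattern."""
    return [i for i, d in enumerate(count_mismatches_naive(text, pattern)) if d <= k]


print(count_mismatches_naive("ABCAABCAC", "ABBAAC"))  # [2, 4, 6, 2]
print(k_mismatch_naive("ABCAABCAC", "ABBAAC", 2))     # [0, 3]
```

Both loops together touch every (location, pattern offset) pair once, which is where the O(nm) bound comes from.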
The Kangaroo Method (for k-mismatches)
Landau – Vishkin 1986
Galil – Giancarlo 1986
The Kangaroo Method (for k-mismatches)
- Create suffix tree (+ LCA) for: s = P#T
- Check P at each location i of T by kangarooing
- Do up to k LCP queries for every text location
Example:
P = A B A B A A B A C A B
T = A B B A C A B A B A B C A B B C A B C A …
(i = the text location being checked; each jump is one LCP query that skips to the next mismatch)
The Kangaroo Method (for k-mismatches)
Preprocess:
- Build suffix tree of both P and T: O(n+m) time
- LCA preprocessing: O(n+m) time
Check P at a given text location:
- Kangaroo jump to the next mismatch, until P is exhausted or more than k mismatches are found: O(k) time
Overall time: O(nk)
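The jumping logic can be sketched as follows. One loud assumption: the helper `lce` below answers each LCP query by direct character comparison, as a stand-in for the constant-time suffix tree + LCA query from the slides, so this sketch shows the control flow but does not achieve O(nk). The names `lce` and `k_mismatch_kangaroo` are hypothetical, and the separator is assumed not to occur in P or T.

```python
def lce(s: str, a: int, b: int) -> int:
    """Length of the longest common prefix of s[a:] and s[b:].

    Stand-in by direct comparison; the real method answers this in O(1)
    with a suffix tree plus LCA preprocessing.
    """
    length = 0
    while a + length < len(s) and b + length < len(s) and s[a + length] == s[b + length]:
        length += 1
    return length


def k_mismatch_kangaroo(text: str, pattern: str, k: int, sep: str = "#") -> list[int]:
    """0-based text locations where the pattern matches with at most k mismatches."""
    n, m = len(text), len(pattern)
    s = pattern + sep + text
    base = m + 1  # index in s where the text starts
    hits = []
    for i in range(n - m + 1):
        j = 0       # current offset into the pattern
        errors = 0
        while j < m:
            j += lce(s, j, base + i + j)  # jump over the matching run
            if j >= m:
                break                     # pattern exhausted
            errors += 1                   # offset j is a mismatch
            if errors > k:
                break
            j += 1                        # step past the mismatch
        if errors <= k:
            hits.append(i)
    return hits


print(k_mismatch_kangaroo("ABCAABCAC", "ABBAAC", 2))  # [0, 3]
```

Each location performs at most k+1 jumps, so with O(1) LCP queries the check costs O(k) per location.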
How do we do counting in less than O(nm)?
Let's start with binary strings
P = 0 1 0 1 1 0 1 1
T = 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1 ......
If we pad the pattern and the text with zeros, this is like a convolution of two vectors of length m+n
We can do this using FFT in O(n log n) time
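One way to spell out the convolution remark, as a sketch: it assumes the mismatch count is split into two sums, one for locations where P holds 1 and T holds 0, and one for the reverse. The reversed pattern p^R and the generic 0/1 vector x are notation introduced here, and the convolution is taken as (a * b)_k = Σ_{u+v=k} a_u b_v with indices starting at 1.

```latex
\mathrm{Ham}(P, T_i)
  = \sum_{j=1}^{m} p_j \,(1 - t_{i+j-1})
  + \sum_{j=1}^{m} (1 - p_j)\, t_{i+j-1}

\sum_{j=1}^{m} p_j \, x_{i+j-1} = (p^{R} * x)_{i+m},
\qquad p^{R}_{j} = p_{m+1-j}
```

Each of the two sums is therefore one entry of a convolution of the reversed pattern with a 0/1 text vector, so all locations i are obtained from two FFT-based convolutions.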
And if the strings are not binary?
P = a b a c c a c b a c a b a c c
a-mask: mark the locations of P that hold a
Pa = 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0
T = a b b b c c c a a a a b a c b ......
not-a mask: mark the locations of T that do not hold a
Tnot a = 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1
Multiply Pa and Tnot a (slide Pa along Tnot a and sum the products at each alignment) to count mismatches using “a”
Boolean Convolutions (FFT) Method
Running Time: One boolean convolution - O(n log m) time
# of matches of all symbols - O(n|Σ| log m) time
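A self-contained sketch of the per-symbol method, multiplying Pa with Tnot a for every symbol a of the pattern. Assumptions to note: the FFT is a plain recursive radix-2 version written here for illustration, and each convolution is run over the whole text at once, which gives O(n log n) per symbol rather than the O(n log m) bound stated on the slide. All function names are hypothetical.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two. Inverse is unnormalised."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def correlate(p_bits, t_bits):
    """c[i] = sum over j of p_bits[j] * t_bits[i + j], for every full alignment i."""
    m, n = len(p_bits), len(t_bits)
    size = 1
    while size < n + m:
        size *= 2
    fa = fft(list(reversed(p_bits)) + [0] * (size - m))  # reversed pattern, zero padded
    fb = fft(list(t_bits) + [0] * (size - n))            # text, zero padded
    prod = fft([x * y for x, y in zip(fa, fb)], invert=True)
    # Divide by size to normalise; alignment i sits at index i + m - 1.
    return [round(prod[i + m - 1].real / size) for i in range(n - m + 1)]


def count_mismatches_fft(text, pattern):
    """Mismatch count per alignment, summed over one convolution per pattern symbol."""
    n, m = len(text), len(pattern)
    result = [0] * (n - m + 1)
    for a in set(pattern):
        p_a = [1 if ch == a else 0 for ch in pattern]   # Pa
        t_not_a = [0 if ch == a else 1 for ch in text]  # Tnot a
        for i, c in enumerate(correlate(p_a, t_not_a)):
            result[i] += c
    return result


print(count_mismatches_fft("ABCAABCAC", "ABBAAC"))  # [2, 4, 6, 2]
```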
How do we do counting in less than O(nm)?
Let's count matches rather than mismatches
P = a b c d e b b d
T = b g d e f h d c c a b g h h ...
counter: one per text location, incremented while the text is scanned
For each character you have a list of offsets where it occurs in the pattern.
When you see the char in the text, you increment the appropriate counters.
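The counters scheme might look like this in Python. The function name is hypothetical and locations are 0-based; the mismatch count at a location is m minus its counter.

```python
from collections import defaultdict


def count_matches_counters(text: str, pattern: str) -> list[int]:
    """Number of matching positions for every alignment of the pattern in the text."""
    n, m = len(text), len(pattern)
    # For each character, the list of offsets where it occurs in the pattern.
    offsets = defaultdict(list)
    for j, ch in enumerate(pattern):
        offsets[ch].append(j)

    counters = [0] * (n - m + 1)
    for pos, ch in enumerate(text):
        # A text character at pos matches pattern offset j in the alignment pos - j.
        for j in offsets.get(ch, ()):
            i = pos - j
            if 0 <= i <= n - m:
                counters[i] += 1
    return counters


print(count_matches_counters("ABCAABCAC", "ABBAAC"))  # [4, 2, 0, 4]
```

The work per text character equals the number of times that character occurs in the pattern, which is why the scheme is fast only when characters are rare.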
This is fast if all characters are “rare”
Partition the characters into rare and frequent
Rare: occurs ≤ c times in the pattern
For rare characters run this scheme with the counters
Takes O(nc) time
Frequent chars
You have at most m/c of them
Do a convolution for each
Total cost O((m/c) n log(n)).
Fix c:
nc = (m/c) n log(n)
c² = m log(n)
c = √(m log(n))
Complexity: O(n √(m log(n)))
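A sketch of the rare/frequent split with the balanced threshold. Two assumptions are made here: the default threshold uses log base 2, and the per-symbol step for frequent characters is done by direct counting as a stand-in for the FFT convolution, so only the structure of the algorithm is shown, not its running time. The name `count_matches_split` and the optional parameter `c` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def count_matches_split(text: str, pattern: str, c=None) -> list[int]:
    """Matches per alignment: counters for rare symbols, one pass per frequent symbol."""
    n, m = len(text), len(pattern)
    if c is None:
        # Balanced threshold c ~ sqrt(m log n); log base 2 is an assumption.
        c = max(1, math.isqrt(max(1, round(m * math.log2(max(n, 2))))))
    freq = Counter(pattern)
    matches = [0] * (n - m + 1)

    # Rare symbols (at most c occurrences in the pattern): counters scheme, O(nc).
    offsets = defaultdict(list)
    for j, ch in enumerate(pattern):
        if freq[ch] <= c:
            offsets[ch].append(j)
    for pos, ch in enumerate(text):
        for j in offsets.get(ch, ()):
            if 0 <= pos - j <= n - m:
                matches[pos - j] += 1

    # Frequent symbols (fewer than m/c of them): one correlation each.
    # Direct counting stands in for the FFT convolution here.
    for a, count in freq.items():
        if count > c:
            positions = [j for j, ch in enumerate(pattern) if ch == a]
            for i in range(n - m + 1):
                matches[i] += sum(text[i + j] == a for j in positions)
    return matches


print(count_matches_split("ABCAABCAC", "ABBAAC", c=2))  # [4, 2, 0, 4]
```

With c=2 in the call above, A (3 occurrences) is treated as frequent while B and C go through the counters.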
Back to the k-mismatch problem
Want to beat the O(nk) kangaroo bound
Frequent Symbol: a symbol that appears at least 2√k times in P.
Few (≤√k) frequent symbols
- Do the counters scheme for non-frequent: O(n√k)
- Convolve for each frequent: O(n√k log n)
Many (≥√k) frequent symbols
Intuition: There cannot be too many places where we match
- Consider √k frequent symbols.
- For each of them consider the first 2√k appearances.
Do the counters scheme just for these symbols and occurrences
Example of Masked Counting
k = 4, 2√k = 4
P = a b a c c a c b a c a b a c c
a-mask = a . a . . a . . a . . . . . . (first 4 occurrences of a)
c-mask = . . . c c . c . . c . . . . . (first 4 occurrences of c)
T = a b b b c c c a a a a b a c b ......
counter: one per text location, all initially 0
[Animation: the masks slide along T; each a or c read in T adds 1 to the counter of every location at which a mark of that symbol lands on it.]
Counting Stage:
Run through text and count occurrences of all marks.
Time: O(n√k).
For location i of T, if counter_i < k then no match at location i.
Why? The total # of elements in all masks is 2√k · √k = 2k.
Important Observations:
1) Sum of all counters ≤ 2√k · n
2) Every counter whose value is less than k already has more than k errors.
How many locations remain?
Sum of all counters: ≤ 2√k · n
Counter value at a potential match: ≥ k
# of potential matches: ≤ 2√k · n / k = 2n/√k
How do we check these locations?
Use the Kangaroo Method.
Kangaroo Method Time: O(k) per location
# of potential matches: O(n/√k)
Overall Time: O((n/√k) · k) = O(n√k)
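The filter-then-verify pipeline for the many-frequent-symbols case can be sketched as below. Assumptions, stated loudly: √k is rounded up when k is not a perfect square, so the masks hold 2·⌈√k⌉² marks and the filter threshold becomes (number of marks − k), which equals k in the perfect-square case of the slides; and surviving locations are verified by direct comparison as a stand-in for the kangaroo check. The function name is hypothetical.

```python
import math
from collections import Counter


def k_mismatch_many_frequent(text: str, pattern: str, k: int):
    """Return (candidates, hits): locations surviving the counter filter, and true matches."""
    if k < 1:
        raise ValueError("this sketch expects k >= 1")
    n, m = len(text), len(pattern)
    r = math.isqrt(k)
    if r * r < k:
        r += 1  # r = ceil(sqrt(k)); rounding up is an assumption of this sketch

    freq = Counter(pattern)
    frequent = [s for s in freq if freq[s] >= 2 * r][:r]  # pick sqrt(k) frequent symbols
    if len(frequent) < r:
        raise ValueError("fewer than sqrt(k) frequent symbols; use the other case")

    # Masks: the first 2*sqrt(k) appearances of each chosen symbol in the pattern.
    mask = {s: [] for s in frequent}
    for j, ch in enumerate(pattern):
        if ch in mask and len(mask[ch]) < 2 * r:
            mask[ch].append(j)
    total_marks = 2 * r * r  # equals 2k when k is a perfect square

    # Counting stage: every text character increments at most 2*sqrt(k) counters.
    counters = [0] * (n - m + 1)
    for pos, ch in enumerate(text):
        for j in mask.get(ch, ()):
            i = pos - j
            if 0 <= i <= n - m:
                counters[i] += 1

    # A location with fewer than total_marks - k matched marks has more than k errors.
    candidates = [i for i, c in enumerate(counters) if c >= total_marks - k]

    # Verification by direct comparison (stand-in for the O(k) kangaroo check).
    hits = [
        i for i in candidates
        if sum(pattern[j] != text[i + j] for j in range(m)) <= k
    ]
    return candidates, hits
```

For instance, with the pattern and text of the masked counting example and k = 4, the single alignment has 4 matched marks out of 8, so it survives the filter, and verification then rejects it because the full Hamming distance is 6.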
Additional Points
Can reduce to O(n √(k log k))