String Matching with Mismatches Some slides are stolen from Moshe Lewenstein (Bar Ilan University)

Jan 21, 2016


xena

Transcript
Page 1: String  Matching with Mismatches

String Matching with Mismatches

Some slides are stolen from Moshe Lewenstein (Bar Ilan University)

Page 2: String  Matching with Mismatches

String Matching with Mismatches

Landau – Vishkin 1986

Galil – Giancarlo 1986

Abrahamson 1987

Amir - Lewenstein - Porat 2000

Page 3: String  Matching with Mismatches

Approximate String Matching

problem: Find all text locations where distance from pattern is sufficiently small.

distance metric: HAMMING DISTANCE

Let S = s_1 s_2 … s_m

R = r_1 r_2 … r_m

Ham(S,R) = the number of locations j where s_j ≠ r_j

Example: S = ABCABC R = ABBAAC

Ham(S,R) = 2
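
The definition above can be written directly as a short function. This is only a minimal sketch; the name `hamming` is an illustrative choice, not something from the slides.

```python
def hamming(s: str, r: str) -> int:
    """Count the positions j where s[j] != r[j]."""
    if len(s) != len(r):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(1 for a, b in zip(s, r) if a != b)


print(hamming("ABCABC", "ABBAAC"))  # 2, as in the slide's example
```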

Page 4: String  Matching with Mismatches

Example:

P = A B B A A C
T = A B C A A B C A C…

Input: T = t_1 … t_n

P = p_1 … p_m

Output: For each location i in T, Ham(P, t_i t_{i+1} … t_{i+m-1})

Problem 1: Counting mismatches

Page 5: String  Matching with Mismatches

Example:

P = A B B A A C
T = A B C A A B C A C…
Output so far: 2

Ham(P, T_1) = 2, where T_i = t_i t_{i+1} … t_{i+m-1}

Input: T = t_1 … t_n

P = p_1 … p_m

Output: For each location i in T, Ham(P, t_i t_{i+1} … t_{i+m-1})

Counting mismatches

Page 6: String  Matching with Mismatches

Example:

P = A B B A A C
T = A B C A A B C A C…
Output so far: 2, 4

Ham(P, T_2) = 4

Input: T = t_1 … t_n

P = p_1 … p_m

Output: For each location i in T, Ham(P, t_i t_{i+1} … t_{i+m-1})

Counting mismatches

Page 7: String  Matching with Mismatches

Input: T = t_1 … t_n

P = p_1 … p_m

Output: For each location i in T, Ham(P, t_i t_{i+1} … t_{i+m-1})

Example:

P = A B B A A C
T = A B C A A B C A C…
Output so far: 2, 4, 6

Ham(P, T_3) = 6

Counting mismatches

Page 8: String  Matching with Mismatches

Example:

P = A B B A A C
T = A B C A A B C A C…
Output so far: 2, 4, 6, 2

Ham(P, T_4) = 2

Input: T = t_1 … t_n

P = p_1 … p_m

Output: For each location i in T, Ham(P, t_i t_{i+1} … t_{i+m-1})

Counting mismatches

Page 9: String  Matching with Mismatches

Example:

P = A B B A A C
T = A B C A A B C A C…
Output: 2, 4, 6, 2, …

Input: T = t_1 … t_n

P = p_1 … p_m

Output: For each location i in T, Ham(P, t_i t_{i+1} … t_{i+m-1})

Counting mismatches

Page 10: String  Matching with Mismatches

Example: k = 2

P = A B B A A C
T = A B C A A B C A C…
Mismatch counts: 2, 4, 6, 2, …

Problem 2: k-mismatches

Input: T = t_1 … t_n

P = p_1 … p_m

Output: Every i where Ham(P, t_i t_{i+1} … t_{i+m-1}) ≤ k

Page 11: String  Matching with Mismatches

Example: k = 2

P = A B B A A C
T = A B C A A B C A C…
Mismatch counts: 2, 4, 6, 2, …
At most k mismatches (1 = yes, 0 = no): 1, 0, 0, 1, …

Problem 2: k-mismatches

Input: T = t_1 … t_n

P = p_1 … p_m

Output: Every i where Ham(P, t_i t_{i+1} … t_{i+m-1}) ≤ k

Page 12: String  Matching with Mismatches

Naïve Algorithm (for counting mismatches or k-mismatches problem)

Running Time: O(nm), n = |T|, m = |P|

- Go to each location i of the text and compute the Hamming distance between P and T_i = t_i t_{i+1} … t_{i+m-1}
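
A minimal sketch of the naïve algorithm, covering both problems. The function names are illustrative, and positions are 0-based here, whereas the slides count from 1.

```python
from typing import List


def mismatch_counts(text: str, pattern: str) -> List[int]:
    """Problem 1: Hamming distance between the pattern and every length-m window, O(nm)."""
    n, m = len(text), len(pattern)
    counts = []
    for i in range(n - m + 1):
        counts.append(sum(1 for j in range(m) if text[i + j] != pattern[j]))
    return counts


def k_mismatch_positions(text: str, pattern: str, k: int) -> List[int]:
    """Problem 2: every window start whose Hamming distance is at most k."""
    return [i for i, d in enumerate(mismatch_counts(text, pattern)) if d <= k]


print(mismatch_counts("ABCAABCAC", "ABBAAC"))          # [2, 4, 6, 2]
print(k_mismatch_positions("ABCAABCAC", "ABBAAC", 2))  # [0, 3]
```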

Page 13: String  Matching with Mismatches

The Kangaroo Method (for k-mismatches)

Landau – Vishkin 1986

Galil – Giancarlo 1986

Page 14: String  Matching with Mismatches

The Kangaroo Method (for k-mismatches)

- Create suffix tree (+ LCA) for: s = P#T

- Check P at each location i of T by kangarooing

Example:

P = A B A B A A B A C A B
T = A B B A C A B A B A B C A B B C A B C A …

(i marks the text location currently being checked)

Page 21: String  Matching with Mismatches

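The Kangaroo Method (for k-mismatches)

- Create suffix tree for: s = P#T

- Do up to k + 1 LCP (longest common prefix) queries for every text location: each query jumps over a matching stretch and lands on the next mismatch

Example:

P = A B A B A A B A C A B
T = A B B A C A B A B A B C A B B C A B C A …

(i marks the text location currently being checked)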

Page 22: String  Matching with Mismatches

The Kangaroo Method (for k-mismatches)

Preprocess:

Build suffix tree of both P and T - O(n+m) time

LCA preprocessing - O(n+m) time

Check P at given text location:

Kangaroo jump to the next mismatch, at most k + 1 jumps - O(k) time

Overall time: O(nk)
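
The kangaroo jumps can be sketched as follows. Note the substitution: instead of a suffix tree with LCA queries, this sketch answers LCP queries with a suffix array, Kasai's LCP array and a sparse table, and the suffix array is built by plain sorting, so the preprocessing here is not linear time. All function names are illustrative, and the separator `#` is assumed not to occur in P or T.

```python
def build_lcp_oracle(s):
    """Return query(a, b) = length of the longest common prefix of s[a:] and s[b:]."""
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])  # naive suffix array (stand-in for a suffix tree)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    # Kasai's algorithm: lcp[r] = LCP of the suffixes ranked r-1 and r.
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    # Sparse table for range-minimum queries over lcp.
    table = [lcp]
    span = 1
    while 2 * span <= n:
        prev = table[-1]
        table.append([min(prev[r], prev[r + span]) for r in range(n - 2 * span + 1)])
        span *= 2

    def query(a, b):
        if a == b:
            return n - a
        lo, hi = sorted((rank[a], rank[b]))
        lo += 1  # the answer is min(lcp[lo+1 .. hi]) in the original ranks
        level = (hi - lo + 1).bit_length() - 1
        return min(table[level][lo], table[level][hi - (1 << level) + 1])

    return query


def k_mismatch_kangaroo(text, pattern, k, sep="#"):
    """All 0-based text locations where the pattern matches with at most k mismatches."""
    if sep in text or sep in pattern:
        raise ValueError("separator must not occur in the text or the pattern")
    n, m = len(text), len(pattern)
    lcp = build_lcp_oracle(pattern + sep + text)
    offset = m + 1  # index of t_1 inside P#T
    hits = []
    for i in range(n - m + 1):
        j = 0           # pattern positions already accounted for
        mismatches = 0
        while j < m and mismatches <= k:
            j += lcp(j, offset + i + j)  # kangaroo jump over the matching stretch
            if j < m:                    # landed on a mismatch: count it and step over it
                mismatches += 1
                j += 1
        if mismatches <= k:
            hits.append(i)
    return hits


print(k_mismatch_kangaroo("ABCAABCAC", "ABBAAC", 2))
```

Each location costs at most k + 1 oracle calls, which is where the O(nk) total on the slide comes from once the LCP queries take constant time.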

Page 23: String  Matching with Mismatches

How do we do counting in less than O(nm) ?

Page 24: String  Matching with Mismatches

Let's start with binary strings

P = 0 1 0 1 1 0 1 1

T = 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1 ......

If we pad the pattern and the text with zeros, counting at every alignment is like a convolution of two vectors of length m+n

We can do this using FFT in O(n log(n)) time
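
One way to make the convolution explicit (a worked identity under the slide's setting of binary P and T, with T_i the window starting at text location i):

```latex
\mathrm{Ham}(P, T_i)
  = \sum_{j=1}^{m} \Bigl[\, p_j \,(1 - t_{i+j-1}) + (1 - p_j)\, t_{i+j-1} \Bigr]
```

Each of the two sums is a correlation of a pattern vector with a text vector. Reversing the pattern vector turns it into an ordinary convolution, which the FFT evaluates for all i at once.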

Page 25: String  Matching with Mismatches

And if the strings are not binary?

P = a b a c c a c b a c a b a c c

T = a b b b c c c a a a a b a c b ......

Page 26: String  Matching with Mismatches

P = a b a c c a c b a c a b a c c

a-mask = a . a . . a . . a . a . a . .

T = a b b b c c c a a a a b a c b ......

Page 27: String  Matching with Mismatches

P = a b a c c a c b a c a b a c c

a-mask = a . a . . a . . a . a . a . .

P_a = 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0

T = a b b b c c c a a a a b a c b ......

Page 29: String  Matching with Mismatches

P = a b a c c a c b a c a b a c c

P_a = 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0

T = a b b b c c c a a a a b a c b ......

not-a mask = . b b b c c c . . . . b . c b ......

Page 30: String  Matching with Mismatches

P = a b a c c a c b a c a b a c c

P_a = 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0

T = a b b b c c c a a a a b a c b ......

not-a mask = . b b b c c c . . . . b . c b ......

T_not-a = 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1

Page 32: String  Matching with Mismatches

P = a b a c c a c b a c a b a c c

P_a = 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0

T = a b b b c c c a a a a b a c b ......

T_not-a = 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1

Multiply P_a and T_not-a at every alignment to count the mismatches that involve an “a” in the pattern

P_a = 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0

T_not-a = 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1 ... ...

Page 34: String  Matching with Mismatches

Boolean Convolutions (FFT) Method

Page 35: String  Matching with Mismatches

Boolean Convolutions (FFT) Method

Running Time: One boolean convolution - O(n log m) time (split T into overlapping blocks of length 2m and convolve each block separately)

# of matches of all symbols - O(n|Σ| log m) time
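
A sketch of the per-symbol method, assuming nothing beyond the standard library: a small recursive FFT written out by hand, one convolution of the reversed pattern mask with the "not this symbol" text mask per pattern symbol. The function names are illustrative. For simplicity each convolution runs over the whole text, so this version costs O(|Σ| n log n) rather than the O(n|Σ| log m) obtained by splitting the text into blocks.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def convolve(x, y):
    """Integer convolution of two 0/1 lists via FFT."""
    size = 1
    while size < len(x) + len(y) - 1:
        size *= 2
    fx = fft(x + [0] * (size - len(x)))
    fy = fft(y + [0] * (size - len(y)))
    raw = fft([a * b for a, b in zip(fx, fy)], invert=True)
    return [round(v.real / size) for v in raw]  # the inverse transform needs the 1/size scaling


def mismatch_counts_fft(text, pattern):
    """Hamming distance between the pattern and every window of the text."""
    n, m = len(text), len(pattern)
    if m == 0 or n < m:
        return []
    totals = [0] * (n - m + 1)
    for symbol in set(pattern):
        # Reversing the pattern mask turns the sliding dot product into a convolution.
        p_mask = [1 if ch == symbol else 0 for ch in reversed(pattern)]
        t_mask = [1 if ch != symbol else 0 for ch in text]
        conv = convolve(p_mask, t_mask)
        for i in range(n - m + 1):
            totals[i] += conv[i + m - 1]
    return totals


print(mismatch_counts_fft("ABCAABCAC", "ABBAAC"))  # [2, 4, 6, 2]
```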

Page 36: String  Matching with Mismatches

How do we do counting in less than O(nm) ?

Let's count matches rather than mismatches

Page 37: String  Matching with Mismatches

P = a b c d e b b d

T = b g d e f h d c c a b g h h ...

counter = ... (one counter per text location; each match increments it)

For each character you have a list of offsets where it occurs in the pattern.

When you see the character in the text, you increment the appropriate counters.
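
The counters scheme can be sketched as below. The function name and the 0-based indexing are illustrative choices. The mismatch count at a location is then m minus its counter.

```python
from collections import defaultdict


def match_counts_by_counters(text, pattern):
    """Number of matching positions for every alignment of the pattern in the text."""
    n, m = len(text), len(pattern)
    # For each character, the list of offsets where it occurs in the pattern.
    offsets = defaultdict(list)
    for j, ch in enumerate(pattern):
        offsets[ch].append(j)
    counters = [0] * max(n - m + 1, 0)
    for i, ch in enumerate(text):
        # Text position i matches pattern offset j in the alignment that starts at i - j.
        for j in offsets.get(ch, ()):
            start = i - j
            if 0 <= start <= n - m:
                counters[start] += 1
    return counters


matches = match_counts_by_counters("bgdefhdccabghh", "abcdebbd")
print(matches)
print([len("abcdebbd") - c for c in matches])  # mismatch counts
```

The inner loop runs once per pattern occurrence of the text character, which is why the scheme is only fast when every character occurs few times in the pattern.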

Page 39: String  Matching with Mismatches

P = a b c d e b b d

T = b g d e f h d c c a b g h h ...

counter = ... (one counter per text location; each match increments it)

This is fast if all characters are “rare”

Page 40: String  Matching with Mismatches

Partition the characters into rare and frequent

Rare: occurs ≤ c times in the pattern

For rare characters run this scheme with the counters

Each text character increments at most c counters, so this takes O(nc) time

Page 41: String  Matching with Mismatches

Frequent chars

You have at most m/c of them

Do a convolution for each

Total cost O((m/c) n log(n)).
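
The partition itself is a simple counting step. A minimal sketch, with an illustrative function name and the threshold c passed in as `threshold`:

```python
from collections import Counter


def split_rare_frequent(pattern, threshold):
    """Rare characters occur at most `threshold` times in the pattern; the rest are frequent."""
    counts = Counter(pattern)
    rare = {ch for ch, cnt in counts.items() if cnt <= threshold}
    frequent = set(counts) - rare
    return rare, frequent


rare, frequent = split_rare_frequent("abaccacbacabacc", 3)
print(sorted(rare), sorted(frequent))  # ['b'] ['a', 'c']
```

Since every frequent character occurs more than c times and the pattern has m positions, there can be at most m/c frequent characters (here m = 15, c = 3, and there are 2).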

Page 42: String  Matching with Mismatches

Fix c so that the two costs balance:

nc = (m/c) n log(n)

c² = m log(n)

c = √(m log(n))

Complexity: O(n √(m log(n)))

Page 43: String  Matching with Mismatches

Frequent Symbol: a symbol that appears at least 2√k times in P.

Back to the k-mismatch problem

Want to beat the O(nk) kangaroo bound

Page 44: String  Matching with Mismatches

Few (≤ √k) frequent symbols

Do the counters scheme for the non-frequent symbols: O(n√k)

Convolve for each frequent symbol: O(n√k log n)

Page 45: String  Matching with Mismatches

Many (≥ √k) frequent symbols

Intuition: There cannot be too many places where we match

Page 46: String  Matching with Mismatches

Many (≥ √k) frequent symbols

- Consider √k frequent symbols.

- For each of them consider the first 2√k appearances.

Do the counters scheme just for these symbols and occurrences

Page 47: String  Matching with Mismatches

k = 4, 2√k = 4

P = a b a c c a c b a c a b a c c

a-mask = a . a . . a . . a . . . . . . (first 4 appearances of a)

c-mask = . . . c c . c . . c . . . . . (first 4 appearances of c)

T = a b b b c c c a a a a b a c b ......

use a-mask

Page 48: String  Matching with Mismatches

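Example of Masked Counting

k = 4, 2√k = 4

P = a b a c c a c b a c a b a c c

a-mask = a . a . . a . . a . . . . . . (first 4 appearances of a)

c-mask = . . . c c . c . . c . . . . . (first 4 appearances of c)

T = a b b b c c c a a a a b a c b ......

counter = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ......

[Figure: the masked pattern is aligned under T; each marked symbol that meets the same symbol in the text adds 1 to the counter of that alignment.]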

Page 49: String  Matching with Mismatches

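Example of Masked Counting

k = 4, 2√k = 4

P = a b a c c a c b a c a b a c c

a-mask = a . a . . a . . a . . . . . . (first 4 appearances of a)

c-mask = . . . c c . c . . c . . . . . (first 4 appearances of c)

T = a b b b c c c a a a a b a c b ......

counter = 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 ......

[Figure: the masked pattern is aligned under T; each marked symbol that meets the same symbol in the text adds 1 to the counter of that alignment.]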

Page 50: String  Matching with Mismatches

Counting Stage:

Run through the text and count occurrences of all marks.

Time: O(n√k).

For location i of T, if counter_i < k then there is no match at location i.

Why? The total # of elements in all masks is 2√k · √k = 2k.

Important Observations:

1) Sum of all counters ≤ 2√k · n

2) Every location whose counter is less than k already has more than k errors.

Page 51: String  Matching with Mismatches

How many locations remain?

Sum of all counters: ≤ 2n√k

Counter value of a potential match: ≥ k

# of potential matches: ≤ 2n√k / k = 2n/√k

How do we check these locations?

Use the Kangaroo Method.

Kangaroo Method Time: O(k) per location

Overall Time: O((n/√k) · k) = O(n√k)
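
The filter-then-verify idea for the case of many frequent symbols can be sketched as follows. Two assumptions are made for illustration: √k is rounded up to an integer r, so the masks hold 2r² ≥ 2k marks and the filter threshold becomes (number of marks) − k, which equals k when k is a perfect square; and the surviving locations are verified by a direct scan with early exit, standing in for the kangaroo method. The function name is illustrative.

```python
import math
from collections import Counter, defaultdict


def k_mismatch_many_frequent(text, pattern, k):
    """k-mismatch locations (0-based) when the pattern has at least sqrt(k) frequent symbols."""
    if k < 1:
        raise ValueError("k must be at least 1")
    n, m = len(text), len(pattern)
    r = math.isqrt(k - 1) + 1  # ceil(sqrt(k))
    frequent = [ch for ch, cnt in Counter(pattern).items() if cnt >= 2 * r]
    if len(frequent) < r:
        raise ValueError("too few frequent symbols: the convolution case applies instead")
    chosen = set(frequent[:r])
    # Masks: the first 2r appearances of each chosen symbol.
    marks = defaultdict(list)
    for j, ch in enumerate(pattern):
        if ch in chosen and len(marks[ch]) < 2 * r:
            marks[ch].append(j)
    mask_size = 2 * r * r
    # Counting stage: at most 2r increments per text character.
    counters = [0] * max(n - m + 1, 0)
    for i, ch in enumerate(text):
        for j in marks.get(ch, ()):
            if 0 <= i - j <= n - m:
                counters[i - j] += 1
    hits = []
    for i, c in enumerate(counters):
        if c < mask_size - k:
            continue  # more than k marked positions already mismatch
        mismatches = 0
        for j in range(m):  # stand-in for the O(k) kangaroo check
            if text[i + j] != pattern[j]:
                mismatches += 1
                if mismatches > k:
                    break
        if mismatches <= k:
            hits.append(i)
    return hits


pattern = "abaccacbacabacc"
text = "abaccacbacabaccabbbcccaaaabacbabaccacbbcabacc"
print(k_mismatch_many_frequent(text, pattern, 4))
```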

Page 52: String  Matching with Mismatches

Additional Points

Can reduce to O(n √(k log k))