Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University

Feb 06, 2016

Page 1: Survey: String  Matching with k Mismatches

Survey: String Matching with k Mismatches

Moshe Lewenstein Bar Ilan University

Page 2: Survey: String  Matching with k Mismatches

String Matching with k Mismatches

Landau – Vishkin 1986

Galil – Giancarlo 1986

Abrahamson 1987

Amir - Lewenstein - Porat 2000

Page 3: Survey: String  Matching with k Mismatches

Exact String Matching

Input: T = t1 . . . tn

P = p1 … pm

Output: All locations i of T where P appears

Example:

P = A B C A A B
T = A B A B C A A B C A A B C A A B A A…

Page 4: Survey: String  Matching with k Mismatches

Input: T = t1 . . . tn

P = p1 … pm

Output: All locations i of T where P appears

Example:

P = A B C A A B
T = A B A B C A A B C A A B C A A B A A…
Locations found so far: 3

Exact String Matching

Page 5: Survey: String  Matching with k Mismatches

Input: T = t1 . . . tn

P = p1 … pm

Output: All locations i of T where P appears

Example:

P = A B C A A B
T = A B A B C A A B C A A B C A A B A A…
Locations found so far: 3, 7

Exact String Matching

Page 6: Survey: String  Matching with k Mismatches

Input: T = t1 . . . tn

P = p1 … pm

Output: All locations i of T where P appears

Example:

P = A B C A A B
T = A B A B C A A B C A A B C A A B A A…
Locations found so far: 3, 7, 11

Exact String Matching

Page 7: Survey: String  Matching with k Mismatches

Input: T = t1 . . . tn

P = p1 … pm

Output: All locations i of T where P appears

Example:

P = A B C A A B
T = A B A B C A A B C A A B C A A B A A…

Answer: {3,7,11,..}
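
The example above can be reproduced with a direct scan. This is only a minimal sketch: the function name `exact_matches` is illustrative, locations are 1-based as on the slides, and the scan runs in O(nm) time rather than with a linear-time matcher.

```python
def exact_matches(text, pattern):
    """Return all 1-based locations i of text where pattern appears exactly."""
    m = len(pattern)
    return [i + 1
            for i in range(len(text) - m + 1)
            if text[i:i + m] == pattern]


# The visible part of the text from the slide example.
print(exact_matches("ABABCAABCAABCAABAA", "ABCAAB"))
```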

Exact String Matching

Page 8: Survey: String  Matching with k Mismatches

Exact String Matching

Problem: Matching is not exact in applications such as:

• Computational Biology

• Musicology

• Text Editing

• Meteorology

• etc.

Need other definitions of string matching!

Page 9: Survey: String  Matching with k Mismatches

Approximate String Matching

Idea: Find all text locations where distance from pattern is sufficiently small.

distance metric: HAMMING DISTANCE

Let S = s1s2…sm

R = r1r2…rm

Ham(S,R) = the number of locations j where sj ≠ rj

Example: S = ABCABC, R = ABBAAC

Ham(S,R) = 2
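
As a small illustration of the definition, a Python sketch of the Hamming distance (the name `hamming` is an assumption made for this example):

```python
def hamming(s, r):
    """Number of positions j at which the equal-length strings s and r differ."""
    if len(s) != len(r):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(1 for a, b in zip(s, r) if a != b)


print(hamming("ABCABC", "ABBAAC"))
```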

Page 10: Survey: String  Matching with k Mismatches

String Matching with Mismatches

Input: T = t1 . . . tn

P = p1 … pm

Output: For each location i in T, Ham(P, ti ti+1 … ti+m-1)

Example:

P = A B B A A C
T = A B C A A B C A C…

Page 11: Survey: String  Matching with k Mismatches

String Matching with Mismatches

Input: T = t1 . . . tn

P = p1 … pm

Output: For each location i in T, Ham(P, ti ti+1 … ti+m-1)

Example:

P = A B B A A C
T = A B C A A B C A C…
Distances: 2

Ham(P,T1) = 2

Page 12: Survey: String  Matching with k Mismatches

String Matching with Mismatches

Input: T = t1 . . . tn

P = p1 … pm

Output: For each location i in T, Ham(P, ti ti+1 … ti+m-1)

Example:

P = A B B A A C
T = A B C A A B C A C…
Distances: 2, 4

Ham(P,T2) = 4

Page 13: Survey: String  Matching with k Mismatches

String Matching with Mismatches

Input: T = t1 . . . tn

P = p1 … pm

Output: For each location i in T, Ham(P, ti ti+1 … ti+m-1)

Example:

P = A B B A A C
T = A B C A A B C A C…
Distances: 2, 4, 6

Ham(P,T3) = 6

Page 14: Survey: String  Matching with k Mismatches

String Matching with Mismatches

Input: T = t1 . . . tn

P = p1 … pm

Output: For each location i in T, Ham(P, ti ti+1 … ti+m-1)

Example:

P = A B B A A C
T = A B C A A B C A C…
Distances: 2, 4, 6, 2

Ham(P,T4) = 2

Page 15: Survey: String  Matching with k Mismatches

String Matching with Mismatches

Input: T = t1 . . . tn

P = p1 … pm

Output: For each location i in T, Ham(P, ti ti+1 … ti+m-1)

Example:

P = A B B A A C
T = A B C A A B C A C…
Distances: 2, 4, 6, 2, …

Page 16: Survey: String  Matching with k Mismatches

String Matching with k Mismatches

Input: T = t1 . . . tn, P = p1 … pm

Output: Every i in T s.t. Ham(P, ti ti+1 … ti+m-1) ≤ k

Example: k = 2

P = A B B A A C
T = A B C A A B C A C…
Distances: 2, 4, 6, 2, …

Page 18: Survey: String  Matching with k Mismatches

String Matching with k Mismatches

Input: T = t1 . . . tn, P = p1 … pm

Output: Every i in T s.t. Ham(P, ti ti+1 … ti+m-1) ≤ k

Example: k = 2

P = A B B A A C
T = A B C A A B C A C…
Distances: 2, 4, 6, 2, …
Within k: Y, N, N, Y, …

Page 19: Survey: String  Matching with k Mismatches

Naïve Algorithm (for counting mismatches or k-mismatches problem)

Running Time: O(nm), n = |T|, m = |P|

- Go to each location i of the text and compute the Hamming distance of P and Ti
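
A minimal Python sketch of the naive O(nm) method, using the running example from the earlier slides. The function names are illustrative, not taken from the talk.

```python
def mismatch_counts(text, pattern):
    """Hamming distance between pattern and every length-m window of text."""
    m = len(pattern)
    return [sum(1 for a, b in zip(pattern, text[i:i + m]) if a != b)
            for i in range(len(text) - m + 1)]


def k_mismatch_locations(text, pattern, k):
    """1-based locations whose window has at most k mismatches."""
    return [i + 1
            for i, d in enumerate(mismatch_counts(text, pattern))
            if d <= k]


print(mismatch_counts("ABCAABCAC", "ABBAAC"))
print(k_mismatch_locations("ABCAABCAC", "ABBAAC", 2))
```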

Page 20: Survey: String  Matching with k Mismatches

The Kangaroo Method (for k-mismatches)

Landau – Vishkin 1986

Galil – Giancarlo 1986

Page 21: Survey: String  Matching with k Mismatches

Trie

• A tree representing a set of strings.

[Figure: trie storing the set of strings below]

{ aeef ad bbfe bbfg c }

Page 22: Survey: String  Matching with k Mismatches

Trie (Cont)

• Assume no string is a prefix of another

[Figure: trie of the set { aeef ad bbfe bbfg c }]

Each string corresponds to a leaf.
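
A small sketch of a trie for the set on the slides, assuming a nested-dictionary representation (one dictionary per node, keyed by the next character). The representation and the names `build_trie` and `leaves` are choices made for this example only.

```python
def build_trie(words):
    """Build a trie as nested dicts: each node maps a character to a child."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root


def leaves(node, prefix=""):
    """Spell out the string that leads to every leaf, in sorted order."""
    if not node:                      # no children: this node is a leaf
        return [prefix]
    found = []
    for ch in sorted(node):
        found.extend(leaves(node[ch], prefix + ch))
    return found


trie = build_trie(["aeef", "ad", "bbfe", "bbfg", "c"])
print(leaves(trie))
```

Because no string in the set is a prefix of another, the leaves are exactly the stored strings.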

Page 23: Survey: String  Matching with k Mismatches

Compressed Trie

• Compress unary nodes, label edges by strings

[Figure: the trie above next to its compressed version, whose edges are labeled a, bbf, c, eef, d, e, g]

Page 24: Survey: String  Matching with k Mismatches

Suffix tree

Suffix tree of string s: a compressed trie of all suffixes of s

Prefix-free: add a special character, say $, at the end of s

Page 25: Survey: String  Matching with k Mismatches

Suffix tree (Example)

Let s = abab. A suffix tree of s is a compressed trie of all suffixes of s = abab$

{ $ b$ ab$ bab$ abab$ }

[Figure: suffix tree of abab$ with edge labels ab, ab$, b and $]

Page 26: Survey: String  Matching with k Mismatches

Suffix Tree properties

- Succinct in space - O(n).

- Can be built in O(n) time. McCreight, Weiner, Ukkonen, Farach-Colton

[Figure: suffix tree of abab$ with leaves numbered 1 to 5]

Page 27: Survey: String  Matching with k Mismatches

Exact string matching

[Figure: suffix tree of abab$ with leaves numbered 1 to 5]

Given a pattern P = ab we traverse the tree according to the pattern.

s=abab$

Page 28: Survey: String  Matching with k Mismatches

Exact string matching

[Figure: suffix tree of abab$ with leaves numbered 1 to 5]

Leaves correspond to locations of appearance!

s = abab$, P = ab appears at locations 1 and 3

Page 29: Survey: String  Matching with k Mismatches

Exact string matching

[Figure: suffix tree of abab$ with leaves numbered 1 to 5]

Prepare Tree: O(n) time

Find matches: O(m + occ) time, occ = # of matches

s = abab$, P = ab appears at locations 1 and 3
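
The traversal can be sketched in Python as follows. Note the substitution: this sketch builds an uncompressed suffix trie in O(n²) time and space instead of a linear-size suffix tree, and each node stores the start positions of the suffixes passing through it rather than collecting them from the leaves. All names are illustrative.

```python
def build_suffix_trie(s):
    """Uncompressed trie of all suffixes of s + '$'.

    Each node records the 1-based start positions of the suffixes that
    pass through it, which plays the role of the leaves below the node.
    """
    s += "$"
    root = {"children": {}, "starts": []}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node["children"].setdefault(
                ch, {"children": {}, "starts": []})
            node["starts"].append(start + 1)
    return root


def find_occurrences(root, pattern):
    """Walk down the trie along pattern and report where it appears."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []
    return sorted(node["starts"])


tree = build_suffix_trie("abab")
print(find_occurrences(tree, "ab"))
```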

Page 30: Survey: String  Matching with k Mismatches

Lowest common ancestors

A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it

Page 31: Survey: String  Matching with k Mismatches

Why? The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes

s = abbaab$

[Figure: suffix tree of abbaab$ with leaves numbered 1 to 7]

Page 32: Survey: String  Matching with k Mismatches

Why? The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes

s = abbaab$, highlighted suffix: aab$

[Figure: suffix tree of abbaab$ with leaves numbered 1 to 7]

Page 33: Survey: String  Matching with k Mismatches

Why? The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes

s = abbaab$, highlighted suffixes: aab$ and abbaab$

[Figure: suffix tree of abbaab$ with leaves numbered 1 to 7]

Page 34: Survey: String  Matching with k Mismatches

Why? The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes

s = abbaab$, highlighted suffixes: aab$ and abbaab$

[Figure: suffix tree of abbaab$ with leaves numbered 1 to 7]

Page 35: Survey: String  Matching with k Mismatches

LCA/LCP properties

[Figure: suffix tree of abbaab$ with leaves numbered 1 to 7]

Preprocessing time: O(n)

Query Time: O(1)

Harel & Tarjan 1984, Schieber & Vishkin 1988, Berkman & Vishkin 1993
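
To make the LCP query concrete, here is a sketch that answers it by scanning characters. This is a stand-in only: it takes time proportional to the answer, whereas the LCA structures cited above answer the same query in O(1) after O(n) preprocessing. The function name and 1-based positions are assumptions of this example.

```python
def lcp_of_suffixes(s, i, j):
    """Length of the longest common prefix of the suffixes of s that
    start at 1-based positions i and j (naive character scan)."""
    a, b = i - 1, j - 1
    length = 0
    while (a + length < len(s) and b + length < len(s)
           and s[a + length] == s[b + length]):
        length += 1
    return length


# Suffixes aab$ (position 4) and abbaab$ (position 1) of s = abbaab$.
print(lcp_of_suffixes("abbaab$", 4, 1))
```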

Page 36: Survey: String  Matching with k Mismatches

The Kangaroo Method (for k-mismatches)

- Create suffix tree for: s = P#T

- Check P at each location i of T by kangarooing

Example:

P = A B A B A A B A C A B
T = A B B A C A B A B A B C A B B C A B C A …

[Figure: P aligned under T at text location i]

Page 45: Survey: String  Matching with k Mismatches

The Kangaroo Method (for k-mismatches)

Preprocess:

Build suffix tree of both P and T - O(n+m) time
LCA preprocessing - O(n+m) time

Check P at given text location

Kangaroo jump till next mismatch - O(k) time

Overall time: O(nk)
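
The jumping step can be sketched in Python. One loud assumption: instead of a suffix tree with LCA preprocessing, the sketch fills an O(nm) table of longest-common-prefix lengths by dynamic programming. The table gives the same O(1) query that the kangaroo jumps rely on, but its preprocessing cost is not the O(n+m) stated above. All names are illustrative.

```python
def build_lce_table(pattern, text):
    """lce[j][i] = length of the longest common prefix of pattern[j:] and
    text[i:]. Stand-in for suffix tree + LCA queries (O(nm) to build)."""
    m, n = len(pattern), len(text)
    lce = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(m - 1, -1, -1):
        for i in range(n - 1, -1, -1):
            if pattern[j] == text[i]:
                lce[j][i] = lce[j + 1][i + 1] + 1
    return lce


def kangaroo_k_mismatches(text, pattern, k):
    """1-based text locations where pattern matches with at most k mismatches."""
    n, m = len(text), len(pattern)
    lce = build_lce_table(pattern, text)
    hits = []
    for i in range(n - m + 1):
        j = 0                       # current offset inside the pattern
        mismatches = 0
        while True:
            j += lce[j][i + j]      # kangaroo jump over the matching stretch
            if j >= m:
                break               # reached the end of the pattern
            mismatches += 1         # offset j is a mismatch
            if mismatches > k:
                break               # give up after k + 1 mismatches
            j += 1                  # step over the mismatch
        if mismatches <= k:
            hits.append(i + 1)
    return hits


print(kangaroo_k_mismatches("ABCAABCAC", "ABBAAC", 2))
```

Each location costs at most k + 1 jumps, which is where the O(k) check per location comes from.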

Page 46: Survey: String  Matching with k Mismatches

Boolean Convolutions (FFT) Method

P = a b a c c a c b a c a b a c c

T = a b b b c c c a a a a b a c b ......

Page 47: Survey: String  Matching with k Mismatches

Boolean Convolutions (FFT) Method

P = a b a c c a c b a c a b a c c

a-mask: a b a c c a c b a c a b a c c (the a's are marked)

T = a b b b c c c a a a a b a c b ......

Page 48: Survey: String  Matching with k Mismatches

Boolean Convolutions (FFT) Method

P = a b a c c a c b a c a b a c c

a-mask: a b a c c a c b a c a b a c c (the a's are marked)

Pa = 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0

T = a b b b c c c a a a a b a c b ......

Page 49: Survey: String  Matching with k Mismatches

Boolean Convolutions (FFT) Method

P = a b a c c a c b a c a b a c c

Pa = 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0

T = a b b b c c c a a a a b a c b ......

Page 50: Survey: String  Matching with k Mismatches

Boolean Convolutions (FFT) Method

P = a b a c c a c b a c a b a c c

Pa = 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0

T = a b b b c c c a a a a b a c b ......

not-a mask: a b b b c c c a a a a b a c b ...... (the symbols other than a are marked)

Page 51: Survey: String  Matching with k Mismatches

Boolean Convolutions (FFT) Method

P = a b a c c a c b a c a b a c c

Pa = 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0

T = a b b b c c c a a a a b a c b ......

not-a mask: a b b b c c c a a a a b a c b ...... (the symbols other than a are marked)

Tnot-a = 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1

Page 52: Survey: String  Matching with k Mismatches

Boolean Convolutions (FFT) Method

P = a b a c c a c b a c a b a c c

Pa = 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0

T = a b b b c c c a a a a b a c b ......

Tnot-a = 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1

Page 53: Survey: String  Matching with k Mismatches

Boolean Convolutions (FFT) Method

P = a b a c c a c b a c a b a c c

Pa = 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0

T = a b b b c c c a a a a b a c b ......

Tnot-a = 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1

Multiply Pa and Tnot-a to count mismatches (use FFT)

Tnot-a = 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1 ...
Pa     = 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0

Page 55: Survey: String  Matching with k Mismatches

Boolean Convolutions (FFT) Method

Page 56: Survey: String  Matching with k Mismatches

Boolean Convolutions (FFT) Method

Running Time: One boolean convolution - O(n log m) time

# of matches of all symbols - O(n |Σ| log m) time
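
The per-symbol convolution can be sketched in pure Python. Assumptions of this sketch: it uses a textbook recursive radix-2 FFT, and it runs one full-length transform per pattern symbol, so it costs O(|Σ| n log n) rather than the O(n log m) per convolution quoted above (that bound needs the text to be cut into blocks of length about 2m). Function names are illustrative.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def correlate(text_bits, pat_bits):
    """c[i] = sum over j of pat_bits[j] * text_bits[i + j], for every i."""
    n, m = len(text_bits), len(pat_bits)
    size = 1
    while size < n + m:
        size *= 2
    fa = fft([complex(x) for x in text_bits] + [0j] * (size - n))
    fb = fft([complex(x) for x in reversed(pat_bits)] + [0j] * (size - m))
    prod = fft([x * y for x, y in zip(fa, fb)], invert=True)
    # Index m - 1 + i of the product corresponds to alignment i.
    return [round(prod[m - 1 + i].real / size) for i in range(n - m + 1)]


def mismatches_fft(text, pattern):
    """Number of mismatches at every alignment, one convolution per symbol."""
    total = [0] * (len(text) - len(pattern) + 1)
    for sym in set(pattern):
        pat_mask = [1 if c == sym else 0 for c in pattern]      # P_sym
        text_not = [0 if c == sym else 1 for c in text]         # T_not-sym
        for i, count in enumerate(correlate(text_not, pat_mask)):
            total[i] += count
    return total


print(mismatches_fft("ABCAABCAC", "ABBAAC"))
```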

Page 57: Survey: String  Matching with k Mismatches

Counting Method

Input: Text: T = t1…tn

Pattern: P = p1…pm

Max # of allowed mismatches: k

Assumption: Each pattern element is distinct

P = a b c d e f g h

T = b g d e f h d c c a b g h h ...

Count matches (instead of mismatches)

[Figure: each text symbol increments the counter of the alignment at which it matches the pattern]
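
Under the assumption that every pattern symbol is distinct, each text symbol can match the pattern at only one alignment, so a single pass over the text suffices. A minimal Python sketch, with illustrative names and 0-based counter indices:

```python
def count_matches_distinct(text, pattern):
    """counter[i] = number of matches when pattern is aligned at text[i:].

    Requires all pattern symbols to be distinct, so each text symbol
    increments at most one counter.
    """
    position = {c: j for j, c in enumerate(pattern)}
    if len(position) != len(pattern):
        raise ValueError("pattern symbols must be distinct")
    n, m = len(text), len(pattern)
    counter = [0] * (n - m + 1)
    for i, c in enumerate(text):
        j = position.get(c)
        # text[i] matches pattern[j] exactly when the alignment is i - j.
        if j is not None and 0 <= i - j <= n - m:
            counter[i - j] += 1
    return counter


print(count_matches_distinct("bgdefhdccabghh", "abcdefgh"))
```

An alignment i has at most k mismatches exactly when counter[i] is at least m - k.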

Page 58: Survey: String  Matching with k Mismatches

O(n√k log m) Algorithm

Frequent Symbol: a symbol that appears at least 2√k times in P.

We distinguish between two cases:

Case 1: At least √k frequent symbols.

Case 2: Less than √k frequent symbols.

Case 1: At least √k frequent symbols.

- Consider first √k frequent symbols.

- For each of them construct a mask for first 2√k appearances.

Page 59: Survey: String  Matching with k Mismatches

Example of Masked Counting

k = 4, 2√k = 4

P = a b a c c a c b a c a b a c c

a-mask, c-mask (first 4 appearances of a and of c in P are marked)

T = a b b b c c c a a a a b a c b ......

use a-mask

Page 60: Survey: String  Matching with k Mismatches

Example of Masked Counting

k = 4, 2√k = 4

P = a b a c c a c b a c a b a c c

a-mask, c-mask (first 4 appearances of a and of c in P are marked)

T = a b b b c c c a a a a b a c b ......

[Figure: P aligned under T; a matching mask symbol adds 1 to the counter of that alignment]

counter: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ......

Page 61: Survey: String  Matching with k Mismatches

Example of Masked Counting

k = 4, 2√k = 4

P = a b a c c a c b a c a b a c c

a-mask, c-mask (first 4 appearances of a and of c in P are marked)

T = a b b b c c c a a a a b a c b ......

[Figure: P aligned under T; a matching mask symbol adds 1 to the counter of that alignment]

counter: 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 ......

Page 62: Survey: String  Matching with k Mismatches

Counting Stage:

Run through text and count occurrences of all marks.

Time: O(n√k).

For location i of T, if counter_i < k then no match at location i.

Why? The total # of elements in all masks is 2√k · √k = 2k.

Important Observations:

1) Sum of all counters ≤ 2√k n
2) Every counter whose value is less than k already has more than k errors.

Page 63: Survey: String  Matching with k Mismatches

How many locations remain?

Sum of all counters: ≤ 2n√k

Value of potential matches: ≥ k

# of potential matches: ≤ 2n√k / k = 2n/√k

How do we check these locations?

Use The Kangaroo Method.

Kangaroo Method Time: O(k) per location

Overall Time: O((n/√k) · k) = O(n√k)
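
Case 1 can be sketched end to end in Python. Assumptions made for this example: √k is rounded up to an integer s, so the masks hold 2s² ≥ 2k positions and a location survives only if its counter is at least 2s² - k (for perfect squares this is the threshold k used above); surviving locations are verified by a direct scan, standing in for the kangaroo method. Names are illustrative and locations are 1-based.

```python
import math
from collections import defaultdict


def k_mismatches_case1(text, pattern, k):
    """Masked counting followed by verification (Case 1 only, k >= 1)."""
    s = math.isqrt(k - 1) + 1                 # ceil(sqrt(k))
    occurrences = defaultdict(list)
    for j, c in enumerate(pattern):
        occurrences[c].append(j)
    frequent = [c for c in occurrences if len(occurrences[c]) >= 2 * s]
    if len(frequent) < s:
        raise ValueError("Case 1 needs at least ceil(sqrt(k)) frequent symbols")
    # Mask of the first 2s appearances for each of the first s frequent symbols.
    mask = {c: occurrences[c][:2 * s] for c in frequent[:s]}
    mask_size = 2 * s * s

    n, m = len(text), len(pattern)
    counter = [0] * (n - m + 1)
    for i, c in enumerate(text):              # counting stage
        for j in mask.get(c, ()):
            if 0 <= i - j <= n - m:
                counter[i - j] += 1

    result = []
    for i, count in enumerate(counter):
        if count < mask_size - k:
            continue                          # more than k mask positions mismatch
        # Direct verification; the slides use the kangaroo method here.
        if sum(1 for a, b in zip(pattern, text[i:i + m]) if a != b) <= k:
            result.append(i + 1)
    return result


text = "cc" + "abaccacbacabacc" + "ab"
print(k_mismatches_case1(text, "abaccacbacabacc", 4))
```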

Page 64: Survey: String  Matching with k Mismatches

Case 2: x frequent symbols, x < √k

a) Count all matches of frequent symbols - one boolean convolution per symbol.

Time: O(x n log m) = O(√k n log m)

b) For non-frequent symbols, build full masks.

Symbol non-frequent ⇒ appears < 2√k times in P ⇒ mask size < 2√k

Count time: O(n√k)

Page 65: Survey: String  Matching with k Mismatches

c) Add results of a) & b) and get total number of matches at every text location.

Time:
a) O(n√k log m)
b) O(n√k)
c) O(n)

So, Case 2 is O(n√k log m)

Overall Algo. Time: O(n√k log m)

Page 66: Survey: String  Matching with k Mismatches

Additional Points

1. O(n√k log k)

For k < m^(1/3) there is a linear time algorithm - O((n + nk³/m) log k)

2. O(n√(k log k))

Better tradeoff:

Define frequent symbol: > √(k log k) appearances

Page 67: Survey: String  Matching with k Mismatches

O((n + nk³/m) log k) time algorithm

Outline:

1. Find 2k special substrings of pattern.

2. Construct forest data structure combining info of special pattern substrings and text.

3. Use local counting arguments and quick queries to forest data structure to prune candidates.

4. Use kangaroo method to check leftover potential candidates.

Page 68: Survey: String  Matching with k Mismatches

k-Mismatches and Matrix Multiplication

“Or-And” matrix multiplication:

A × B = C, cij = ∨ (k = 1..n) (aik ∧ bkj)
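
For concreteness, a direct O(n³) Python sketch of the Or-And product of two n × n Boolean matrices (the function name is illustrative):

```python
def or_and_product(a, b):
    """c[i][j] = OR over k of (a[i][k] AND b[k][j]) for square Boolean matrices."""
    n = len(a)
    return [[any(a[i][k] and b[k][j] for k in range(n))
             for j in range(n)]
            for i in range(n)]


a = [[True, False],
     [False, False]]
b = [[False, True],
     [True, False]]
print(or_and_product(a, b))
```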

Pattern all-mismatch problem: Find all text locations where the pattern mismatches at every character.

Indyk: If there is an algorithm faster than O(n√m) for the Pattern all-mismatch problem, then there is a new method for solving “Or-And” matrix multiplication faster than O(n³)

Page 69: Survey: String  Matching with k Mismatches

OPEN PROBLEMS

Hamming Distance in time: O(n log m)

Edit Distance?

Other metrics?