Improved string matching with k mismatches

(The Kangaroo Method)

Z. Galil, R. Giancarlo

SIGACT News, Vol. 17, No. 4, 1986, pp. 52–54

Original: Moshe Lewenstein

Modified by: Hsing-Yen Ann

Date: Nov. 26, 2004

Exact String Matching

Input: T = t_1 … t_n

P = p_1 … p_m

Output: All locations i of T where P appears

Example:

P = A B C A A B

T = A B A B C A A B C A A B C A A B A A…

Answer: {3, 7, 11, …}
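The example above can be checked with a few lines of Python. This is only a minimal sketch: the function name `find_occurrences` is made up for illustration, locations are reported 1-indexed as on the slide, and the text is cut off where the slide's ellipsis begins.

```python
def find_occurrences(text: str, pattern: str) -> list[int]:
    """Return every 1-indexed location i of text where pattern appears."""
    n, m = len(text), len(pattern)
    # Compare the pattern against each length-m window of the text.
    return [i + 1 for i in range(n - m + 1) if text[i:i + m] == pattern]


T = "ABABCAABCAABCAABAA"  # the visible part of the slide's text
P = "ABCAAB"
print(find_occurrences(T, P))  # [3, 7, 11]
```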

Approximate String Matching

Idea: Find all text locations where distance from pattern is sufficiently small.

distance metric: HAMMING DISTANCE

Let S = s_1 s_2 … s_m

R = r_1 r_2 … r_m

Ham(S,R) = the number of locations j where s_j ≠ r_j

Example: S = ABCABC R = ABBAAC

Ham(S,R) = 2
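As a quick check of the definition, a possible Python version of Ham is sketched below. The name `ham` and the error raised for unequal lengths are illustrative choices, not part of the slides.

```python
def ham(s: str, r: str) -> int:
    """Hamming distance: the number of positions j where s[j] != r[j]."""
    if len(s) != len(r):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(s, r))


print(ham("ABCABC", "ABBAAC"))  # 2
```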

String Matching with Mismatches

Input: T = t_1 … t_n

P = p_1 … p_m

Output: For each i in T, Ham(P, Ti), where Ti = t_i t_{i+1} … t_{i+m-1}

Example:

P = A B B A A C

T = A B C A A B C A C…

Ham(P,T1) = 2

Ham(P,T2) = 4

Ham(P,T3) = 6

Ham(P,T4) = 2

Answer: 2, 4, 6, 2, …

String Matching with k Mismatches

Input: T = t_1 … t_n, P = p_1 … p_m

Output: Every i in T s.t. Ham(P, t_i t_{i+1} … t_{i+m-1}) ≤ k

Example: k = 2

P = A B B A A C

T = A B C A A B C A C…

Distances: 2, 4, 6, 2, …

Answer: Y, N, N, Y, …

Naïve Algorithm (for counting mismatches or k-mismatches problem)

- Go to each location i of the text and compute the Hamming distance of P and Ti

Running time: O(nm), where n = |T|, m = |P|
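The naïve algorithm could be sketched as follows. The function names are hypothetical, locations are 1-indexed to match the slides, and the text is truncated at the slide's ellipsis.

```python
def mismatch_counts(text: str, pattern: str) -> list[int]:
    """Hamming distance between pattern and every length-m window of text."""
    n, m = len(text), len(pattern)
    # O(m) work at each of the n - m + 1 locations: O(nm) overall.
    return [
        sum(text[i + j] != pattern[j] for j in range(m))
        for i in range(n - m + 1)
    ]


def k_mismatch_locations(text: str, pattern: str, k: int) -> list[int]:
    """1-indexed locations whose window is within k mismatches of pattern."""
    return [i + 1 for i, d in enumerate(mismatch_counts(text, pattern)) if d <= k]


T, P = "ABCAABCAC", "ABBAAC"
print(mismatch_counts(T, P))          # [2, 4, 6, 2]
print(k_mismatch_locations(T, P, 2))  # [1, 4]
```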

The Kangaroo Method (for k-mismatches)

Landau – Vishkin 1986

Galil – Giancarlo 1986

Trie

• A tree representing a set of strings.

{ aeef, ad, bbfe, bbfg, c }

root
├─ a
│  ├─ e ─ e ─ f
│  └─ d
├─ b ─ b ─ f
│           ├─ e
│           └─ g
└─ c

Trie (Cont)

• Assume no string is a prefix of another

Each string corresponds to a leaf.
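A trie for the example set could be built as nested dictionaries, one dictionary per node. This is a sketch only; `build_trie` and `leaves` are illustrative names, and the representation is one of many possible choices.

```python
def build_trie(words):
    """Build a trie as nested dicts: each key is an edge label (one character)."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})  # reuse the child if it exists
    return root


def leaves(node, prefix=""):
    """Return the strings spelled on root-to-leaf paths, in sorted order."""
    if not node:
        return [prefix]
    found = []
    for ch in sorted(node):
        found.extend(leaves(node[ch], prefix + ch))
    return found


trie = build_trie(["aeef", "ad", "bbfe", "bbfg", "c"])
print(leaves(trie))  # ['ad', 'aeef', 'bbfe', 'bbfg', 'c']
```

Because no string in the set is a prefix of another, the leaves are exactly the input strings.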

Compressed Trie

• Compress unary nodes, label edges by strings

root
├─ a
│  ├─ eef
│  └─ d
├─ bbf
│  ├─ e
│  └─ g
└─ c
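Compressing unary nodes can be sketched on a nested-dict trie. The helper names `build_trie` and `compress` are hypothetical, and the sketch assumes the stored set is prefix-free, as on the previous slide.

```python
def build_trie(words):
    """Plain trie as nested dicts with single-character edge labels."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root


def compress(node):
    """Merge every chain of unary nodes into a single string-labeled edge."""
    out = {}
    for ch, child in node.items():
        label = ch
        # Follow the chain while the node below has exactly one child.
        while len(child) == 1:
            (next_ch, grandchild), = child.items()
            label += next_ch
            child = grandchild
        out[label] = compress(child)
    return out


trie = build_trie(["aeef", "ad", "bbfe", "bbfg", "c"])
print(compress(trie))
```

For the example set this yields the edge labels a, eef, d, bbf, e, g and c.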

Suffix tree

Suffix tree of string s: a compressed trie of all suffixes of s

Prefix-free: add a special character, say $, at the end of s

Suffix tree (Example)

Let s = abab. A suffix tree of s is a compressed trie of all suffixes of s = abab$

{ $, b$, ab$, bab$, abab$ }

root
├─ ab
│  ├─ ab$
│  └─ $
├─ b
│  ├─ ab$
│  └─ $
└─ $
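The role of the terminator can be illustrated with a small check. The helper names `suffixes` and `is_prefix_free` are assumptions for this sketch, which only enumerates the suffix set and does not build the tree.

```python
def suffixes(s: str, terminator: str = "$") -> list[str]:
    """All suffixes of s + terminator, longest first."""
    t = s + terminator
    return [t[i:] for i in range(len(t))]


def is_prefix_free(strings) -> bool:
    """True if no string in the collection is a proper prefix of another."""
    return not any(a != b and b.startswith(a) for a in strings for b in strings)


print(suffixes("abab"))                      # ['abab$', 'bab$', 'ab$', 'b$', '$']
print(is_prefix_free(suffixes("abab")))      # True
print(is_prefix_free(suffixes("abab", "")))  # False: 'ab' is a prefix of 'abab'
```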

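Suffix Tree properties

- Succinct in space: O(n).

- Can be built in O(n) time (McCreight, Weiner, Ukkonen, Farach-Colton).

Suffix tree of s = abab$ with leaves numbered by suffix start:

root
├─ ab
│  ├─ ab$  (leaf 1)
│  └─ $    (leaf 3)
├─ b
│  ├─ ab$  (leaf 2)
│  └─ $    (leaf 4)
└─ $       (leaf 5)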

Exact string matching

[Figure: suffix tree of s = abab$ with leaves numbered 1 to 5]

Given a pattern P = ab we traverse the tree according to the pattern.

Leaves correspond to locations of appearance! For s = abab$ and P = ab these are leaves 1 and 3.

Prepare tree: O(n) time

Find matches: O(m + occ) time, occ = # of matches
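The traversal can be sketched in Python. Note the substitution: for brevity this uses an uncompressed suffix trie built in O(n^2) time, not the linear-time suffix tree the slides refer to. The class and function names are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}  # edge character -> child Node
        self.leaf = None    # 1-indexed start of the suffix ending at this node


def build_suffix_trie(s: str) -> Node:
    """Insert every suffix of s (which should end with a unique terminator)."""
    root = Node()
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.children.setdefault(ch, Node())
        node.leaf = i + 1
    return root


def occurrences(root: Node, pattern: str) -> list[int]:
    """Walk down along pattern, then report all leaves below that point."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []  # pattern does not occur
    found, stack = [], [node]
    while stack:
        cur = stack.pop()
        if cur.leaf is not None:
            found.append(cur.leaf)
        stack.extend(cur.children.values())
    return sorted(found)


root = build_suffix_trie("abab$")
print(occurrences(root, "ab"))  # [1, 3]
```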

Lowest common ancestors

A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it.

Why? The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes.

s = abbaab$

root
├─ a
│  ├─ ab$        (leaf 4)
│  └─ b
│     ├─ baab$   (leaf 1)
│     └─ $       (leaf 5)
├─ b
│  ├─ aab$       (leaf 3)
│  ├─ baab$      (leaf 2)
│  └─ $          (leaf 6)
└─ $             (leaf 7)

Example: the suffixes aab$ (leaf 4) and abbaab$ (leaf 1). Their LCA is the node reached by the edge a, so their LCP is a.
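The LCP value that an LCA query returns can be cross-checked by direct comparison. The sketch below is a plain character-by-character scan, not the constant-time LCA machinery; `lcp_length` is an illustrative name and positions are 1-indexed as in the leaf numbering.

```python
def lcp_length(s: str, i: int, j: int) -> int:
    """Length of the longest common prefix of the suffixes of s starting at
    1-indexed positions i and j, found by scanning both suffixes."""
    a, b = i - 1, j - 1
    length = 0
    while (a + length < len(s) and b + length < len(s)
           and s[a + length] == s[b + length]):
        length += 1
    return length


s = "abbaab$"
print(lcp_length(s, 4, 1))  # 1: aab$ and abbaab$ share only "a"
print(lcp_length(s, 5, 1))  # 2: ab$ and abbaab$ share "ab"
```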

LCA/LCP properties

Preprocessing time: O(n)

Query time: O(1)

Harel & Tarjan 1984, Schieber & Vishkin 1988, Berkman & Vishkin 1993

The Kangaroo Method (for k-mismatches)

- Create suffix tree for: s = P#T$

- Check P at each location i of T by kangarooing

Example:

P = A B A B A A B A C A B

T = A B B A C A B A B A B C A B B C A B C A …

With P aligned at i = 7 (counting from 0, i.e. the eighth character of T):

1. Find LCP(s, P0, Ti): length 4

2. Kangarooing distance = LCP(s, P0, Ti) + 1 = 5

3. Find LCP(s, P5, Ti+5): length 2

4. Kangarooing distance = LCP(s, P5, Ti+5) + 1 = 3

5. Find LCP(s, P8, Ti+8): length 3, which reaches the end of P

6. Next iteration: i = i + 1

The Kangaroo Method (for k-mismatches)

Preprocess:

- Build suffix tree of both P and T: O(n+m) time

- LCA preprocessing: O(n+m) time

Check P at a given text location:

- Kangaroo jump to the next mismatch, at most k+1 times, each jump in O(1): O(k) time

Overall time for naïve approach: O(nk)
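The jumping loop can be sketched as below. Important assumption: the suffix tree and LCA preprocessing are replaced by a quadratic-size LCP table, which keeps each query O(1) but does not have the O(n+m) preprocessing bound of the slides. The function names are hypothetical, locations are 0-indexed, and '#' and '$' are assumed to occur in neither P nor T.

```python
def build_lcp_table(s: str) -> list[list[int]]:
    """lcp[a][b] = length of the longest common prefix of s[a:] and s[b:].
    Quadratic stand-in for suffix tree + LCA; fine for small inputs only."""
    n = len(s)
    lcp = [[0] * (n + 1) for _ in range(n + 1)]
    for a in range(n - 1, -1, -1):
        for b in range(n - 1, -1, -1):
            if s[a] == s[b]:
                lcp[a][b] = lcp[a + 1][b + 1] + 1
    return lcp


def k_mismatch_kangaroo(text: str, pattern: str, k: int) -> list[int]:
    """0-indexed locations i with Ham(pattern, text[i:i+m]) <= k."""
    n, m = len(text), len(pattern)
    s = pattern + "#" + text + "$"
    lcp = build_lcp_table(s)
    offset = m + 1  # index in s where the text starts
    result = []
    for i in range(n - m + 1):
        j = 0           # current offset into the pattern
        mismatches = 0
        while True:
            j += lcp[j][offset + i + j]  # kangaroo jump over the matching run
            if j >= m:
                break                    # reached the end of the pattern
            mismatches += 1              # landed on a mismatch
            if mismatches > k:
                break
            j += 1                       # step past the mismatching character
        if mismatches <= k:
            result.append(i)
    return result


print(k_mismatch_kangaroo("ABCAABCAC", "ABBAAC", 2))  # [0, 3]
```

Each location costs at most k+1 table lookups, which mirrors the O(k) bound per location stated above.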

2004/11/22 Hsing-Yen Ann

Faster Algorithms for Four Different Cases

- Large alphabet: at least 2k different symbols in pattern P. Time: O(n)

- Small alphabet: at most 2√k different symbols in pattern P. Time: O(n √k log m)

- General alphabets, many frequent symbols: at least √k frequent symbols. Time: O(n √k log m)

- General alphabets, few frequent symbols: fewer than √k frequent symbols. Time: O(n √k log m)
