CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms

Jan 01, 2016


Louise Elliott
Transcript
Page 1: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

CS5263 Bioinformatics

Lecture 17

Exact String Matching Algorithms

Page 2: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Boyer – Moore algorithm

• Three ideas:
– Right-to-left comparison
– Bad character rule
– Good suffix rule

Page 3: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Boyer – Moore algorithm

• Right to left comparison

x

y

y

Skip some chars without missing any occurrence.

Page 4: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Extended bad character rule

char Position in P

a 6, 3

b 7, 4

p 2

t 1

x 5

T: xpbctbxabpqqaabpqz

P: tpabxab *^^

P: tpabxab

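Find the occurrence of the mismatched text character T(k) in P that is closest to the left of position i, then shift P to align that occurrence with T(k).

Here i = 5 and the closest occurrence to its left is at position 3: 5 – 3 = 2, so shift by 2.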

Preprocessing O(n)

Restart the comparison here.
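The table lookup above can be sketched in Python as follows. This is a minimal illustration, not the lecture's code: the function names `bad_char_table` and `bad_char_shift` are made up here, positions are 1-based to match the slide, and the fallback shift of i when the character does not occur to the left of i is the usual convention, assumed rather than taken from the slide.

```python
from collections import defaultdict

def bad_char_table(pattern):
    """Map each character to its 1-based positions in pattern, rightmost first."""
    table = defaultdict(list)
    for pos in range(len(pattern), 0, -1):
        table[pattern[pos - 1]].append(pos)
    return table

def bad_char_shift(table, i, text_char):
    """Shift for a mismatch at 1-based pattern position i against text_char."""
    for pos in table.get(text_char, []):
        if pos < i:            # closest occurrence to the left of i
            return i - pos
    return i                   # assumed convention: no such occurrence, move P past T(k)

table = bad_char_table("tpabxab")
print(dict(table))                    # 'a' maps to [6, 3], as in the slide's table
print(bad_char_shift(table, 5, "a"))  # the slide's case: 5 - 3 = 2
```

Each list is scanned from the right, so a lookup stops at the first position smaller than i.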

Page 5: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

(Strong) good suffix rule

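[Figure: T contains t preceded by x. P ends with the suffix t preceded by y (the mismatch), and contains an earlier copy t’ preceded by z, with z ≠ y. P is shifted right so that t’ lines up with t in T.]

In preprocessing: for each suffix t of P, find the rightmost other copy t’ of t in P such that the char to the left of t’ differs from the char to the left of t (z ≠ y).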

Page 6: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Example preprocessing

qcabdabdab

char Positions in P

a 9, 6, 3

b 10, 7, 4

c 2

d 8, 5

q 1

q c a b d a b d a b
1 2 3 4 5 6 7 8 9 10
0 0 0 0 2 0 0 2 0 0
dabcab

Bad char rule Good suffix rule

dabdabcabdab

Where to shift depends on T Does not depend on T

Page 7: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Tricky case

Pattern: abcab

a b c a b
0 0 0 1 0

* ^ ^

T: x y a a b c a b

shift = 4 – 1 = 3

a b c a b
N N 0 N N

c

b

c

b

i-L

Page 8: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Example preprocessing

qcabdabdab

char Positions in P

a 9, 6, 3

b 10, 7, 4

c 2

d 8, 5

q 1

q c a b d a b d a b
1 2 3 4 5 6 7 8 9 10
0 0 0 0 0 3 0 0 3 0
dabcab

Bad char rule Good suffix rule

Where to shift depends on T Does not depend on T

dabdabcabdab

Page 9: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Example preprocessing

qcabdabdab

char Positions in P

a 9, 6, 3

b 10, 7, 4

c 2

d 8, 5

q 1

q c a b d a b d a b
1 2 3 4 5 6 7 8 9 10
N N N N 2 N N 2 N N
dabcab

Bad char rule Good suffix rule

dabdabcabdab

Where to shift depends on T Does not depend on T

Page 10: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Algorithm KMP: Basic idea

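[Figure: T aligned with P before and after the shift. t is a suffix of P[1..i] that matched T, t’ is the equal prefix of P, x is the mismatched char in T, y is the char following t in P, and z is the char following t’ in P.]

In pre-processing: for each position i in P, find the longest proper suffix t of P[1..i] such that t equals a prefix t’ of P (t = t’) and the chars that follow them differ (y ≠ z). For each i, let Sp’(i) = length(t).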

Page 11: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Failure link

P: aataac

a a t a a c

Sp’(i) 0 1 0 0 2 0

aaat

aataac

If a char in T fails to match at pos 6, re-compare it with the char at pos 3
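A compact KMP matcher built on failure links might look like the sketch below. For brevity it uses the classic (weaker) failure function, the longest border of each prefix, rather than the strong Sp’ values shown on the slide; for aataac the two differ only at position 4. The function names are illustrative.

```python
def failure_function(pattern):
    """fail[i] = length of the longest proper border of pattern[:i + 1]."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]            # follow failure links
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text, pattern):
    """Return the 0-based start indices of all occurrences of pattern in text."""
    if not pattern:
        return []
    fail = failure_function(pattern)
    hits, k = [], 0                    # k = number of pattern chars matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]            # re-compare ch against a shorter prefix
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits

print(failure_function("aataac"))
print(kmp_search("aataataac", "aataac"))
```

No character of the text is ever re-read from an earlier position; only the pattern pointer k moves backwards.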

Page 12: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

FSA

P: aataac

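[Figure: automaton with states 0 to 6; the forward edges 0→1→…→6 are labeled a, a, t, a, a, c, and additional edges labeled a and t are drawn between states.]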

All other input goes to state 0

Sp’(i) 0 1 0 0 2 0

aaat

aataac

If the next char in T is t, we go to state 3

Page 13: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Tricky case

Pattern: abcab

a b c a b

0 0 0 0 2

a b bc a

c

Failure link

FSA

dummy

Page 14: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

How to actually do pre-processing?

• Similar pre-processing for KMP and B-M
– Find matches between a suffix and a prefix
– Both can be done in linear time
– P is usually short, so even a more expensive pre-processing may result in a gain overall

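[Figure: the KMP case (a prefix t’ of P matching a substring t that ends at i) and the B-M case (the suffix t of P matching an earlier copy t’), with positions i and j marked.]

For each i, find a j, similar to dynamic programming. Start from i = 2.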

Page 15: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Fundamental pre-processing

• Z_i: length of the longest substring starting at i that matches a prefix of P
– i.e. t = t’, x ≠ y, Z_i = |t|
– With the Z-values computed, we can get the preprocessing for both KMP and B-M in linear time.

P = a a b c a a b x a a z
Z = 0 1 0 0 3 1 0 0 2 1 0

• How to compute Z-values in linear time?

[Figure: the prefix t’ = P[1..Z_i] and its copy t = P[i..i+Z_i-1]; x and y are the differing chars that follow the two copies.]

Page 16: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Computing Z in Linear time

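[Figure: P with the prefix t’ and its copy t = P[l..r]; position k lies inside t, and x, y are the differing chars that follow the two copies.]

We have already computed all Z-values up to k-1 and need to compute Z_k. We also know the starting and ending points of the previous match, l and r.

We know that t = t’; therefore the Z-value at position k-l+1 may be helpful to us.

[Figure: position k inside t corresponds to position k-l+1 inside t’.]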

Page 17: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Computing Z in Linear time

• No char inside the box is compared twice. At most one mismatch per iteration.
• Therefore, O(n).

Case 1: The previous r is smaller than k, i.e., no previous match extends beyond k. Do explicit comparison starting at k.

Case 2: k ≤ r and Z_{k-l+1} < r-k+1. Then Z_k = Z_{k-l+1}. No comparison is needed.

Case 3: k ≤ r and Z_{k-l+1} ≥ r-k+1. Then Z_k ≥ r-k+1, and the comparison continues from position r+1.
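The three cases can be sketched in Python as below. This is an illustrative version with 0-based indices, so `z[k]` corresponds to the slide's Z at position k+1, and the Z-box is kept as the half-open range `p[l:r]`; the function name `z_values` is an assumption. Copying happens only when the copied value ends strictly inside the box, otherwise matching resumes just past the box.

```python
def z_values(p):
    """z[k] = length of the longest substring starting at k that matches a prefix of p."""
    n = len(p)
    z = [0] * n                      # z[0] is left at 0 by convention
    l = r = 0                        # current Z-box is p[l:r] (r exclusive)
    for k in range(1, n):
        if k < r and z[k - l] < r - k:
            # Case 2: the copied value ends strictly inside the box, no comparison
            z[k] = z[k - l]
        else:
            # Case 1 (k is outside the box) or Case 3 (copy reaches the box end):
            # compare explicitly, starting from the first unseen character
            j = max(r, k)
            while j < n and p[j] == p[j - k]:
                j += 1
            z[k] = j - k
            l, r = k, j
    return z

print(z_values("aabcaabxaaz"))
```

The pointer j never moves left across iterations, which is where the linear bound comes from.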

Page 18: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Z-preprocessing for B-M and KMP

• Both KMP and B-M preprocessing can be done in O(n)

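[Figure: a Z-box t of length Z_i starts at position i and ends at position j = i + Z_i - 1; it matches the prefix t’ of P, and the chars x and y that follow the two copies differ.]

KMP: for each j, sp’(j + Z_j - 1) = Z_j.

B-M: use Z backwards (compute the Z-values on the reversed pattern).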
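A small sketch of the KMP mapping follows, assuming 0-based indices (so `sp[i]` is the slide's sp’(i+1)). To keep the block short it uses a quadratic helper, `naive_z`, instead of the linear-time Z computation; both function names are made up for this illustration. The loop runs right to left so that, when two Z-boxes end at the same position, the one that starts further left (the longer match) is written last.

```python
def naive_z(p):
    """Quadratic-time Z-values; a stand-in for the linear-time algorithm."""
    z = [0] * len(p)
    for k in range(1, len(p)):
        while k + z[k] < len(p) and p[z[k]] == p[k + z[k]]:
            z[k] += 1
    return z

def strong_failure_from_z(p):
    """sp[i] = the slide's sp'(i + 1), read off the ends of the Z-boxes."""
    z = naive_z(p)
    sp = [0] * len(p)
    for j in range(len(p) - 1, 0, -1):   # right to left: smaller j overwrites
        if z[j] > 0:
            sp[j + z[j] - 1] = z[j]      # the box starting at j ends at j + z[j] - 1
    return sp

print(strong_failure_from_z("aataac"))   # matches the Sp' row on the failure-link slide
print(strong_failure_from_z("abcab"))
```

Because a Z-box is maximal, the characters just past the box and just past the matching prefix differ, which is the y ≠ z condition of the strong rule.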

Page 19: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Keyword tree for spell checking

• O(n) time to construct. n: total length of patterns.
• Search time: O(m). m: length of word
• Common prefixes only need to be compared once.
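A keyword tree for dictionary lookup can be sketched with nested dictionaries. The class name `KeywordTree` and the five sample words are assumptions (the words are a guess at what the slide's figure spells out), not something the transcript states.

```python
class KeywordTree:
    """Minimal keyword tree (trie); each node is a dict of child nodes."""

    END = None  # key marking that a stored word ends at this node

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.add(word)

    def add(self, word):
        """Insert word in time proportional to its length."""
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})   # shared prefixes reuse existing nodes
        node[self.END] = True

    def contains(self, word):
        """Look up word in O(m) steps, m = len(word)."""
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return False
        return self.END in node

tree = KeywordTree(["potato", "poetry", "pottery", "science", "school"])
print(tree.contains("pottery"), tree.contains("potter"))
```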

[Figure: keyword tree with five leaves numbered 1 to 5; the edge letters appear to spell potato, poetry, pottery, science, and school.]

Page 20: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Aho-Corasick algorithm

• Generalizing KMP

• Create failure links

• Basis of the fgrep algorithm

• Given the following patterns:
– potato
– tattoo
– theater
– other

Page 21: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Failure link

[Figure: keyword tree for potato, tattoo, theater, and other, rooted at node 0 with leaves numbered 1 to 4 and failure links drawn between nodes.]

potterisapersonwhomakespottery

Page 22: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Failure link

[Figure: the same keyword tree for potato, tattoo, theater, and other, with failure links.]

O(n) preprocessing and O(m+k) searching, where k is the number of occurrences.

An FSA can be created similarly. It requires more space, and its preprocessing time depends on the alphabet size.

Page 23: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

A problem with failure link

• Patterns: {potato, other, pot}

[Figure: keyword tree for potato, other, and pot, rooted at node 0 with nodes numbered 1 to 3 and failure links.]

Page 24: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

A problem with failure link for multiple patterns

• Patterns: {potato, other, pot, the, he, era}

[Figure: keyword tree for the six patterns, rooted at node 0 with nodes numbered 1 to 5 and failure links.]

T: potherarac

Page 25: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Output link

• Patterns: {potato, other, pot, the}

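[Figure: keyword tree for the patterns, rooted at node 0 with nodes numbered 1 to 5, showing both failure links and output links.]

T: potherarac

Failure link: taken when a mismatch occurs.
Output link: always taken (but the search then returns to the node it came from).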
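The failure-link and output-link ideas can be sketched together in Python. This is an illustrative implementation, not the lecture's: instead of storing explicit output links, it merges the output list of each node's failure target into the node during preprocessing, which reports the same matches. The names `build_automaton` and `search` are made up, and the pattern set is the six-pattern example from the earlier slide.

```python
from collections import deque

def build_automaton(patterns):
    """Build goto edges, failure links, and merged output lists for the keyword tree."""
    goto, fail, out = [{}], [0], [[]]          # state 0 is the root
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    queue = deque(goto[0].values())            # depth-1 states fail to the root
    while queue:                               # breadth-first, so shallower states are done first
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]   # stands in for following output links
    return goto, fail, out

def search(text, patterns):
    """Return (0-based start, pattern) for every occurrence of every pattern."""
    goto, fail, out = build_automaton(patterns)
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]                # failure link: taken on mismatch
        state = goto[state].get(ch, 0)
        for pat in out[state]:                 # patterns ending at position i
            hits.append((i - len(pat) + 1, pat))
    return hits

print(sorted(search("potherarac", ["potato", "other", "pot", "the", "he", "era"])))
```

Merging the lists costs extra space when many patterns are suffixes of one another; explicit output links avoid that at the price of a pointer walk per match.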

Page 26: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Suffix Tree

• All algorithms we talked about so far preprocess pattern(s)
– Karp-Rabin: small pattern, small alphabet
– Boyer-Moore: fastest in practice. O(m) worst case.
– KMP: O(m)
– Aho-Corasick: O(m)

• In some cases we may prefer to pre-process T
– Fixed T, varying P

• Suffix tree: basically a keyword tree of all suffixes

Page 27: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Suffix tree

• T: xabxac

• Suffixes:
1. xabxac
2. abxac
3. bxac
4. xac
5. ac
6. c

[Figure: suffix tree of xabxac with leaves numbered 1 to 6.]

Create an internal node only when there is a branch.

Naïve construction: O(m²) using Aho-Corasick.

Smarter: O(m). Very technical, with a big constant factor.

Page 28: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Suffix tree implementation

• Explicitly labeling seq end

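• T: xabxa versus T: xabxa$

[Figure: left, the suffix tree of xabxa, where only leaves 1 to 3 appear because suffixes 4 and 5 end inside edges; right, the suffix tree of xabxa$, where the terminator $ gives every suffix its own leaf, 1 to 5.]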

Page 29: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Suffix tree implementation

• Implicitly labeling edges

• T: xabxa$

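[Figure: the suffix tree of xabxa$ drawn twice: once with explicit edge labels, and once with each label replaced by a pair of indices into T (for example 1:2, 2:2, 3:$).]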

Page 30: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Suffix links

• Similar to failure link in a keyword tree

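• Only link internal nodes having branches

[Figure: a tree with edge labels x, ab, cd, ef, g, h, ij and a second branch labeled ab, c, de, fg, h, i, j, with suffix links drawn between internal nodes; the string xabcff appears below it.]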

Page 31: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Suffix tree construction

[Figure: suffix tree with a single leaf, 1, reached by the edge 1:$.]

Positions: 1234567890...
T:         acatgacatt...

Page 32: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Suffix tree construction

[Figure: suffix tree with leaves 1 and 2, reached by the edges 1:$ and 2:$.]

Positions: 1234567890...
T:         acatgacatt...

Page 33: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Suffix tree construction

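[Figure: suffix tree after adding leaf 3; a new internal node below the edge a has the edges 2:$ (to leaf 1) and 4:$ (to leaf 3), and leaf 2 keeps the edge 2:$.]

Positions: 1234567890...
T:         acatgacatt...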

Page 34: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Suffix tree construction

[Figure: suffix tree after adding leaf 4 on a new edge 4:$ from the root; leaves 1 to 3 are unchanged.]

Positions: 1234567890...
T:         acatgacatt...

Page 35: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Suffix tree construction

[Figure: suffix tree after adding leaf 5 on a new edge 5:$ from the root; leaves 1 to 4 are unchanged.]

Positions: 1234567890...
T:         acatgacatt...

Page 36: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Suffix tree construction

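[Figure: suffix tree after adding leaf 6; the path to leaf 1 is split to create a new internal node, with the edge 5:$ leading to leaf 1 and a new edge leading to leaf 6.]

Positions: 1234567890...
T:         acatgacatt...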

Page 37: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Suffix tree construction

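• With this suffix link, when we later need to add another suffix, say acaty, we can use the link to avoid going back to the root and re-comparing “cat”

[Figure: suffix tree after adding leaf 7; the path cat is split to create an internal node, and a suffix link points to it from the internal node for acat.]

Positions: 1234567890...
T:         acatgacatt...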

Page 38: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Suffix tree construction

[Figure: suffix tree after adding leaf 8; leaves 1 to 7 are unchanged.]

Positions: 1234567890...
T:         acatgacatt...

Page 39: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Suffix tree construction

[Figure: suffix tree after adding leaf 9; leaves 1 to 8 are unchanged.]

Positions: 1234567890...
T:         acatgacatt...

Page 40: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Suffix tree construction

[Figure: suffix tree after adding leaf 10; leaves 1 to 9 are unchanged.]

Positions: 1234567890...
T:         acatgacatt...

Page 41: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

ST Application: pattern matching

• Find all occurrences of P = xa in T
– Find node v in the ST that matches P
– Traverse the subtree rooted at v to get the locations

[Figure: suffix tree of T = xabxac with leaves numbered 1 to 6.]

T: xabxac

O(m) to construct the ST (large constant factor)

O(n) to find v: linear in the length of P instead of T!

O(k) to get all leaves, where k is the number of occurrences.
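The two search steps (walk down to v, then collect the leaves below v) can be illustrated with a naive suffix trie. This sketch is a stand-in, not a real suffix tree: it is built in O(m²) time and space with one node per character, so it does not achieve the O(m) construction or the O(k) leaf collection quoted above. It assumes that `$` does not occur in the text, and the function names are illustrative.

```python
def build_suffix_trie(text):
    """Naive O(m^2) suffix trie; the key None stores the 1-based suffix number at a leaf."""
    text += "$"                              # assumed terminator, not part of the alphabet
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[None] = start + 1               # suffix numbers as on the slides
    return root

def find_occurrences(root, pattern):
    """Return the sorted 1-based start positions of pattern in the indexed text."""
    node = root
    for ch in pattern:                       # walk down to node v, O(len(pattern))
        if ch not in node:
            return []
        node = node[ch]
    hits, stack = [], [node]
    while stack:                             # collect every leaf below v
        cur = stack.pop()
        for key, child in cur.items():
            if key is None:
                hits.append(child)
            else:
                stack.append(child)
    return sorted(hits)

trie = build_suffix_trie("xabxac")
print(find_occurrences(trie, "xa"))
```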

Page 42: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

ST application: repeats finding

• Genome contains many repeated DNA sequences

• Repeat sequence length: varies from 1 nucleotide to a whole gene
– Highly repetitive DNA in some non-coding regions
  • 6 to 10 bp x 100,000 to 1,000,000 times
– Genes may have multiple copies (50 to 10,000)

Page 43: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Find longest repeated substring

• Do a tree traversal, compute the lengths of labels at each node

• O(m)

[Figure: a node reached by the edge 2:5 has L = 4; its two children, reached by the edges 6:10 and 15:18, have L = 9 and L = 8.]

Page 44: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Repeats finding

• Find all repeats that are at least k residues long and appear at least p times in the seq
– Phase 1: top-down, count lengths of labels at each node
– Phase 2: bottom-up, count # of leaves descended from each internal node

[Figure: each node annotated with (L, N).]

For each node with L >= k and N >= p, print all leaves

O(m) to traverse tree
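The (L, N) bookkeeping can be sketched on a naive suffix trie, where L is the depth of a node and N the number of leaves below it. This is only an illustration under stated assumptions: the trie is quadratic in size rather than a linear-size suffix tree, the traversal is recursive (fine for short strings only), `$` is assumed absent from the text, p is assumed to be at least 2, and the name `find_repeats` is made up.

```python
def find_repeats(text, k, p):
    """Map each substring of length >= k occurring >= p times to its 1-based positions."""
    padded = text + "$"                       # assumed terminator
    root = {}
    for start in range(len(padded)):          # naive suffix trie, O(m^2)
        node = root
        for ch in padded[start:]:
            node = node.setdefault(ch, {})
        node[None] = start + 1                # leaf: suffix number

    repeats = {}

    def visit(node, label):
        """Return the leaves below node; len(label) plays the role of L."""
        leaves = []
        for key, child in node.items():
            if key is None:
                leaves.append(child)
            else:
                leaves.extend(visit(child, label + key))
        if len(label) >= k and len(leaves) >= p:   # L >= k and N >= p
            repeats[label] = sorted(leaves)
        return leaves

    visit(root, "")
    return repeats

print(find_repeats("acatgacatt", 3, 2))
```

On acatgacatt with k = 3 and p = 2 this reports aca, cat, and acat.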

Page 45: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Repeats finding

• Find repeats with at least 3 bases and 2 occurrences
– cat
– acat
– aca

Positions: 1234567890
T:         acatgacatt

[Figure: suffix tree of acatgacatt$ with leaves numbered 1 to 10; leaf edges carry index ranges ending in e (for example 5:e).]

Page 46: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Repeats finding

1. Left-maximal repeat
– S[i+1..i+k] = S[j+1..j+k]
– S[i] != S[j]

2. Right-maximal repeat
– S[i+1..i+k] = S[j+1..j+k]
– S[i+k+1] != S[j+k+1]

3. Maximal repeat
– S[i+1..i+k] = S[j+1..j+k]
– S[i] != S[j], and S[i+k+1] != S[j+k+1]

Example: acatgacatt
1. aca
2. cat
3. acat

Page 47: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Repeats finding

• How to find maximal repeats?
– A right-maximal repeat with different left chars

Positions: 1234567890
T:         acatgacatt

[Figure: suffix tree of acatgacatt$ with leaves numbered 1 to 10, each leaf annotated with the char to its left.]

Left char = [] g c c a a

Page 48: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

ST application: word enumeration

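• Find all k-mers that occur at least p times
– Compute (L, N) for each node
– Find nodes v with L >= k, L(parent) < k, and N >= p
– Traverse the sub-tree rooted at v to get the locations

[Figure: a node with L >= k and N >= p below a parent with L < k; the k-mer ends at depth L = k on the edge between them.]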

This can be used in many applications. For example, to find words that appeared frequently in a genome or a document

Page 49: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Joint Suffix Tree

• Build a ST for two or more strings

• Two strings S1 and S2

• S* = S1 & S2

• Build a suffix tree for S* in time O(|S1| + |S2|)

• The separator will only appear on edges ending in a leaf

Page 50: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

• S1 = abcd

• S2 = abca

• S* = abcd&abca$

[Figure: joint suffix tree of S*, with full edge labels such as bcd&abca and d&abca. Leaves are labeled (string, position): 1,1 to 1,4 for S1 and 2,1 to 2,4 for S2.]

Page 51: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

To Simplify

• We don’t really need to do anything, since all edge labels were implicit.

• The right hand side is more convenient to look at

[Figure: left, the joint suffix tree with full edge labels, where the part of each leaf edge after the separator & is marked useless; right, the same tree with those parts removed. Leaves are labeled (string, position): 1,1 to 1,4 and 2,1 to 2,4.]

Page 52: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Application of JST

• Longest common substring
– For each internal node v, keep a bit vector B[2]
– B[1] = 1 if some leaf below v is a suffix of S1 (B[2] likewise for S2)
– Find all internal nodes with B[1] = B[2] = 1
– Report one with the longest label
– Can be extended to k sequences. Just use a longer bit vector.

[Figure: simplified joint suffix tree of abcd and abca, with leaves labeled 1,1 to 1,4 and 2,1 to 2,4.]

O(m), where m is the total seq length
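The bit-vector idea can be sketched on a naive joint suffix trie. Assumptions made for brevity: the trie is quadratic in size rather than a linear-size joint suffix tree, the bits are set along each path at insertion time instead of being propagated bottom-up, no separator is needed because each suffix is inserted separately, and the name `longest_common_substring` is illustrative.

```python
def longest_common_substring(s1, s2):
    """Return one longest substring shared by s1 and s2 ('' if there is none)."""
    root = {}
    for bit, s in ((1, s1), (2, s2)):
        for start in range(len(s)):               # insert every suffix of s
            node = root
            for ch in s[start:]:
                node = node.setdefault(ch, {})
                # key None holds the bit vector: bit 1 for S1, bit 2 for S2
                node[None] = node.get(None, 0) | bit
    best = ""
    stack = [(root, "")]
    while stack:                                  # visit only nodes seen in both strings
        node, label = stack.pop()
        for key, child in node.items():
            if key is None:
                continue
            if child[None] == 3:                  # B[1] = B[2] = 1
                if len(label) + 1 > len(best):
                    best = label + key
                stack.append((child, label + key))
    return best

print(longest_common_substring("abcd", "abca"))
```

Extending this to K strings only changes the mask: use one bit per string and test how many bits are set.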

Page 53: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Application of JST

• Given K strings, find all substrings of length at least l that appear in at least d strings

• Exact motif finding problem

• Build a joint suffix tree with all strings

S* = S1 & S2 % S3 * S4 @ S5 ! S6 + S7

– Use a unique end char for each string
– Not really necessary if caution is taken in construction

Page 54: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

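[Figure: a node with L >= k whose two subtrees carry B = 1010 (leaves 1,x and 3,x) and B = 0011 (leaves 3,x and 4,x); its own vector is B = 1010 | 0011 = 1011, so |B| = 3. Its parent has L < k.]

O(mK), where m is the total seq length. The factor K is the cost of the “bitwise or” of two bit vectors.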

Page 55: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Many other applications

• Reproduce the behavior of Aho-Corasick
• DNA fingerprinting
– A database of people’s DNA sequences
– Given a short DNA, which person is it from?
• Recognizing DNA contamination
• Indexing sequence databases
• …
• Catch
– Large constant factor for space requirement (15-40 bytes per base for DNA)
– Large constant factor for construction
– Suffix array: trade off time for space
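As a rough illustration of the suffix-array trade-off, the sketch below stores only the sorted suffix start positions (one integer per character) and finds a pattern by binary search. The construction by plain sorting is a simplification that is far from the best known running time, and both function names are made up for this example.

```python
def build_suffix_array(text):
    """Start positions (0-based) of all suffixes of text, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_with_suffix_array(text, sa, pattern):
    """Return the sorted 0-based start positions of pattern in text."""
    n = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                     # first suffix whose n-char prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + n] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(sa) and text[sa[lo]:sa[lo] + n] == pattern:
        hits.append(sa[lo])            # matching suffixes are contiguous in the array
        lo += 1
    return sorted(hits)

text = "acatgacatt"
sa = build_suffix_array(text)
print(find_with_suffix_array(text, sa, "cat"))
```

Each search costs O(n log m) character comparisons here, compared with O(n) for the walk down a suffix tree.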

Page 56: CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.

Summary

• One T, one P
– Boyer-Moore is the choice
– KMP works but is not the best

• One T, many P
– Aho-Corasick
– Suffix tree

• One fixed T, many varying P
– Suffix tree

• Two or more T’s
– Suffix tree, joint suffix tree, suffix array

Alphabet independent

Alphabet dependent