Advanced Algorithms – COMS31900 Pattern matching part four Pattern matching with at most k mismatches Benjamin Sach

Pattern Matching Part Two: k-mismatches

Feb 16, 2017


Benjamin Sach
Page 1: Pattern Matching Part Two: k-mismatches

Advanced Algorithms – COMS31900

Pattern matching part four

Pattern matching with at most k mismatches

Benjamin Sach

Pages 2-10: Pattern Matching Part Two: k-mismatches

Pattern matching with mismatches

Input: A text string T (length n) and a pattern string P (length m)

Goal: For every alignment i, output Ham(i), the Hamming distance between P and T[i . . . i+m−1]

The Hamming distance is the number of mismatches, i.e. the number of distinct j such that P[j] ≠ T[i+j]

[Figure: the text T, indexed 0 to 12, with the pattern P (length m) aligned beneath it at successive alignments i]

Values shown for the example: Ham(4) = 1, Ham(5) = 4, Ham(6) = 1, Ham(7) = 3, Ham(8) = 3

Last lecture we saw two algorithms for this problem:

One algorithm takes O(n|Σ| log m) time (where |Σ| is the alphabet size)

The other algorithm takes O(n√m log m) time (regardless of the alphabet size)
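
As a point of reference, the definition of Ham(i) can be computed directly by comparing P against every window of T. The sketch below is only that naive O(nm) baseline, not either of the algorithms from last lecture; the function name and the example strings are made up for illustration.

```python
def hamming_all_alignments(text: str, pattern: str) -> list[int]:
    """Return Ham(i) for every alignment i = 0 .. n - m (naive O(nm) scan)."""
    n, m = len(text), len(pattern)
    return [
        # Count the positions j where P[j] != T[i + j].
        sum(1 for j in range(m) if pattern[j] != text[i + j])
        for i in range(n - m + 1)
    ]


# Hypothetical example strings (not the ones from the slide figure).
print(hamming_all_alignments("abcabd", "abd"))  # [1, 3, 3, 0]
```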

Pages 11-19: Pattern Matching Part Two: k-mismatches

Pattern matching with few mismatches (k-mismatch)

Input: A text string T (length n), a pattern string P (length m) and a positive integer k

Goal: For all i, output Ham_k(i), where

Ham_k(i) = Ham(i) if Ham(i) ≤ k

Ham_k(i) = X if Ham(i) > k

Output the number of mismatches. . . unless it’s more than k

(we interpret the output X to mean “too many mismatches”)

[Figure: the text T, indexed 0 to 12, with the pattern P (length m) aligned beneath it at successive alignments i]

Values shown for the example with k = 2: Ham_k(4) = 1, Ham_k(5) = X, Ham_k(6) = 1, Ham_k(7) = 2, Ham_k(8) = X

• We could use the O(n√m log m) time algorithm for Hamming distance. . . but when k is small we can do much better
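
To make the definition of Ham_k(i) concrete, here is a minimal sketch that scans one alignment and gives up as soon as more than k mismatches have been seen. It only illustrates the output convention and is not the fast algorithm the lecture is building towards. Representing the output X as `None` is an assumption made for this example, as are the function name and the test strings.

```python
from typing import Optional


def hamming_k(text: str, pattern: str, i: int, k: int) -> Optional[int]:
    """Return Ham(i) if it is at most k, otherwise None (standing in for X)."""
    mismatches = 0
    for j in range(len(pattern)):
        if pattern[j] != text[i + j]:
            mismatches += 1
            if mismatches > k:
                return None  # "too many mismatches"
    return mismatches


# Hypothetical example strings with k = 1.
print([hamming_k("abcabd", "abd", i, 1) for i in range(4)])  # [1, None, None, 0]
```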

Pages 20-30: Pattern Matching Part Two: k-mismatches

LCP - the Longest Common Prefix

For any pair of locations i in T and j in P, LCP(i, j) is the largest ℓ such that

T[i . . . i+ℓ−1] = P[j . . . j+ℓ−1]

it’s the furthest you can go before hitting a mismatch

[Figure: the text T (length n) and the pattern P (length m) with locations i and j marked; the example queries shown return LCP(i, j) = 3, 4, 0 and 4]

We can preprocess P and T for LCP queries in O(n) time and O(n) space

Each query then takes O(1) time

we’ll see how to do this later in this lecture

First let’s see how we can use LCP queries to solve the k-mismatch problem. . .
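
The definition of LCP(i, j) can be read off directly as a character-by-character comparison. The sketch below does exactly that, so a query costs time proportional to the answer rather than the O(1) promised after preprocessing; it is meant only to pin down what a query returns. The function name and example strings are illustrative assumptions.

```python
def lcp(text: str, pattern: str, i: int, j: int) -> int:
    """Largest l such that text[i : i + l] == pattern[j : j + l] (naive scan)."""
    length = 0
    while (
        i + length < len(text)
        and j + length < len(pattern)
        and text[i + length] == pattern[j + length]
    ):
        length += 1  # still matching: go one character further
    return length


# Hypothetical example strings.
print(lcp("abcabd", "abd", 0, 0))  # 2  ("ab" matches, then c != d)
print(lcp("abcabd", "abd", 2, 0))  # 0  (c != a straight away)
```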

Pages 31-60: Pattern Matching Part Two: k-mismatches

k-mismatch using LCP queries

Find the leftmost (at most) k + 1 mismatches between T[i . . . i+m−1] and P

(we do this for each i separately)

We can do this using (at most) k + 1 LCP queries

each query takes O(1) time and finds a new mismatch

(or we reach the end of P)

[Figure: the text T with the pattern P aligned at position i, stepping through the queries below]

perform LCP(i, j), which returns 3: the next character is a mismatch!

skip over this mismatch to reach new locations i′ and j

perform LCP(i′, j), which returns 0: another mismatch, skip over it

perform LCP(i′, j), which returns 2: another mismatch, skip over it

perform LCP(i′, j), which returns 3

We can therefore calculate Ham_k(i) for a single alignment i in O(k) time

Overall this takes O(nk) time (including the ‘preprocessing’ for LCP queries)

this is pretty good but we can do better

but first. . . how do we answer those LCP queries?
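
The procedure on this slide (query, land on a mismatch, skip it, query again) can be sketched as follows. The LCP oracle is passed in as a parameter; the default used here is a naive character scan, so this sketch does not achieve the O(k) bound per alignment unless an O(1) oracle is supplied. The names `naive_lcp` and `hamming_k_via_lcp`, the use of `None` for the output X, and the example strings are all assumptions made for illustration.

```python
from typing import Callable, Optional


def naive_lcp(text: str, pattern: str, i: int, j: int) -> int:
    """Stand-in LCP oracle: direct comparison, not O(1)."""
    length = 0
    while (i + length < len(text) and j + length < len(pattern)
           and text[i + length] == pattern[j + length]):
        length += 1
    return length


def hamming_k_via_lcp(
    text: str,
    pattern: str,
    i: int,
    k: int,
    lcp: Callable[[str, str, int, int], int] = naive_lcp,
) -> Optional[int]:
    """Ham_k(i) using at most k + 1 LCP queries; None stands in for X.

    Assumes 0 <= i <= len(text) - len(pattern).
    """
    m = len(pattern)
    mismatches = 0
    j = 0  # current offset into the pattern
    while True:
        j += lcp(text, pattern, i + j, j)  # jump over the matching run
        if j >= m:
            return mismatches  # reached the end of P
        mismatches += 1  # T[i + j] != P[j]
        if mismatches > k:
            return None  # found k + 1 mismatches
        j += 1  # skip over the mismatch


# Hypothetical example strings with k = 1.
print([hamming_k_via_lcp("abcabd", "abd", i, 1) for i in range(4)])
# [1, None, None, 0]
```

Each loop iteration makes exactly one query and either finds a new mismatch or reaches the end of P, which is where the bound of k + 1 queries per alignment comes from.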

Page 61: Pattern Matching Part Two: k-mismatches

LCPs in Suffix Trees

TT b n aaa sn

n$

a

s$

nas$

nas$

s$

na

s$

bananas$

7$

b n aaa sn

n aaa sn

n aa sn

aa sn

a sn

a s

s

suffixes

$

$

$

$

$

$

$

0

1

2

3

4

5

6

$ 7 3

0

2 4

6nas$

This is the suffix tree

of this text

2

Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries

1

5

inO(n) prep. time and space

Page 62: Pattern Matching Part Two: k-mismatches

LCPs in Suffix Trees

TT b n aaa sn

n$

a

s$

nas$

nas$

s$

na

s$

bananas$

7$

b n aaa sn

n aaa sn

n aa sn

aa sn

a sn

a s

s

suffixes

$

$

$

$

$

$

$

0

1

2

3

4

5

6

$ 7 3

0

2 4

6nas$

This is the suffix tree

of this text

2

What is the LCA of the leaves representing suffixes i and j?

Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries

1

5

inO(n) prep. time and space

Page 63: Pattern Matching Part Two: k-mismatches

LCPs in Suffix Trees

TT b n aaa sn

n$

a

s$

nas$

nas$

s$

na

s$

bananas$

7$

b n aaa sn

n aaa sn

n aa sn

aa sn

a sn

a s

s

suffixes

$

$

$

$

$

$

$

0

1

2

3

4

5

6

$ 7 3

0

2 4

6nas$

This is the suffix tree

of this text

2

What is the LCA of the leaves representing suffixes i and j?

Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries

1

5

i j3

inO(n) prep. time and space

Page 64: Pattern Matching Part Two: k-mismatches

LCPs in Suffix Trees

TT b n aaa sn

n$

a

s$

nas$

nas$

s$

na

s$

bananas$

7$

b n aaa sn

n aaa sn

n aa sn

aa sn

a sn

a s

s

suffixes

$

$

$

$

$

$

$

0

1

2

3

4

5

6

$ 7 3

0

2 4

6nas$

This is the suffix tree

of this text

2

What is the LCA of the leaves representing suffixes i and j?

Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries

1

5

i j

LCA(i, j)

3

inO(n) prep. time and space

Page 65: Pattern Matching Part Two: k-mismatches

LCPs in Suffix Trees

TT b n aaa sn

n$

a

s$

nas$

nas$

s$

na

s$

bananas$

7$

b n aaa sn

n aaa sn

n aa sn

aa sn

a sn

a s

s

suffixes

$

$

$

$

$

$

$

0

1

2

3

4

5

6

$ 7 3

0

2 4

6nas$

This is the suffix tree

of this text

2

What is the LCA of the leaves representing suffixes i and j?

Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries

1

5

i j

LCA(i, j)

3

inO(n) prep. time and space

Page 66: Pattern Matching Part Two: k-mismatches

LCPs in Suffix Trees

TT b n aaa sn

n$

a

s$

nas$

nas$

s$

na

s$

bananas$

7$

b n aaa sn

n aaa sn

n aa sn

aa sn

a sn

a s

s

suffixes

$

$

$

$

$

$

$

0

1

2

3

4

5

6

$ 7 3

0

2 4

6nas$

This is the suffix tree

of this text

2

What is the LCA of the leaves representing suffixes i and j?

Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries

1

5

i j

LCA(i, j)

3

it’s the node representing the longest common prefix of T [i . . . n− 1] and T [j . . . n− 1]

inO(n) prep. time and space

Page 67: Pattern Matching Part Two: k-mismatches

LCPs in Suffix Trees

TT b n aaa sn

n$

a

s$

nas$

nas$

s$

na

s$

bananas$

7$

b n aaa sn

n aaa sn

n aa sn

aa sn

a sn

a s

s

suffixes

$

$

$

$

$

$

$

0

1

2

3

4

5

6

$ 7 3

0

2 4

6nas$

This is the suffix tree

of this text

2

What is the LCA of the leaves representing suffixes i and j?

Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries

1

5

i j

LCA(i, j)

3

it’s the node representing the longest common prefix of T [i . . . n− 1] and T [j . . . n− 1]

inO(n) prep. time and space

For any pair of locations i, j in T , LCPT (i, j) is the largest ` such that

T [i . . . i+ `− 1] = T [j . . . j + `− 1]

Single string LCP:

Page 68: Pattern Matching Part Two: k-mismatches

LCPs in Suffix Trees

TT b n aaa sn

n$

a

s$

nas$

nas$

s$

na

s$

bananas$

7$

b n aaa sn

n aaa sn

n aa sn

aa sn

a sn

a s

s

suffixes

$

$

$

$

$

$

$

0

1

2

3

4

5

6

$ 7 3

0

2 4

6nas$

This is the suffix tree

of this text

2

What is the LCA of the leaves representing suffixes i and j?

Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries

1

5

it’s the node representing the longest common prefix of T [i . . . n− 1] and T [j . . . n− 1]

LCA(i, j)

j

1

5

inO(n) prep. time and space

i

For any pair of locations i, j in T , LCPT (i, j) is the largest ` such that

T [i . . . i+ `− 1] = T [j . . . j + `− 1]

Single string LCP:

Page 69: Pattern Matching Part Two: k-mismatches

LCPs in Suffix Trees

TT b n aaa sn

n$

a

s$

nas$

nas$

s$

na

s$

bananas$

7$

b n aaa sn

n aaa sn

n aa sn

aa sn

a sn

a s

s

suffixes

$

$

$

$

$

$

$

0

1

2

3

4

5

6

$ 7 3

0

2 4

6nas$

This is the suffix tree

of this text

2

What is the LCA of the leaves representing suffixes i and j?

Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries

1

5

it’s the node representing the longest common prefix of T [i . . . n− 1] and T [j . . . n− 1]

LCA(i, j)

j

4

i

2

inO(n) prep. time and space

For any pair of locations i, j in T , LCPT (i, j) is the largest ` such that

T [i . . . i+ `− 1] = T [j . . . j + `− 1]

Single string LCP:

Page 73: Pattern Matching Part Two: k-mismatches

LCPs in Suffix Trees

[Figure: the suffix tree of T = bananas$ with edge labels such as a, na, s$, nas$ and bananas$; leaves are numbered by suffix start position 0–7, internal nodes are annotated with their root-to-node lengths, and LCA(i, j) is marked for two leaves i and j]

Build the suffix tree for T and preprocess it for LCA (Lowest Common Ancestor) queries, in O(n) preprocessing time and space.

What is the LCA of the leaves representing suffixes i and j? It's the node representing the longest common prefix of T[i … n−1] and T[j … n−1].

We store the root-to-node length at each internal node, so we can recover the length LCP_T(i, j) in O(1) time.

So we have O(n) space, O(n) preprocessing time and O(1) query time for the LCP problem on a single string.
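A full suffix tree with constant-time LCA is too long to show here, so the sketch below swaps in a different, well-known route to O(1) LCP queries: a suffix array, Kasai's LCP array, and a sparse table for range minima. This is not the slides' construction, and the preprocessing here is not O(n) (the suffix array is built by plain sorting, and the sparse table uses O(n log n) space); the name `build_lcp_oracle` is an illustrative assumption.

```python
def build_lcp_oracle(T: str):
    """Return lcp(i, j) for single-string LCP queries on T, O(1) per query."""
    n = len(T)
    # Suffix array by plain sorting: simple, but far from linear time.
    sa = sorted(range(n), key=lambda s: T[s:])
    rank = [0] * n
    for r, s in enumerate(sa):
        rank[s] = r

    # Kasai's algorithm: height[r] = LCP of the suffixes sa[r-1] and sa[r].
    height = [0] * n
    h = 0
    for s in range(n):
        if rank[s] > 0:
            p = sa[rank[s] - 1]
            while s + h < n and p + h < n and T[s + h] == T[p + h]:
                h += 1
            height[rank[s]] = h
            if h > 0:
                h -= 1
        else:
            h = 0

    # Sparse table: table[l][x] = min(height[x : x + 2**l]).
    table = [height]
    span = 1
    while 2 * span <= n:
        prev = table[-1]
        table.append([min(prev[x], prev[x + span])
                      for x in range(n - 2 * span + 1)])
        span *= 2

    def lcp(i: int, j: int) -> int:
        if i == j:
            return n - i
        lo, hi = sorted((rank[i], rank[j]))
        lo += 1  # the answer is min(height[lo+1 .. hi]) in the original ranks
        level = (hi - lo + 1).bit_length() - 1
        return min(table[level][lo], table[level][hi - (1 << level) + 1])

    return lcp


lcp = build_lcp_oracle("bananas$")
print(lcp(1, 3))  # LCP of "ananas$" and "anas$"
```

The minimum of the LCP array between two ranks plays the same role as the stored root-to-node length at the LCA in the suffix tree.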

Page 77: Pattern Matching Part Two: k-mismatches

LCPs in Suffix Trees

[Figure: the suffix tree of T = bananas$, with leaves numbered by suffix start position and LCA(i, j) marked for two leaves i and j]

So we have O(n) space, O(n) preprocessing time and O(1) query time for the LCP problem on a single string.

We can extend this to two strings (P and T) by first concatenating them together (and proceeding as for a single string).

So we also have O(n) space, O(n) preprocessing time and O(1) query time for the LCP problem on two strings.

I.e. for any i in T and j in P, LCP(i, j) is the largest ℓ such that

T[i … i+ℓ−1] = P[j … j+ℓ−1]   (as we originally defined it)
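The reduction from two strings to one can be sketched as an index translation. The separator character between P and T is an assumption added here so that a common prefix cannot run from the end of P into T; the slides only say the strings are concatenated. The inner `lcp_single` is a naive stand-in for the O(1) suffix tree query, and all names are illustrative.

```python
def make_two_string_lcp(P: str, T: str, sep: str = "\0"):
    """Return lcp(i, j) = LCP of T[i:] and P[j:], via the string S = P + sep + T."""
    if sep in P or sep in T:
        raise ValueError("separator must not occur in P or T")
    S = P + sep + T
    offset = len(P) + 1  # position in S where T begins

    def lcp_single(a: int, b: int) -> int:
        # Naive single-string LCP on S (stand-in for a constant-time query).
        length = 0
        while (a + length < len(S) and b + length < len(S)
               and S[a + length] == S[b + length]):
            length += 1
        return length

    def lcp(i: int, j: int) -> int:
        # Location i in T lives at offset + i in S; location j in P is unchanged.
        return lcp_single(offset + i, j)

    return lcp


lcp = make_two_string_lcp("abd", "abcaabdada")
print(lcp(4, 0))  # T[4:] = "abdada" against P = "abd"
```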

Page 94: Pattern Matching Part Two: k-mismatches

k-mismatch using frequent/infrequent symbols

Definition: A symbol is frequent if it occurs at least √k times in P, and infrequent otherwise.

[Figure: example pattern P of length m = 9 (positions 0–8) over the symbols a, b, c, d, with k = 4 (√k = 2)]

In the example, a is frequent, b is frequent, d is frequent; c is infrequent.

How many frequent symbols can there be? Lots! There could be up to m/√k frequent symbols, since each one occupies at least √k of the m pattern positions.

Case 1: There are fewer than 2√k frequent symbols in P.

Algorithm summary

Stage 0: Classify each symbol as frequent or infrequent - O(m log m) time

Stage 1: Count all matches involving frequent symbols (using cross-correlations as in last lecture) - O(n√k log m) time

Stage 2: Count all matches involving infrequent symbols (as in last lecture) - O(n√k) time

- O(n√k log m) total time
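Stage 0 can be sketched in a few lines. The pattern string below is an illustrative assumption (a 9-character pattern chosen so that a, b and d are frequent and c is not, as in the slide's example), and the function name is hypothetical. The comparison `count * count >= k` tests "at least √k occurrences" without floating point.

```python
from collections import Counter


def classify_symbols(P: str, k: int):
    """Split the symbols of P into (frequent, infrequent) sets.

    A symbol is frequent if it occurs at least sqrt(k) times in P.
    """
    counts = Counter(P)
    frequent = {s for s, count in counts.items() if count * count >= k}
    infrequent = set(counts) - frequent
    return frequent, infrequent


# Illustrative pattern (an assumption, not recovered from the slide figure).
frequent, infrequent = classify_symbols("adbcabbda", k=4)
print(sorted(frequent), sorted(infrequent))
```

Hash-based counting takes expected O(m) time; sorting the pattern instead gives the O(m log m) bound quoted on the slide without relying on hashing.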

Page 99: Pattern Matching Part Two: k-mismatches

Case 2: There are at least 2√k frequent symbols

Pick any 2√k frequent symbols and for each symbol pick √k occurrences in P.

This gives us 2k interesting pattern locations, denoted J.

[Figure: example with k = 4: a pattern P of length 14 (positions 0–13) over the symbols a–f, with the chosen occurrences highlighted]

a is frequent, b is frequent, c is frequent, d is frequent; e and f are infrequent

J = {0, 2, 3, 4, 5, 7, 9, 10}
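The selection of J can be sketched as follows. This assumes k is a perfect square (as in the example, k = 4) so that √k is an integer; the function name, the choice of the first symbols and first occurrences, and the test pattern are all illustrative assumptions, since the slides allow any choice.

```python
import math


def interesting_positions(P: str, k: int):
    """Return (J, by_symbol) for Case 2, or None if Case 1 applies.

    by_symbol maps each chosen frequent symbol to its chosen positions in P.
    """
    r = math.isqrt(k)
    if r * r != k:
        raise ValueError("this sketch assumes k is a perfect square")

    occurrences = {}
    for j, symbol in enumerate(P):
        occurrences.setdefault(symbol, []).append(j)

    frequent = [s for s, occ in occurrences.items() if len(occ) >= r]
    if len(frequent) < 2 * r:
        return None  # fewer than 2*sqrt(k) frequent symbols: Case 1

    # Any 2*sqrt(k) frequent symbols and any sqrt(k) occurrences of each will do.
    by_symbol = {s: occurrences[s][:r] for s in frequent[: 2 * r]}
    J = sorted(j for occ in by_symbol.values() for j in occ)
    return J, by_symbol


# Hypothetical pattern with four frequent symbols for k = 4.
print(interesting_positions("abcdabcdef", k=4))
```

Keeping the per-symbol lists (`by_symbol`) alongside J is what later lets each text character be processed in O(√k) time.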

Page 107: Pattern Matching Part Two: k-mismatches

Case 2: There are at least 2√k frequent symbols

Let d_k(i) be the number of j ∈ J such that P[j] = T[i + j],

i.e. the number of (single character) matches involving interesting pattern locations.

[Figure: the example pattern P aligned with a text T at i = 4, where d_k(i) = 3]

Fact: if d_k(i) < k then there are more than k mismatches (i.e. Ham_k(i) = X),

because there are 2k interesting positions and fewer than k of them match.

Fact: There are at most n/√k values of i with d_k(i) ≥ k;

this follows from a counting argument.

Page 111: Pattern Matching Part Two: k-mismatches

Case 2: There are at least 2√k frequent symbols

Fact: There are at most n/√k values of i with d_k(i) ≥ k

For any location i′, T[i′] = P[j] for either 0 or √k distinct j ∈ J.

This implies that

\[ \sum_{i} d_k(i) \;\le\; \sum_{i'} \sum_{j \in J} \mathrm{Eq}(T[i'], P[j]) \;\le\; n\sqrt{k} \]

where Eq(T[i′], P[j]) = 1 if T[i′] = P[j] (and 0 otherwise).

Page 116: Pattern Matching Part Two: k-mismatches

Case 2: There are at least 2√k frequent symbols

Fact: There are at most n/√k values of i with d_k(i) ≥ k

Assume that more than n/√k values of i have d_k(i) ≥ k.

So

\[ \sum_{i} d_k(i) \;\ge\; \left( \frac{n}{\sqrt{k}} + 1 \right) \cdot k \;>\; n\sqrt{k} \]

which contradicts the upper bound of n√k on the same sum.

Contradiction!

Page 121: Pattern Matching Part Two: k-mismatches

Case 2: There are at least 2√k frequent symbols

We can filter the text using the d_k(i) values, leaving only n/√k alignments to check; every other alignment has more than k mismatches.

Check each of the remaining alignments using LCP queries in O(k) time per alignment.

This takes n/√k · O(k) = O(n√k) total time.

How do we compute all the d_k(i) values?
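Checking one alignment with LCP queries can be sketched as below: each query jumps over a run of matching characters, so at most k + 1 queries are needed before the answer is known. The function names are illustrative assumptions, `None` stands in for the output X, and `lcp_naive` is only a stand-in for the O(1) two-string LCP query (with it, the check is not O(k)).

```python
def lcp_naive(T: str, P: str, i: int, j: int) -> int:
    """LCP of T[i:] and P[j:] by direct comparison (stand-in for an O(1) query)."""
    length = 0
    while (i + length < len(T) and j + length < len(P)
           and T[i + length] == P[j + length]):
        length += 1
    return length


def ham_k_at(T: str, P: str, i: int, k: int, lcp=lcp_naive):
    """Ham_k(i) for a valid alignment 0 <= i <= len(T) - len(P); None means X."""
    m = len(P)
    mismatches = 0
    j = 0
    while j < m:
        j += lcp(T, P, i + j, j)  # jump over the next run of matches
        if j == m:
            break                 # reached the end of P without a new mismatch
        mismatches += 1           # P[j] != T[i + j]
        if mismatches > k:
            return None           # more than k mismatches
        j += 1                    # step past the mismatching position
    return mismatches


print(ham_k_at("abcabdabd", "abd", 0, k=2))
```

Passing the LCP function as a parameter makes it easy to swap the naive stand-in for a constant-time structure.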

Page 124: Pattern Matching Part Two: k-mismatches

Computing all the d_k(i) values

Let d_k(i) = 0 for all i.

For each text character T[i′],

    For each j ∈ J such that P[j] = T[i′]   (store a list of j values for each symbol)

        increase d_k(i′ − j) by one

T[i′] = P[j] for either 0 or √k distinct j ∈ J, so this takes O(n√k) total time.
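The loop above translates almost directly into code. In this sketch `by_symbol` is the assumed per-symbol list of chosen positions (each chosen symbol mapped to its √k positions in J), and the function names are illustrative. Alignments outside 0 … n − m are skipped, a bounds check the slide leaves implicit.

```python
def compute_dk(T: str, m: int, by_symbol: dict) -> list:
    """d_k(i) for every alignment i in 0 .. len(T) - m."""
    n = len(T)
    dk = [0] * (n - m + 1)
    for i_prime, symbol in enumerate(T):
        # Either no chosen position carries this symbol, or exactly sqrt(k) do.
        for j in by_symbol.get(symbol, ()):
            i = i_prime - j
            if 0 <= i <= n - m:
                dk[i] += 1
    return dk


def surviving_alignments(dk: list, k: int) -> list:
    """Alignments that the filter cannot rule out, i.e. those with d_k(i) >= k."""
    return [i for i, d in enumerate(dk) if d >= k]


# Hypothetical data: P = "abcdabcdef", k = 4, J = {0, ..., 7}.
by_symbol = {"a": [0, 4], "b": [1, 5], "c": [2, 6], "d": [3, 7]}
dk = compute_dk("xxabcdabcdefxxabcdxxxx", m=10, by_symbol=by_symbol)
print(surviving_alignments(dk, k=4))
```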

Page 134: Pattern Matching Part Two: k-mismatches

Pattern matching with k-mismatches: putting it all together

Algorithm summary

Preprocess P, T for LCP queries - O(n) time

Count the number of frequent symbols in P - O(m log m) time

Case 1: P has fewer than 2√k frequent symbols

    Count matches with frequent symbols using cross-correlations - O(n√k log m) time

    Count matches with infrequent symbols directly - O(n√k) time

Case 2: P has at least 2√k frequent symbols

    Filter the text, leaving n/√k alignments - O(n√k) time

    Count mismatches at these alignments using LCP queries - O(n√k) time

Overall, we obtain a time complexity of O(n√k log m).

- this can be improved to O(n√k log k)
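Case 1 of the summary can be sketched as follows. The infrequent symbols are handled directly, as described; for the frequent symbols this sketch counts matches with a plain loop, which is only a stand-in for the FFT-based cross-correlation of the lecture and does not achieve the stated running time. The function name and test strings are illustrative assumptions.

```python
def case1_hamming(P: str, T: str, k: int) -> list:
    """Ham(i) for every alignment, counting matches symbol class by symbol class."""
    n, m = len(T), len(P)
    positions = {}
    for j, symbol in enumerate(P):
        positions.setdefault(symbol, []).append(j)

    matches = [0] * (n - m + 1)

    # Frequent symbols: one cross-correlation per symbol.
    # Computed naively here; the lecture uses FFTs for this step.
    for symbol, occ in positions.items():
        if len(occ) ** 2 >= k:
            for i in range(n - m + 1):
                matches[i] += sum(1 for j in occ if T[i + j] == symbol)

    # Infrequent symbols: each text character meets fewer than sqrt(k) positions.
    for i_prime, symbol in enumerate(T):
        occ = positions.get(symbol, ())
        if len(occ) ** 2 < k:
            for j in occ:
                i = i_prime - j
                if 0 <= i <= n - m:
                    matches[i] += 1

    # Mismatches = pattern length minus matches.
    return [m - x for x in matches]


print(case1_hamming("adbcabbda", "adbcabbdacadbcabbdb", k=4))
```

Reporting Ham_k(i) is then a matter of replacing every value above k by X.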

Page 135: Pattern Matching Part Two: k-mismatches

Conclusion

Input: A text string T (length n), a pattern string P (length m) and a positive integer k

Goal: For all i, output

\[ \mathrm{Ham}_k(i) = \begin{cases} \mathrm{Ham}(i) & \text{if } \mathrm{Ham}(i) \le k \\ X & \text{if } \mathrm{Ham}(i) > k \end{cases} \]

Output the number of mismatches... unless it's more than k (we interpret the output X to mean "too many mismatches")

[Figure: example text T (positions 0–12) and a pattern P of length 3, with k = 2 and Ham_k(6) = 1]

We saw two algorithms for this problem:

One algorithm takes O(nk) time

The other algorithm takes O(n√k log m) time (improvable to O(n√k log k) time)
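For reference, the problem statement itself can be written as a direct O(nm) computation. This is neither of the two algorithms from the lecture, only a minimal sketch of the definition that faster implementations can be tested against; the function name is an illustrative assumption and `None` stands in for the output X.

```python
def ham_k_all(T: str, P: str, k: int) -> list:
    """Ham_k(i) for every alignment i, straight from the definition."""
    n, m = len(T), len(P)
    out = []
    for i in range(n - m + 1):
        mismatches = sum(1 for j in range(m) if T[i + j] != P[j])
        out.append(mismatches if mismatches <= k else None)
    return out


print(ham_k_all("abcabdabd", "abd", k=2))
```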