1 A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Scoring Matrices Maxime Crochemore Michal Ziv Ukelson Gad M. Landau

Sequence Alignment Algorithm A Sub-quadraticmichaluz/seminar/lecture3.pdf · Sequence Alignment Algorithm for Unrestricted Scoring Matrices Maxime Crochemore ... Computing the Optimal

Apr 13, 2018


A Sub-quadratic

Sequence Alignment Algorithm for Unrestricted Scoring Matrices

Maxime Crochemore

Michal Ziv Ukelson

Gad M. Landau

The Sequence Alignment Problem

Compare two strings A and B and measure their similarity by finding the optimal alignment between them.

The alignment is classically based on the transformation of one sequence into the other, via operations of substitutions, insertions, and deletions (indels).

A = c t a c g a g a c
B = a a c g a c g a t

The Scoring Matrix

       -    a    c    g    t
  -        -1   -1   -1   -1
  a   -1    1   -1   -1   -1
  c   -1   -1    1   -1   -1
  g   -1   -1   -1    1   -1
  t   -1   -1   -1   -1    1

Two Sequence Alignment Problems

A = c t a c g a g a c
B = a a c g a c g a t

Global Alignment. Value: 2
Local Alignment. Value: 5

Both values are computed with the scoring matrix shown earlier (match 1, mismatch -1, indel -1).

The O(n^2) time, Classical Dynamic Programming Algorithm

[Figure: The Alignment Graph. Rows are labeled by A = c t a c g a g a c (|A| = n), columns by B = a a c g a c g a t (|B| = n), with vertices indexed 0 to 9 along each axis.]

Computing the Optimal Global Alignment Value

Classical Dynamic Programming: O(n^2)

[Figure: the alignment graph of A and B. Legend: one edge type has score 1, the other has score -1.]

Computing an Optimal Local Alignment Value

Classical Dynamic Programming: O(n^2)

[Figure: the alignment graph of A and B. Legend: one edge type has score 1, the other has score -1.]
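As a point of reference for the classical O(n^2) computation, here is a minimal Python sketch of both alignment values. The function names and the row-by-row layout are illustrative choices rather than anything taken from the slides; the scoring (match 1, mismatch -1, indel -1) is the slides' scoring matrix.

```python
GAP = -1  # indel score, from the "-" row and column of the scoring matrix


def score(x, y):
    """Substitution score from the slides' matrix: 1 for a match, -1 otherwise."""
    return 1 if x == y else -1


def global_alignment_value(a, b):
    """Optimal global alignment value, filling the DP table one row at a time."""
    prev = [j * GAP for j in range(len(b) + 1)]  # row 0: only indels
    for i in range(1, len(a) + 1):
        curr = [i * GAP]  # column 0: only indels
        for j in range(1, len(b) + 1):
            curr.append(max(prev[j] + GAP,          # vertical edge
                            curr[j - 1] + GAP,      # horizontal edge
                            prev[j - 1] + score(a[i - 1], b[j - 1])))  # diagonal edge
        prev = curr
    return prev[-1]


def local_alignment_value(a, b):
    """Optimal local alignment value: a path may start and end at any vertex."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            value = max(0,  # start a fresh alignment at this vertex
                        prev[j] + GAP,
                        curr[j - 1] + GAP,
                        prev[j - 1] + score(a[i - 1], b[j - 1]))
            curr.append(value)
            best = max(best, value)
        prev = curr
    return best


A = "ctacgagac"
B = "aacgacgat"
print(global_alignment_value(A, B))  # 2
print(local_alignment_value(A, B))   # 5
```

Every vertex takes the maximum over its three incoming edges, so the running time is proportional to the number of vertices, which is the quadratic cost the rest of the talk sets out to beat.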

The O(n^2) time, Classical Dynamic Programming Algorithm

[Figure: The Alignment Graph, with one cell enlarged to show its three input vertices I1, I2, I3 and its output vertex O.]

O = max_{x=1..3} (Ix + edge[Ix, O])

Can the quadratic complexity of the optimal alignment value computation be reduced without relaxing the problem?

Previous Results [Masek and Paterson 1980]

- An O(n^2 / log n) time global alignment algorithm.
- Constant size alphabet.
- Scoring matrix values restricted to rational numbers.

Open Problem [Masek and Paterson 1980]

Can a better algorithm be found for the constant alphabet case, which does not restrict the scoring matrix values?

Acceleration by Text Compression:

Compress the sequences in order to speed up the alignment process.

LZ78 Parsing: Each phrase is the longest matching phrase seen previously, plus one character.

B = a a c g a c g a t

  Phrase   1        2        3        4        5
  Pair     (0, a)   (1, c)   (0, g)   (2, g)   (1, t)
  Text     a        ac       g        acg      at

[Figure: Trie for B. Root 0 has children 1 (edge a) and 3 (edge g); node 1 has children 2 (edge c) and 5 (edge t); node 2 has child 4 (edge g).]
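The parsing rule can be sketched in a few lines of Python. This is an illustrative implementation, not code from the talk: the function name is made up, and the trie is stored as a dictionary keyed by (parent phrase number, character), with 0 standing for the empty phrase at the root.

```python
def lz78_parse(text):
    """Parse text into LZ78 pairs (number of the longest earlier phrase, next character)."""
    trie = {}      # (parent phrase number, character) -> phrase number
    phrases = []   # phrase k is phrases[k - 1]
    node = 0       # current trie node; 0 is the root (empty phrase)
    for ch in text:
        child = trie.get((node, ch))
        if child is not None:
            node = child                        # keep extending the current match
        else:
            phrases.append((node, ch))          # longest match plus one character
            trie[(node, ch)] = len(phrases)
            node = 0                            # restart from the root
    if node != 0:
        phrases.append((node, ""))              # text ended inside a known phrase
    return phrases


print(lz78_parse("aacgacgat"))
# [(0, 'a'), (1, 'c'), (0, 'g'), (2, 'g'), (1, 't')]
```

Each character triggers one dictionary lookup, so the parse and its trie are built in time linear in the length of the sequence.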


Ziv and Lempel

Theorem 1 [Lempel and Ziv 1976].

Given a sequence S of size n over a constant alphabet, the maximal number of phrases obtained by any scheme which parses S into distinct phrases is O(n / log n).

Theorem 2 [Ziv and Lempel 1978].

Given a sequence S of size n over a constant alphabet, the number of phrases obtained by LZ78 parsing of S is O(h n / log n), where h <= 1.

For most texts, h is the entropy of the text, which is a measure of how "compressible" the text is.

[Figure: the alignment graph of A and B, partitioned into blocks by the LZ78 phrases of A (rows) and B (columns).]

The full alignment graph has O(n^2) vertices.

The block borders consist of O(h n / log n) rows of n vertices + O(h n / log n) columns of n vertices.

Our Results:

O(h n^2 / log n) algorithm for computing the Optimal Global Alignment Value and the Optimal Local Alignment Value.

Reminder: h <= 1, scoring matrix entries may be arbitrary real numbers.

The work for each block.

[Figure: block G = 5/4 of the partitioned alignment graph (phrase 5 of A against phrase 4 of B), with input border vertices I1 to I6 on its left and top borders and output border vertices O1 to O6 on its bottom and right borders.]

t = |I| = |O| = 6

The work for each block is done in O(t) time.
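A short sketch of how O(t) work per block adds up to the overall bound. The symbols p_A and p_B (phrase counts of A and B), |a_i| and |b_j| (phrase lengths), and t_G (border size of block G) are notation introduced here for illustration; the only inputs are the O(h n / log n) phrase bound and the O(t) cost per block stated on the slides.

```latex
% Block G pairs phrase a_i of A with phrase b_j of B, so t_G = |a_i| + |b_j| + 1
% (in the example above: 2 + 3 + 1 = 6).
\sum_{G} t_G
  = \sum_{i=1}^{p_A} \sum_{j=1}^{p_B} \bigl( |a_i| + |b_j| + 1 \bigr)
  = p_B \, n + p_A \, n + p_A \, p_B
  = O\!\left( \frac{h \, n^2}{\log n} \right),
\qquad \text{since } p_A, p_B = O\!\left( \frac{h \, n}{\log n} \right) \text{ and } h \le 1 .
```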

Computing the score for Output Border Vertex O4

Ix = the weight of an optimal path from vertex (0,0) to vertex x of I.

O4 = the weight of an optimal path from vertex (0,0) to vertex 4 of O.

DIST[x,y] = the weight of an optimal path from vertex Ix to vertex Oy.

  Input   DIST        Input + DIST
  I1      DIST[1,4]   I1 + DIST[1,4]
  I2      DIST[2,4]   I2 + DIST[2,4]
  I3      DIST[3,4]   I3 + DIST[3,4]
  I4      DIST[4,4]   I4 + DIST[4,4]
  I5      DIST[5,4]   I5 + DIST[5,4]
  I6      DIST[6,4]   I6 + DIST[6,4]

O4 = max_{x=1..6} (Ix + DIST[x,4])

Standard, single-cell DP

O = max_{x=1..3} (Ix + edge[Ix, O])

New, extended-cell DP

O4 = max_{x=1..6} (Ix + DIST[x,4])

Computing the score for Output Border Vertex O4

O4 = max_{x=1..6} (Ix + DIST[x,4])

[Figure: block G with input border values I1 = 1, I2 = 2, I3 = 3, I4 = 2, I5 = 1, I6 = 3. Legend: one edge type has score 1, the other has score -1.]

  x   I[x]   DIST[x,4]   OUT[x,4] = I[x] + DIST[x,4]
  1    1       -3          -2
  2    2       -1           1
  3    3        1           4
  4    2        0           2
  5    1        0           1
  6    3       -2           1

O4 = 4, the maximum of the OUT[*,4] column.
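The O4 example can be replayed directly. The sketch below only restates the slide's numbers in Python; the variable names are illustrative.

```python
# Input border values I1..I6 and the DIST column for O4, as given on the slide.
I = [1, 2, 3, 2, 1, 3]
DIST_COLUMN_4 = [-3, -1, 1, 0, 0, -2]

# OUT[x,4] = Ix + DIST[x,4] for every input border vertex x.
OUT_COLUMN_4 = [i_x + dist for i_x, dist in zip(I, DIST_COLUMN_4)]

# O4 is the maximum of that OUT column.
O4 = max(OUT_COLUMN_4)

print(OUT_COLUMN_4)  # [-2, 1, 4, 2, 1, 1]
print(O4)            # 4
```

Evaluated this way, one output vertex costs O(t) and a whole block costs O(t^2); bringing the whole block down to O(t) is what the column-maxima machinery is for.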

OUT[x,j] = Ix + DIST[x,j]

Input vector I: I1 = 1, I2 = 2, I3 = 3, I4 = 2, I5 = 1, I6 = 3

[Figure: the input vector I beside the DIST matrix of block G, and the output vector O = (O1, ..., O6) on the bottom and right borders of the block.]

Output vector O values are set to the OUT matrix column maxima.

The Main Challenges

1. How to compute the column maxima of OUT in O(t) time? (Utilize the Total Monotonicity property of OUT.)
2. How to obtain the DIST for G in O(t) time? (Take advantage of the incremental nature of LZ78 parsing.)

The Total Monotonicity Property [Aggarwal et al. 1987].

For any a < b and c < d:

OUT[b,c] >= OUT[a,c] ===> OUT[b,d] >= OUT[a,d]

[Figure: paths from vertex (0,0) through input border vertices a and b of I to output border vertices c and d of O.]

The Total Monotonicity property of OUT

Proof:

[Figure: the optimal path through a to d and the optimal path through b to c cross at a common vertex. X and Y are the weights of the two paths up to that vertex (through a and b, respectively); W and Z are the weights from that vertex on to d and c.]

OUT[a,c] is optimal, and therefore OUT[a,c] >= X + Z.

OUT[b,d] is optimal, and therefore OUT[b,d] >= Y + W.

OUT[b,c] = Y + Z >= OUT[a,c] >= X + Z ===> Y + Z >= X + Z ===> Y >= X.

Therefore, OUT[b,d] >= Y + W >= X + W = OUT[a,d].

How does Total Monotonicity affect Column Maxima behavior?

For all a < b and c < d: OUT[a,c] <= OUT[b,c] ===> OUT[a,d] <= OUT[b,d]

Column maxima row indices are monotonically non-decreasing.

SMAWK Matrix Searching [Aggarwal et al. 87].

The t column maxima of a Totally Monotone array can be computed in O(t) time, by querying only O(t) elements.
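To make the effect of monotone column maxima concrete, here is a Python sketch of a simpler divide-and-conquer search. It is explicitly not SMAWK: it runs in O(t log t) rather than O(t), and is included only because it relies on the same property (row indices of column maxima never decrease from left to right). The function name and the test matrix are illustrative assumptions.

```python
def column_maxima_monotone(matrix):
    """Row index of the maximum of each column, assuming those indices are non-decreasing."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    argmax = [0] * n_cols

    def solve(col_lo, col_hi, row_lo, row_hi):
        if col_lo > col_hi:
            return
        mid = (col_lo + col_hi) // 2
        best = row_lo
        for r in range(row_lo, row_hi + 1):      # scan only the allowed row range
            if matrix[r][mid] > matrix[best][mid]:
                best = r
        argmax[mid] = best
        solve(col_lo, mid - 1, row_lo, best)     # columns to the left peak at or above best
        solve(mid + 1, col_hi, best, row_hi)     # columns to the right peak at or below best

    solve(0, n_cols - 1, 0, n_rows - 1)
    return argmax


# A small totally monotone matrix: column j peaks at row j.
M = [[-(i - j) ** 2 for j in range(6)] for i in range(6)]
print(column_maxima_monotone(M))  # [0, 1, 2, 3, 4, 5]
```

The middle column's maximum splits the remaining work into two independent sub-rectangles. SMAWK goes further by also discarding rows that cannot hold any column maximum, which removes the logarithmic factor.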

The Rectangle Problem

[Figure: the OUT matrix for block G.]

Complementing the undefined OUT entries (without introducing new column maxima)

1. Upper Right Triangle. All values are set to -infinity.
2. Lower Left Triangle. Let k denote the maximal absolute value of a score in the scoring matrix. OUT[i,j] in the lower left triangle will be set to -(n+i+1)*k.

For all a < b and c < d: OUT[a,c] <= OUT[b,c] ===> OUT[a,d] <= OUT[b,d]


Complementing the undefined OUT entries,

without changing its Total Monotonicity property,

and without introducing new column maxima.

OUT Matrix


Utilizing the incremental nature of LZ78 parsing for efficient DIST construction.

Block G = (5,4) pairs phrase 5 of A with phrase 4 of B, where phrase 4 = (2, g). Its prefix blocks are:

- left prefix block of G: (5,2)
- diagonal prefix block of G: (3,2)
- top prefix block of G: (3,4)

[Figure: block G = (5,4) in the partitioned alignment graph, shown together with its left, diagonal, and top prefix blocks.]

Only one new DIST column needs to be computed for each block, and this DIST column is computed in O(t) time.

[Figure: block G = (5,4) with its left prefix (5,2), diagonal prefix (3,2), and top prefix (3,4) blocks. The new DIST column shown has entries -3, -1, 1, 0, 0, -2.]
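To see what a DIST column contains, the sketch below computes the column for the bottom-right output vertex of a block by plain dynamic programming inside the block. This is a naive O(rows x columns) check, not the O(t) incremental construction from the talk. The function name and the vertex ordering (left border from bottom to top, then top border from left to right) are assumptions chosen to match the I1 to I6 labels of the example block.

```python
GAP = -1  # indel score from the slides' scoring matrix


def score(x, y):
    """Substitution score from the slides' matrix: 1 for a match, -1 otherwise."""
    return 1 if x == y else -1


def dist_column_to_corner(a_phrase, b_phrase):
    """Best path weight from each input border vertex to the block's bottom-right vertex."""
    rows, cols = len(a_phrase), len(b_phrase)
    # best[r][c] = weight of an optimal path from vertex (r, c) to vertex (rows, cols)
    best = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows, -1, -1):
        for c in range(cols, -1, -1):
            if r == rows and c == cols:
                continue
            options = []
            if r < rows:
                options.append(best[r + 1][c] + GAP)      # vertical edge
            if c < cols:
                options.append(best[r][c + 1] + GAP)      # horizontal edge
            if r < rows and c < cols:
                options.append(best[r + 1][c + 1] + score(a_phrase[r], b_phrase[c]))
            best[r][c] = max(options)
    left_border = [best[r][0] for r in range(rows, -1, -1)]   # bottom-left up to top-left
    top_border = [best[0][c] for c in range(1, cols + 1)]     # then along the top row
    return left_border + top_border


# Block G = (5,4): phrase 5 of A is "ag", phrase 4 of B is "acg".
print(dist_column_to_corner("ag", "acg"))  # [-3, -1, 1, 0, 0, -2]
```

The result reproduces the DIST[*,4] column used in the O4 example, which suggests the vertex ordering assumed here agrees with the slides.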

Accessing a Prefix Block in Constant time.

[Figure: block G = (5,4) with its left prefix (5,2), diagonal prefix (3,2), and top prefix (3,4) blocks, alongside the trie for A and the trie for B.]


Presentation

• R89922024 蘇展弘

• B86202049 葉恆青

• R90725054 呂育恩

• R90922001 張文亮

• R90922091 游騰楷

Data Structure

[Figure: the trie for A, the trie for B, and the DIST(5,4) matrix, next to the alignment graph of A and B.]

Construction

[Figure: the trie for A, the trie for B, and the DIST(5,4) matrix, next to the alignment graph of A and B.]

Time and Space Complexity

• Compute the new column.
• Build the DIST vector (that is, find all the columns of this DIST matrix).
• Use SMAWK to compute the output maxima from this DIST (plus the input).

O(t)

[Figure: Data Structure. The tries for A and B and the DIST(5,4) matrix.]


Summary of Results:

Global Alignment Problem.

- An O(h n^2 / log n) time and space complexity algorithm for computing the optimal global alignment value.
- After the optimal value has been computed, an optimal alignment trace can be recovered in time linear in its size.

Local Alignment Problem.

- An O(h n^2 / log n) time and space complexity algorithm for computing the optimal local alignment value.
- After the optimal value has been computed, given a vertex whose score is maximal, an optimal alignment trace ending in the vertex can be recovered in time linear in its size.

Open Problems:

We showed an O(h n^2 / log n) time and space complexity algorithm for computing the optimal global and local alignment values of two strings.

In the paper we show how to reduce the space complexity to O(h^2 n^2 / log^2 n).

Can the space requirement of the algorithm be further reduced, without impairing its sub-quadratic time complexity?


Thank You !