Pairwise Sequence Comparison Stat 246, Spring 2002, Week 5,

Pairwise Sequence Comparison

Jan 25, 2016


Sumya Sumya

Page 1: Pairwise Sequence Comparison

Pairwise Sequence Comparison

Stat 246, Spring 2002, Week 5,

Page 2: Pairwise Sequence Comparison

Sequence comparison: topics

General concepts

Dot plots

Global alignments

Scoring matrices

Gap penalties

Dynamic programming

Chance or common ancestry?

Page 3: Pairwise Sequence Comparison

Dot Plot

This is the earliest, simplest and most complete method for comparing two sequences

It is possible to filter the plot to minimise noise whilst preserving the obvious relationship

This plot can identify

• regions of similarity

• internal repeats

• rearrangement events

Page 4: Pairwise Sequence Comparison

[Figure: dot-plot grid with Sequence 2 (b = A C A C A C T A) along the top and Sequence 1 (a = A G C A C A C A) down the side.]

A dot goes where the two sequences match.

Sequence 1 down; Sequence 2 along.

(Add a “guard” row and column.)

Connect the dots along diagonals.

Page 5: Pairwise Sequence Comparison

Extensions to dot plots

Modern dot plots are more sophisticated, using the notions of

window : size of diagonal strip centered on an entry, over which matching is accumulated, and

stringency: the extent of agreement required over the window, before a dot is placed at the central entry.

e.g. for a window of size 5, we might require at least 3 matches, and then we put a dot in the central spot. More complex scoring rules can be used.
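The window and stringency rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `dot_plot` is hypothetical, the window is centred by integer division (so an even window sits slightly off centre), and positions that fall outside either sequence simply count as non-matches.

```python
def dot_plot(seq1, seq2, window=1, stringency=1):
    """Return a matrix of booleans with seq1 down and seq2 along.

    A dot (True) is placed at (i, j) when at least `stringency` of the
    `window` diagonal positions centred on (i, j) are matches.
    With window=1 and stringency=1 this is the plain dot plot.
    """
    half = window // 2
    n, m = len(seq1), len(seq2)
    plot = [[False] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            matches = 0
            # Walk along the diagonal strip centred on entry (i, j).
            for k in range(-half, window - half):
                ii, jj = i + k, j + k
                if 0 <= ii < n and 0 <= jj < m and seq1[ii] == seq2[jj]:
                    matches += 1
            plot[i][j] = matches >= stringency
    return plot


def render(plot):
    """Draw the plot as text: '*' for a dot, '.' for an empty cell."""
    return "\n".join("".join("*" if dot else "." for dot in row) for row in plot)


# Plain dot plot of the two example sequences (Sequence 1 down, Sequence 2 along).
print(render(dot_plot("AGCACACA", "ACACACTA")))
```

Raising the stringency for a fixed window thins the plot, which is the effect visible in the LDL receptor plots on the following slides.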

Page 6: Pairwise Sequence Comparison

Human globin vs. human myoglobin

[Figure: dot plot; one axis is beta-human.pep (ck: 1,242, 1 to 146).]

Page 7: Pairwise Sequence Comparison

Human LDL receptor vs. itself (w=30, s=9)

[Figure: dot plot of ldlrecep.pep (ck: 3,641, 1 to 860) against itself; both axes run from 0 to 800.]

Page 8: Pairwise Sequence Comparison

Human LDL receptor vs. itself (40, 15)

[Figure: dot plot of ldlrecep.pep (ck: 3,641, 1 to 860) against itself; both axes run from 0 to 800.]

COMPARE Window: 40 Stringency: 15.0 Points: 5,287

Page 9: Pairwise Sequence Comparison

Human LDL receptor vs. itself (40, 17.5)

[Figure: dot plot of ldlrecep.pep (ck: 3,641, 1 to 860) against itself; both axes run from 0 to 800.]

COMPARE Window: 40 Stringency: 17.5 Points: 3,079

Page 10: Pairwise Sequence Comparison

Human LDL receptor vs. itself (40, 20)

[Figure: dot plot of ldlrecep.pep (ck: 3,641, 1 to 860) against itself; both axes run from 0 to 800.]

COMPARE Window: 40 Stringency: 20.0 Points: 2,295

Page 11: Pairwise Sequence Comparison

Plasmodium falciparum MSP3 vs. itself (30,9)

[Figure: dot plot of msp3.pep (ck: 4,247, 1 to 380) against itself; both axes run from 0 to 300.]

Page 12: Pairwise Sequence Comparison

Plasmodium falciparum MSP3 vs. itself (20,9)

[Figure: dot plot of msp3.pep (ck: 4,247, 1 to 380) against itself; both axes run from 0 to 300.]

COMPARE Window: 20 Stringency: 9.0 Points: 15,619

Page 13: Pairwise Sequence Comparison

Plasmodium falciparum MSP3 vs. itself (10,9)

[Figure: dot plot of msp3.pep (ck: 4,247, 1 to 380) against itself; both axes run from 0 to 300.]

COMPARE Window: 10 Stringency: 9.0 Points: 1,263

Page 14: Pairwise Sequence Comparison

Global alignment

An alignment of two sequences a and b is an arrangement of a and b by position, where a and b can be padded with gap symbols to achieve the same length:

      Left             Right
a:    AGCACAC-A   or   AG-CACACA
b:    A-CACACTA        ACACACT-A

If we read the alignment column-wise, we have a protocol of edit operations that lead from a to b.

Left:             Right:
Match (A,A)       Match (A,A)
Delete (G,-)      Replace (G,C)
Match (C,C)       Insert (-,A)
Match (A,A)       Match (C,C)
Match (C,C)       Match (A,A)
Match (A,A)       Match (C,C)
Match (C,C)       Replace (A,T)
Insert (-,T)      Delete (C,-)
Match (A,A)       Match (A,A)

The left-hand alignment shows one Delete, one Insert, and the other edit operations are Matches.

The right-hand alignment shows one Insert, one Delete, two Replaces, and five Matches.
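Reading an alignment column by column is mechanical, so it can be sketched in code. The helper below is a hypothetical illustration (the name `edit_protocol` and the `-` gap symbol are assumptions); it takes the two padded rows of an alignment and returns the edit operations in order.

```python
def edit_protocol(row_a, row_b, gap="-"):
    """List the edit operations read column-wise from an alignment of a and b."""
    if len(row_a) != len(row_b):
        raise ValueError("both rows of an alignment must have the same length")
    operations = []
    for u, v in zip(row_a, row_b):
        if u == gap and v == gap:
            raise ValueError("a column may not contain two gap symbols")
        if v == gap:
            name = "Delete"    # a letter of a has no partner in b
        elif u == gap:
            name = "Insert"    # a letter of b has no partner in a
        elif u == v:
            name = "Match"
        else:
            name = "Replace"
        operations.append((name, u, v))
    return operations


# The two example alignments of a = AGCACACA and b = ACACACTA.
for name, u, v in edit_protocol("AGCACAC-A", "A-CACACTA"):
    print(f"{name} ({u},{v})")
```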

Page 15: Pairwise Sequence Comparison

Cost (scoring) of global alignments; optimal global alignments

Next we turn the edit protocol into a measure of distance by assigning a “cost” or “weight” S to each operation. For example, for arbitrary characters u,v from A we may define

S(u,u) = 0; S(u,v) = 1 for u ≠ v; S(u,-) = S(-,v) = 1. (Unit Cost)

This scheme is known as the Levenshtein distance, also called the unit cost model. Its predominant virtue is its simplicity. In general, more sophisticated cost models must be used. For example, replacing an amino acid by a biochemically similar one should weigh less than a replacement by an amino acid with totally different properties. Details shortly. Now we are ready to define the most important notion for sequence analysis:

The cost of an alignment of two sequences a and b is the sum of the costs of all the edit operations that lead from a to b.

An optimal alignment of a and b is an alignment which has minimal cost among all possible alignments.

The edit distance of a and b is the cost of an optimal alignment of a and b under a cost function S. We denote it by d(a,b).

Using the unit cost model for S in our previous example, we obtain the following cost:

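a:    AGCACAC-A   or   AG-CACACA
b:    A-CACACTA        ACACACT-A
cost: 2                cost: 4

Here it is easily seen that the left-hand alignment is optimal under the unit cost model: a and b have the same length and differ at more than one position, so no single edit operation transforms a into b. Hence the edit distance d(a,b) = 2.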
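The cost of a given alignment is just a sum over its columns. A minimal sketch, assuming the unit cost model above and `-` as the gap symbol (the function names are hypothetical):

```python
def unit_cost(u, v):
    """Unit cost model: 0 for a match, 1 for a replacement, insertion or deletion."""
    return 0 if u == v else 1


def alignment_cost(row_a, row_b, cost=unit_cost):
    """Sum the costs of the edit operations read column-wise from an alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("both rows of an alignment must have the same length")
    return sum(cost(u, v) for u, v in zip(row_a, row_b))


print(alignment_cost("AGCACAC-A", "A-CACACTA"))  # left-hand alignment
print(alignment_cost("AG-CACACA", "ACACACT-A"))  # right-hand alignment
```

Passing a different `cost` function is how a more sophisticated cost model would plug in; this only scores a given alignment and does not search for the optimal one.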

Page 16: Pairwise Sequence Comparison

More general scores = - costs: see later.

C 9

S -1 4

T -1 1 5

P -3 -1 -1 7

A 0 1 0 -1 4

G -3 0 -2 -2 0 6

N -3 1 0 -2 -2 0 6

D -3 0 -1 -1 -2 -1 1 6

E -4 0 -1 -1 -1 -2 0 2 5

Q -3 0 -1 -1 -1 -2 0 0 2 5

H -3 -1 -2 -2 -2 -2 1 -1 0 0 8

R -3 -1 -1 -2 -1 -2 0 -2 0 1 0 5

K -3 0 -1 -1 -1 -2 0 -1 1 1 -1 2 5

M -1 -1 -1 -2 -1 -3 -2 -3 -2 0 -2 -1 -1 5

I -1 -2 -1 -3 -1 -4 -3 -3 -3 -3 -3 -3 -3 1 4

L -1 -2 -1 -3 -1 -4 -3 -4 -3 -2 -3 -2 -2 2 2 4

V -1 -2 0 -2 0 -3 -3 -3 -2 -2 -3 -3 -2 1 3 1 4

F -2 -2 -2 -4 -2 -3 -3 -3 -3 -3 -1 -3 -3 0 0 0 -1 6

Y -2 -2 -2 -3 -2 -3 -2 -3 -2 -1 2 -2 -2 -1 -1 -1 -1 3 7

W -2 -3 -2 -4 -3 -2 -4 -4 -3 -2 -2 -3 -3 -1 -3 -2 -3 1 2 11

C S T P A G N D E Q H R K M I L V F Y W

134 LQQGELDLVMTSDILPRSELHYSPMFDFEVRLVLAPDHPLASKTQITPEDLASETLLI
    |     |||         |       |        ||||||    |    || ||
137 LDSNSVDLVLMGVPPRNVEVEAEAFMDNPLVVIAPPDHPLAGERAISLARLAEETFVM

D:D = +6

D:R = -2

From Henikoff 1996

Page 17: Pairwise Sequence Comparison

Scoring Matrices

Physical/Chemical similarities

comparing two sequences according to the properties of their residues may highlight regions of structural similarity

Identity matrices

by stressing only identities in the alignment, stretches of sequence that may have diverged will not penalise any remaining common features

Page 18: Pairwise Sequence Comparison

Scoring Matrices (ctd)

As the direct source of residue-by-residue comparison scores, the scoring matrix you choose will have a major impact on the alignment calculated

The most commonly used will be one of the mutation matrices

PAM or BLOSUM

Von Bing will explain the derivation of these and other mutation matrices next Tuesday.

The matrix that performs best will be the matrix that best reflects the evolutionary separation of the sequences being aligned.

Page 19: Pairwise Sequence Comparison

Statistical motivation for alignment scores

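Alignment (the data):

AGCTGATCA...
AACCGGTTA...

Hypotheses:

H = homologous (indep. sites, Jukes-Cantor)

R = random (indep. sites, equal freq.)

$\Pr(\text{data} \mid H) = \Pr\left(\tbinom{A}{A} \mid H\right) \times \Pr\left(\tbinom{G}{A} \mid H\right) \times \dots = (1-p)^{a}\, p^{d}$

where d = # disagreements, a = # agreements, and $p = \tfrac{3}{4}\left(1 - e^{-8\alpha t}\right)$.

$\Pr(\text{data} \mid R) = \Pr\left(\tbinom{A}{A} \mid R\right) \times \Pr\left(\tbinom{G}{A} \mid R\right) \times \dots = \left(\tfrac{1}{4}\right)^{a} \left(\tfrac{3}{4}\right)^{d}$

$\log\left\{ \dfrac{\Pr(\text{data} \mid H)}{\Pr(\text{data} \mid R)} \right\} = a \log\dfrac{1-p}{1/4} + d \log\dfrac{p}{3/4}$. Since $p < \tfrac{3}{4}$, $\log\dfrac{p}{3/4} < 0$ and $\log\dfrac{1-p}{1/4} > 0$.

score $= a\,\sigma + d\,(-\mu)$, where $\sigma = \log\dfrac{1-p}{1/4} > 0$ is the match score and $-\mu = \log\dfrac{p}{3/4} < 0$ is the mismatch penalty.

Note that if $t \approx 0$, then $p \approx 6\alpha t$ and $1-p \approx 1$, so $\sigma \approx \log 4$, while $-\mu \approx \log 8\alpha t$ is large and negative: a big difference in the two scores.

Conversely, if t is large, $p = \tfrac{3}{4}(1-\varepsilon)$ with $\varepsilon = e^{-8\alpha t}$ small, so $\dfrac{p}{3/4} = 1-\varepsilon$ and $\log(1-\varepsilon) \approx -\varepsilon$, while $1-p = \tfrac{1}{4}(1+3\varepsilon)$, $\dfrac{1-p}{1/4} = 1+3\varepsilon$, and so $\log(1+3\varepsilon) \approx 3\varepsilon$. Thus the scores are about 3:1.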
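The two limiting cases can be checked numerically. The sketch below assumes the Jukes-Cantor form p = (3/4)(1 - exp(-8αt)) used on this slide and takes the product αt as a single hypothetical argument `alpha_t`; the function names are illustrative only.

```python
import math


def jukes_cantor_scores(alpha_t):
    """Return (match score, mismatch score) as log-odds of H against R.

    alpha_t is the product of the substitution rate and the time t;
    it must be positive, otherwise the mismatch log-odds is undefined.
    """
    p = 0.75 * (1.0 - math.exp(-8.0 * alpha_t))   # pr(disagreement | H)
    match_score = math.log((1.0 - p) / 0.25)       # positive
    mismatch_score = math.log(p / 0.75)            # negative
    return match_score, mismatch_score


def alignment_score(agreements, disagreements, alpha_t):
    """Log-odds score of a gap-free alignment with the given counts."""
    match_score, mismatch_score = jukes_cantor_scores(alpha_t)
    return agreements * match_score + disagreements * mismatch_score


for alpha_t in (1e-6, 0.05, 1.0):
    s, m = jukes_cantor_scores(alpha_t)
    print(f"alpha*t = {alpha_t:g}: match {s:.4f}, mismatch {m:.4f}")
```

For very small αt the match score approaches log 4 and the mismatch score is strongly negative; for large αt both shrink towards zero with magnitudes in a ratio of roughly 3:1.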

Page 20: Pairwise Sequence Comparison

We can do the same with any other Markov substitution matrix for molecular evolution. E.g. with a PAM or BLOSUM matrix of probabilities,

data = a gap-free alignment of two a.a. sequence fragments:

$a_1 \dots a_m$
$b_1 \dots b_m$

$\Pr(\text{data} \mid H) = \prod_{i=1}^{m} \pi_{a_i}\, p_{a_i b_i}(2t)$, $\quad \Pr(\text{data} \mid R) = \prod_{i=1}^{m} \pi_{a_i} \pi_{b_i}$

$\log\left\{ \dfrac{\Pr(\text{data} \mid H)}{\Pr(\text{data} \mid R)} \right\} = \sum_{i=1}^{m} \log\left\{ \dfrac{p_{a_i b_i}(2t)}{\pi_{b_i}} \right\}$

The elements of a log-odds score matrix are typically > 0 on the diagonal and < 0 off the diagonal, but not always.

Also the relative sizes of match and mismatch penalties increase as #PAMs (t) decreases. Thus PAM(120) is more stringent than PAM(250), while PAM(360) is less stringent than it.

PAM(0) = the identity matrix is the toughest.

There are plenty of score matrices based on other principles.

Page 21: Pairwise Sequence Comparison

Below diagonal: BLOSUM62 substitution matrix

Above diagonal: Difference matrix obtained by subtracting the PAM 160 matrix entrywise.

From Henikoff & Henikoff 1992

C S T P A G N D E Q H R K M I L V F Y W

0 -1 1 0 2 1 1 2 1 2 0 0 2 4 1 5 1 2 -2 5 C

2 0 -2 0 -1 0 0 0 1 0 0 0 1 0 1 -1 1 1 -1 S

C 9 2 -1 -1 -1 0 0 0 0 0 0 -1 0 -1 1 0 1 1 3 T

S -1 4 2 -2 -1 -1 0 0 -1 -1 -1 1 1 0 -1 0 0 2 1 P

T -1 1 5 2 -1 -2 -2 -1 0 0 1 1 0 0 1 0 1 1 2 A

P -3 -1 -1 7 2 0 -1 -2 0 1 1 0 0 -1 0 -1 1 2 4 G

A 0 1 0 -1 4 3 -1 -1 0 0 1 -1 0 -1 0 -1 0 0 0 N

G -3 0 -2 -2 0 6 2 -1 -1 -1 0 -1 0 0 0 0 2 1 3 D

N -3 1 0 -2 -2 0 6 1 0 0 2 2 1 -1 0 0 2 2 4 E

D -3 0 -1 -1 -2 -1 1 6 0 -2 0 1 1 -1 0 0 1 3 3 Q

E -4 0 -1 -1 -1 -2 0 2 5 2 -1 0 1 0 -1 0 1 2 2 H

Q -3 0 -1 -1 -1 -2 0 0 2 5 -1 -1 0 -1 1 0 1 3 -4 R

H -3 -1 -2 -2 -2 -2 1 -1 0 0 8 1 -2 -1 1 1 2 3 1 K

R -3 -1 -1 -2 -1 -2 0 -2 0 1 0 5 -2 -1 -1 0 1 2 4 M

K -3 0 -1 -1 -1 -2 0 -1 1 1 -1 2 5 -1 1 0 0 1 3 I

M -1 -1 -1 -2 -1 -3 -2 -3 -2 0 -2 -1 -1 5 -1 0 -1 1 2 L

I -1 -2 -1 -3 -1 -4 -3 -3 -3 -3 -3 -3 -3 1 4 0 1 2 4 V

L -1 -2 -1 -3 -1 -4 -3 -4 -3 -2 -3 -2 -2 2 2 4 -1 -2 1 F

V -1 -2 0 -2 0 -3 -3 -3 -2 -2 -3 -3 -2 1 3 1 4 -1 2 Y

F -2 -2 -2 -4 -2 -3 -3 -3 -3 -3 -1 -3 -3 0 0 0 -1 6 -1 W

Y -2 -2 -2 -3 -2 -3 -2 -3 -2 -1 2 -2 -2 -1 -1 -1 -1 3 7

W -2 -3 -2 -4 -3 -2 -4 -4 -3 -2 -2 -3 -3 -1 -3 -2 -3 1 2 11

C S T P A G N D E Q H R K M I L V F Y W

Page 22: Pairwise Sequence Comparison

Above diagonal: SG scoring system (Feng et al., 1985)

Below diagonal: Log-odds matrix for 250 PAMs (Dayhoff et al., 1978)

C S T P A G N D E Q H R K M I L V F Y W

6 4 2 2 2 3 2 1 0 1 2 2 0 2 2 2 2 3 3 3 C

6 5 4 5 5 5 3 3 3 3 3 3 1 2 2 2 3 3 2 S

C 12 6 4 5 2 4 2 3 3 2 3 4 3 3 2 3 1 2 1 T

S 0 2 6 5 3 2 2 3 3 3 3 2 2 2 3 3 2 2 2 P

T -2 1 3 6 5 3 4 4 3 2 2 3 2 2 2 5 2 2 2 A

P -3 1 0 6 6 3 4 4 2 1 3 2 1 2 2 4 1 2 3 G

A -2 1 1 1 2 6 5 3 3 4 2 4 1 2 1 2 1 3 0 N

G -3 1 0 -1 1 5 6 5 4 3 2 3 0 1 1 3 1 2 0 D

N -4 1 0 -1 0 0 2 6 4 2 2 4 1 1 1 4 0 1 1 E

D -5 0 0 -1 0 1 2 4 6 4 3 4 2 1 2 2 1 2 1 Q

E -5 0 0 -1 0 0 1 3 4 6 4 3 1 1 3 1 2 3 1 H

Q -5 -1 -1 0 0 -1 1 2 2 4 6 5 2 2 2 2 1 1 2 R

H -3 -1 0 0 -1 -2 2 1 1 3 6 6 2 2 2 3 0 1 1 K

R -4 0 0 0 -2 -3 0 -1 -1 1 2 6 6 4 5 4 2 2 3 M

K -5 0 0 -1 -1 -2 1 0 0 1 0 3 5 6 5 5 4 3 2 I

M -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2 0 0 6 6 5 4 3 4 L

I -2 -1 0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2 2 5 6 4 3 3 V

L -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3 4 2 6 6 5 3 F

V -2 -1 0 -1 0 -1 -2 -2 -2 -2 -2 -2 -2 2 4 2 4 6 3 Y

F -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5 0 1 2 -1 9 6 W

Y 0 -3 -3 -5 -3 -5 -2 -4 -4 -4 0 -4 -4 -2 -1 -1 -2 7 10

W -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3 2 -3 -4 -5 -2 -6 0 0 17

C S T P A G N D E Q H R K M I L V F Y W

Page 23: Pairwise Sequence Comparison

Gap penalties

Gap penalties are usually composed of two parts:

Gap opening penalty

Opening a gap reduces the alignment score, so a gap is only worthwhile if it yields a sufficiently better alignment downstream than would be obtained without it

The size of the penalty is usually of the order of one to three times the size of values in the scoring matrix

Page 24: Pairwise Sequence Comparison

Gap penalties (ctd)

Gap extension penalty

If a gap has been created then extending it should not be as hard to do

On the other hand we want to limit the size of the gap to practical lengths

A smaller gap extension penalty may allow an alignment to resolve situations where complete loops may be missing between one structure and another
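The two-part penalty described on these slides is often written as an affine function of gap length. The sketch below is an assumption-laden illustration: the convention (opening penalty charged for the first gap position, extension penalty for each further position) is one of several in use, and the default values are hypothetical, not taken from the slides.

```python
def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Penalty for a single gap of the given length.

    Convention assumed here: the first gap position costs `gap_open`
    and every additional position costs `gap_extend`.
    """
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


# One long gap is much cheaper than the same number of positions
# split across two separate gaps, because opening is the expensive part.
print(affine_gap_penalty(4))
print(2 * affine_gap_penalty(2))
```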

Page 25: Pairwise Sequence Comparison

Low gap penalty

eclustalw May 24, 1999 18:44

lgb1_pea.pep ck: 2970 from: 1 to: 147 Length: 147
hbhu.pep ck: 3588 from: 1 to: 147 Length: 147

Pairwise similarity parameter:
K-Tuple length: 1
Gap Penalty: 3
Number of diagonals: 5
Diagonal window size: 5
Scoring Method: Percentage

Multiple alignment parameter:
Gap Penalty (fixed): 1.00
Gap Penalty (varying): 0.05
Gap separation penalty range: 8
Percent. identity for delay: 40%
List of hydrophilic residue: GPSNDQEKR
Protein Weight Matrix: blosum

                     10        20        30        40        50        60
LGB1_PEA.pep --GFTDKQE-ALVNSSSEFKQNLPGYSILFYTIVLEKAPAAKGLF-SF--LKDTAGVEDS
HBHU.pep     MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVY--PWTQRFFESFGDLSTPDAVMGN
Conservation marks (original column spacing lost): * . *. * * .*. * .. * ** * *

LGB1_PEA.pep PKLQAHAEQVFGLVRDSAAQLR-TKGEVVLGNATLGAIHVQKGVTNP-HFVVVKEALLQT
HBHU.pep     PKVKAHGKKVLGAFSDGLAHLDNLKGTF----ATLSELHCDKLHVDPENFRLLGNVLVCV
Conservation marks (original column spacing lost): **..** .* * * *.* ** *** .* * * .* .. *.

LGB1_PEA.pep IKKASGNNWSEELNTAWEVAYDGLATAIKKAMKTA
HBHU.pep     LAHHFGKEFTPPVQAAYQKVVAGVANAL--AHKYH
Conservation marks (original column spacing lost): . . * . ...* . *.*.*. * *

Page 26: Pairwise Sequence Comparison

Middling gap penalty

eclustalw May 24, 1999 18:50

lgb1_pea.pep ck: 2970 from: 1 to: 147 Length: 147
hbhu.pep ck: 3588 from: 1 to: 147 Length: 147

Pairwise similarity parameter:
K-Tuple length: 1
Gap Penalty: 3
Number of diagonals: 5
Diagonal window size: 5
Scoring Method: Percentage

Multiple alignment parameter:
Gap Penalty (fixed): 25.00
Gap Penalty (varying): 0.05
Gap separation penalty range: 8
Percent. identity for delay: 40%
List of hydrophilic residue: GPSNDQEKR
Protein Weight Matrix: blosum

                     10        20        30        40        50        60
LGB1_PEA.pep ----GFTDKQEALVNSSSEFKQNLPGYSILFYTIVLEKAPAAKGLFSFLKDTAGVEDSPK
HBHU.pep     MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
Conservation marks (original column spacing lost): .* . * .. .* . * * * **

LGB1_PEA.pep LQAHAEQVFGLVRDSAAQLRTKGEVVLGNATLGAIHVQKGVTNP-HFVVVKEALLQTIKK
HBHU.pep     VKAHGKKVLGAFSDGLAHLDN---LKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAH
Conservation marks (original column spacing lost): ..** .* * * *.* . . *** .* * * .* .. *. . .

LGB1_PEA.pep ASGNNWSEELNTAWEVAYDGLATAIKKAMKTA
HBHU.pep     HFGKEFTPPVQAAYQKVVAGVANALAHKYH--
Conservation marks (original column spacing lost): * . ...* . *.*.*. . .

Page 27: Pairwise Sequence Comparison

Very high gap penalty

eclustalw May 24, 1999 18:52

lgb1_pea.pep ck: 2970 from: 1 to: 147 Length: 147
hbhu.pep ck: 3588 from: 1 to: 147 Length: 147

Pairwise similarity parameter:
K-Tuple length: 1
Gap Penalty: 3
Number of diagonals: 5
Diagonal window size: 5
Scoring Method: Percentage

Multiple alignment parameter:
Gap Penalty (fixed): 50.00
Gap Penalty (varying): 0.05
Gap separation penalty range: 8
Percent. identity for delay: 40%
List of hydrophilic residue: GPSNDQEKR
Protein Weight Matrix: blosum

                     10        20        30        40        50        60
LGB1_PEA.pep ----GFTDKQEALVNSSSEFKQNLPGYSILFYTIVLEKAPAAKGLFSFLKDTAGVEDSPK
HBHU.pep     MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
Conservation marks (original column spacing lost): .* . * .. .* . * * * **

LGB1_PEA.pep LQAHAEQVFGLVRDSAAQLRTKGEVVLGNATLGAIHVQKGVTNPHFVVVKEALLQTIKKA
HBHU.pep     VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPEN--FRLLGNVLVCVLAHH
Conservation marks (original column spacing lost): ..** .* * * *.* . . * . ... * * .. *. . .

LGB1_PEA.pep SGNNWSEELNTAWEVAYDGLATAIKKAMKTA
HBHU.pep     FGKEFTPPVQAAYQKVVAGVANALAHKYH--
Conservation marks (original column spacing lost): * . ...* . *.*.*. . .

Page 28: Pairwise Sequence Comparison

Dynamic Programming

This is a mathematical implementation that can be seen as an extension of the dotplot method

Rather than dots, the comparison matrix positions are assigned values that reflect the scores in the scoring matrix

For obtaining optimal alignments

Page 29: Pairwise Sequence Comparison

Dynamic Programming

The optimum alignment is obtained by tracing the highest scoring path from the top left-hand corner to the bottom right-hand corner of the matrix

When the alignment steps away from the diagonal this implies an insertion or deletion event, the impact of which can be assessed by the application of a gap penalty

Page 30: Pairwise Sequence Comparison

(b along the top, a down the side; an entry is 0 where the two letters match and 1 where they differ)

      A  C  A  C  A  C  T  A
A     0  1  0  1  0  1  1  0
G     1  1  1  1  1  1  1  1
C     1  0  1  0  1  0  1  1
A     0  1  0  1  0  1  1  0
C     1  0  1  0  1  0  1  1
A     0  1  0  1  0  1  1  0
C     1  0  1  0  1  0  1  1
A     0  1  0  1  0  1  1  0

Page 31: Pairwise Sequence Comparison

Dynamic programming: the formula

Suppose that our two sequences are $a = (a_1, \dots, a_m)$ and $b = (b_1, \dots, b_n)$, and that we denote by $d_{ij}$ the edit distance between the initial segments $a^i = (a_1, \dots, a_i)$ and $b^j = (b_1, \dots, b_j)$ of a and b.

Extend this to i=j=0 by writing $d_{00} = 0$.

Supposing that a deletion or an insertion incurs a penalty of +1, the following formula summarizes our verbal argument:

$d_{ij} = \min\left( d_{i-1,j-1} + s(a_i, b_j),\; d_{i,j-1} + 1,\; d_{i-1,j} + 1 \right)$.

(More is needed to give a complete algorithm: what is it?)
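One possible way to complete the formula into an algorithm is sketched below. It assumes the boundary values d(i,0) = i and d(0,j) = j (an initial segment against the empty string costs one deletion or insertion per letter), the unit cost s(u,v), and a row-by-row order of evaluation; the function name is hypothetical.

```python
def edit_distance_table(a, b):
    """Fill the table d[i][j] = edit distance between a[:i] and b[:j]."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    # Boundary: aligning a prefix with the empty string needs only gaps.
    for i in range(1, m + 1):
        d[i][0] = i
    for j in range(1, n + 1):
        d[0][j] = j
    # Row by row, so the three neighbours are always already known.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + s,   # match or replace
                          d[i][j - 1] + 1,       # insert b[j-1]
                          d[i - 1][j] + 1)       # delete a[i-1]
    return d


table = edit_distance_table("AGCACACA", "ACACACTA")
for row in table:
    print(" ".join(f"{value:2d}" for value in row))
print("edit distance:", table[-1][-1])
```

The bottom-right entry is the edit distance; recovering an alignment additionally needs a traceback through the table.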

Page 32: Pairwise Sequence Comparison

(b along the top, a down the side)

         A  C  A  C  A  C  T  A
      0  1  2  3  4  5  6  7  8
A     1  0  1  2  3  4  5  6  7
G     2  1  1  2  3  4  5  6  7
C     3  2  1  2  2  3  4  5  6
A     4  3  2  1  2  2  3  4  5
C     5  4  3  2  1  2  2  3  4
A     6  5  4  3  2  1  2  3  3
C     7  6  5  4  3  2  1  2  3
A     8  7  6  5  4  3  2  2  2

Page 33: Pairwise Sequence Comparison

Chance or common ancestry?

Idea: calculate optimal alignment scores for pairs of sequences where one is a randomized (shuffled) version of the original. This will give a distribution of random scores, representing chance similarity rather than homology.

The score from our original pair of sequences can be referred to this distribution and assigned a Z-score (subtract mean of randoms and divide by SD of randoms), or (better) a p-value.

Criticism: Such random a.a. sequences might have plausible a.a. compositions but are quite unlike real protein sequences.

Partial reply: a) restrict the randomization to blocks; or, b) create a distribution of chance similarity scores using real a.a. sequences known or assumed not to be homologous to our query sequence. [Other approaches use theory, but this is still subject to the criticism above.]
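The shuffling idea can be sketched as follows. Everything specific here is an assumption for illustration: the scoring scheme (match +1, mismatch -1, gap -1), the number of shuffles, the fixed seed, and the function names. The p-value is the simple empirical one, with 1 added to numerator and denominator so it is never exactly zero.

```python
import random
import statistics


def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score under a simple linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def shuffle_test(a, b, n_shuffles=200, seed=0):
    """Refer the observed score to scores against shuffled versions of b."""
    rng = random.Random(seed)
    observed = global_score(a, b)
    letters = list(b)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)               # keeps composition, destroys order
        null_scores.append(global_score(a, "".join(letters)))
    mean = statistics.mean(null_scores)
    sd = statistics.pstdev(null_scores)
    z = (observed - mean) / sd if sd > 0 else float("nan")
    p = (1 + sum(s >= observed for s in null_scores)) / (n_shuffles + 1)
    return observed, z, p


observed, z, p = shuffle_test("AGCACACA", "ACACACTA")
print(f"score {observed}, Z-score {z:.2f}, empirical p-value {p:.3f}")
```

Shuffling whole sequences is exactly the procedure the criticism above targets; restricting the shuffle to blocks, or drawing the null scores from real unrelated sequences, would replace the `rng.shuffle` step.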

Page 34: Pairwise Sequence Comparison

Dynamic Programming

Based on notes by George Rudy, formerly WEHI.

Page 35: Pairwise Sequence Comparison

“Life must be lived forwards and understood backwards.”

Søren Kierkegaard

Page 36: Pairwise Sequence Comparison

What is DP?

Operations research: “A mathematical formalism applicable to problems involving optimization of decisions over time.”

(after R. Bellman and S. Dreyfus)

Bioinformatics : “An algorithm for finding optimal sequence alignments given an additive alignment score.”

( after R. Durbin, et al.)

Computer programming: “An approach to algorithm design whereby the target problem is decomposed into smaller problems that are then solved independently.”

(after R. Sedgewick)

Page 37: Pairwise Sequence Comparison

Where did DP come from?

- Richard Bellman

- The RAND Corporation

- “Dynamic” and “Programming”

Page 38: Pairwise Sequence Comparison

Where can DP be applied?

- Both discrete and continuous problems concerning deterministic, stochastic, or adaptive processes

- Multiple fields: research, industry, finance,…

- Examples: allocation processes

smoothing and scheduling processes

optimal search and stopping techniques

optimal trajectories

multistage production processes

feedback control processes

Markovian decision processes

Page 39: Pairwise Sequence Comparison

DP in biomedical literature (1)

[Figure: chart of DP in the biomedical literature; vertical axis from 0 to 25, horizontal axis: Years.]

Page 40: Pairwise Sequence Comparison

DP in biomedical literature (2)

- A symmetric-iterated multiple alignment of protein sequences.

[Brocchieri, L. and Karlin S., J. Mol. Biol. 276(1):249-64, 1998.]

- Sequence assembly validation by multiple restriction digest fragment coverage analysis.

[Rouchka, E.C. and States, D.J., ISMB. 6:140-7, 1998.]

- Automated protein sequence database classification. I. Integration of compositional similarity search, local similarity search, and multiple sequence alignment.

[Gracy, J. and Argos, P., Bioinformatics 14(2):164-73, 1998.]

- A segment-based dynamic programming algorithm for predicting gene structure.

[Wu, T.D., J. Comput. Biol. 3(3):375-94, 1996.]

- Automatic detection of cardiac contours on MR images using fuzzy logic and dynamic programming.

[Lalande A. et al., Proc. AMIA Annu. Fall Symp. :474-8, 1997.]

- Process models for production of beta-lactam antibiotics.

[Bellgardt, K.H., Adv. Biochem. Eng. Biotechnol. 60:153-94, 1998.]

- Dynamic programming approach for newborn’s incubator humidity control.

[Bouattoura, D. et al., IEEE Trans. Biomed. Eng. 45(1):48-55, 1998.]

- Minimum energy trajectories of the swing ankle when stepping over obstacles of different heights.

[Chou L.S. et al., J. Biomech. 30(2):115-20, 1997.]

- A theoretical study of the socioecology of ungulates. II. A dynamic programming study of the stochastic formulation.

[Paveri-Fontana, S.L. and Focardi, S. Theor. Popul. Biol. 46(3):279-99, 1994.]

Page 41: Pairwise Sequence Comparison

What problems are suitable for DP?

- Essential components (common to all OR problems):

a decision-maker

access to results of decisions

- Additionally:

decisions are sequential

later decisions are affected by earlier ones

effect of a decision can be calculated independently of other decisions

Page 42: Pairwise Sequence Comparison

The Stagecoach Problem (1)

[Figure: the stagecoach network, with 16 vertices labelled A to P joined by 24 edges, each edge carrying a numeric cost. After S. E. Dreyfus.]

Page 43: Pairwise Sequence Comparison

Some terminology

- Vertex

- Edge

- Path

-Monotonic-to-the-right

- (Admissible) path

- Stage

- State

Page 44: Pairwise Sequence Comparison

The Stagecoach Problem (2)

[Figure: the stagecoach network with its edge costs, now with the value 0 marked at one vertex.]

Page 45: Pairwise Sequence Comparison

The Stagecoach Problem (2)

[Figure: the stagecoach network with its edge costs, now with values marked at three vertices (2, 1, 0).]

Page 46: Pairwise Sequence Comparison

The Stagecoach Problem (2)

[Figure: the stagecoach network with its edge costs, now with values marked at four vertices (2, 4, 1, 0).]

Page 47: Pairwise Sequence Comparison

The Stagecoach Problem (2)

[Figure: the stagecoach network with its edge costs, now with values marked at ten vertices (10, 8, 7, 2, 4, 6, 7, 5, 1, 0).]

Page 48: Pairwise Sequence Comparison

The Stagecoach Problem (2)

[Figure: the stagecoach network with its edge costs, now with values marked at all 16 vertices (10, 9, 12, 13, 14, 8, 8, 7, 2, 4, 6, 11, 7, 5, 1, 0).]

Page 49: Pairwise Sequence Comparison

Some more terminology

- Optimal value function

- Policy

- Optimal policy function

Page 50: Pairwise Sequence Comparison

The Stagecoach Problem (3)

[Figure: the stagecoach network with its edge costs and a value marked at each of the 16 vertices.]

Page 51: Pairwise Sequence Comparison

The Stagecoach Problem (3)

[Figure: the stagecoach network with its edge costs and a value marked at each of the 16 vertices.]

Page 52: Pairwise Sequence Comparison

The Stagecoach Problem (3)

[Figure: the stagecoach network with its edge costs and a value marked at each of the 16 vertices.]

Page 53: Pairwise Sequence Comparison

The Stagecoach Problem (4)

[Figure: the stagecoach network with its edge costs and a value marked at each of the 16 vertices.]

Page 54: Pairwise Sequence Comparison

Efficiency of the DP approach

- At each of 9 vertices where a real choice existed: 2 additions

1 binary comparison

- At the other 6 vertices: 1 addition

Total: 24 additions

9 comparisons

- Compare this with direct evaluation of the original problem by enumeration of all 20 admissible paths:

5 additions/path = 100 additions 20 comparisons

Page 55: Pairwise Sequence Comparison

Efficiency (2), and the Curse of Dimensionality

In general, for the n-stage problem treated here,

DP involves $n^2/2 + n$ additions.

Direct enumeration generates $\binom{n}{n/2} = \dfrac{n!}{\left(\frac{n}{2}\right)!\,\left(\frac{n}{2}\right)!}$ paths, or $(n-1)\,\dfrac{n!}{\left(\frac{n}{2}\right)!\,\left(\frac{n}{2}\right)!}$ additions.

Thus, for n=20, DP requires 220 additions while direct enumeration would demand 3,510,364 additions.
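These counts are easy to reproduce. A small sketch, assuming n is even (as in the stagecoach network, where n = 6) and using hypothetical function names:

```python
from math import comb


def dp_additions(n):
    """Additions needed by DP on the n-stage network: n^2/2 + n."""
    return n * n // 2 + n


def enumeration_paths(n):
    """Number of admissible paths: n choose n/2."""
    return comb(n, n // 2)


def enumeration_additions(n):
    """Each path has n edges, so summing it takes n - 1 additions."""
    return (n - 1) * enumeration_paths(n)


for n in (6, 20):
    print(n, dp_additions(n), enumeration_paths(n), enumeration_additions(n))
```

For n = 6 this gives the 24 additions, 20 paths and 100 additions of the previous slide; for n = 20 it gives 220 against 3,510,364.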

Page 56: Pairwise Sequence Comparison

The Stagecoach Problem (5)

[Figure: the stagecoach network with vertices A to P placed on coordinate axes; x axis with ticks 1 to 6, y axis with ticks -3 to 3.]

Page 57: Pairwise Sequence Comparison

The Principle of Optimality, or Bellman’s Principle

“An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.” (Bellman)

or, “An optimal sequence of decisions in a multistage decision process problem has the property that whatever the initial stage, state, and decision are, the remaining decisions must constitute an optimal sequence of decisions for the remaining problem, with the stage and state resulting from the first decision considered as initial conditions.” (Dreyfus)

or, “An optimal policy must have the property that no matter what path is taken to enter a particular state, the remaining stages (decisions) taken must constitute an optimal policy for departure from that state.”

or, “An optimal policy is comprised of optimal subpolicies.”

or, “An optimal policy from any state is independent of the path taken to that state, and is made up entirely of optimal subpolicies.”

or, ...

Page 58: Pairwise Sequence Comparison

The optimal value function

$S(x,y)$ = the value of the minimum-value admissible path connecting the vertex (x,y) and the terminal vertex (6,0)

$e_u(x,y)$ = the value of the edge connecting the vertices (x,y) and (x+1, y+1)

$e_d(x,y)$ = the value of the edge connecting the vertices (x,y) and (x+1, y-1)

$S(x,y) = \min\{\, e_u(x,y) + S(x+1, y+1),\; e_d(x,y) + S(x+1, y-1) \,\}$

$S(6,0) = 0$.
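The recurrence translates directly into a memoised function. The sketch below is illustrative only: the edge costs are hypothetical (a small 4-stage network with terminal vertex (4,0), not the network from the slides, whose edge layout is not recoverable here), and the names `solve_stagecoach`, `E_UP` and `E_DOWN` are assumptions.

```python
from functools import lru_cache

# Hypothetical edge costs, keyed by the vertex the edge leaves.
E_UP = {(0, 0): 1, (1, 1): 5, (1, -1): 4, (2, 0): 2, (2, -2): 3, (3, -1): 1}
E_DOWN = {(0, 0): 2, (1, 1): 3, (1, -1): 1, (2, 2): 2, (2, 0): 6, (3, 1): 4}


def solve_stagecoach(e_up, e_down, terminal):
    """Return the optimal value function S(x, y) for the given network."""

    @lru_cache(maxsize=None)
    def S(x, y):
        if (x, y) == terminal:
            return 0
        options = []
        if (x, y) in e_up:      # diagonal step up to (x+1, y+1)
            options.append(e_up[(x, y)] + S(x + 1, y + 1))
        if (x, y) in e_down:    # diagonal step down to (x+1, y-1)
            options.append(e_down[(x, y)] + S(x + 1, y - 1))
        return min(options)

    return S


S = solve_stagecoach(E_UP, E_DOWN, terminal=(4, 0))
print("minimum cost from the initial vertex:", S(0, 0))
```

Memoisation ensures each vertex is evaluated once, which is where the saving over path enumeration comes from.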

Page 59: Pairwise Sequence Comparison

A more formal restatement of common features of DP problems

A physical system characterized at any stage by a small set of parameters, the state variables;

At each stage of the process there is a choice of a number of decisions;

The effect of a decision is a transformation of the state variables;

The past history of the system is of no importance in determining future actions;

The purpose of the process is to maximize some function of the state variables.

Page 60: Pairwise Sequence Comparison

The practice of DP

Imbed the specific given problem in a more general family of problems;

Define the optimal value function which associates a value with each of the various possible initial conditions of problems in that family;

Invoke the principle of optimality in order to deduce a recurrence relation characterizing that function;

Seek the solution of the recurrence relation in order to obtain the optimal policy function which furnishes the solution to the specific given problem and all other problems in the more general family as well.

Page 61: Pairwise Sequence Comparison

More practically speaking,

Determine the decision-maker and the decisions to be made;

Determine the stages;

Determine the possible states;

Formulate the optimal value function in the form of a recurrence relation;

Calculate and tabulate the optimal value function for each stage and state;

Find the optimal policy (ies) for the problem.

Page 62: Pairwise Sequence Comparison

New problem, new terminology

Edit operations: M(atch), R(eplacement), I(nsert), D(elete).

Edit transcript: A string over the alphabet M, R, I, D that describes a transformation of one string into another. Example:

R D I M D M

M A - T H S

A - R T - S

Edit (Levens(h)tein) distance: The minimum number of edit operations necessary to transform one string into another. (Note: matches are not counted.) Example: counting the operations in the transcript above gives

R D I M D M

1+ 1+ 1+ 0+ 1+ 0 = 4

This particular transcript is not minimal: the tabulation on the following slides finds transcripts with 3 operations, so the edit distance of MATHS and ARTS is 3.

Page 63: Pairwise Sequence Comparison

Once again,

Imbed the problem in the more general family;

Define the optimal value function;

Deduce the recurrence relation;

Solve for the recurrence relation to obtain the optimal policy function.

Page 64: Pairwise Sequence Comparison

The recurrence

Stage: position in the edit transcript;

State: I, D, M, or R;

Optimal value function: D(i, j)

where D(i, j) = edit distance of Seq1[1...i] and Seq2[1...j]

Recurrence relation:

D(i, j) = min {1 + D(i-1, j), 1 + D(i, j-1), t(i, j) + D(i-1, j-1)},

where t(i, j) = 0 if Seq1(i) = Seq2(j), and t(i, j) = 1 otherwise, with boundary values D(i, 0) = i and D(0, j) = j.
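The recurrence can also be evaluated top-down, letting memoisation play the role of the table. A minimal sketch, assuming the boundary values D(i, 0) = i and D(0, j) = j and a hypothetical function name:

```python
from functools import lru_cache


def edit_distance(seq1, seq2):
    """Edit distance of seq1 and seq2 via the memoised recurrence D(i, j)."""

    @lru_cache(maxsize=None)
    def D(i, j):
        if i == 0:
            return j            # only insertions remain
        if j == 0:
            return i            # only deletions remain
        t = 0 if seq1[i - 1] == seq2[j - 1] else 1
        return min(1 + D(i - 1, j),        # delete Seq1(i)
                   1 + D(i, j - 1),        # insert Seq2(j)
                   t + D(i - 1, j - 1))    # match or replace

    return D(len(seq1), len(seq2))


print(edit_distance("MATHS", "ARTS"))
```

Python's recursion limit makes this form suitable only for short strings; the bottom-up tabulation shown on the following slides avoids that restriction.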

Page 65: Pairwise Sequence Comparison

The tabulation , D(i, j)

Seq2(j) A R T S

Seq1(i) 0 1 2 3 4

0

M 1

A 2

T 3

H 4

S 5

Page 66: Pairwise Sequence Comparison

The tabulation , D(i, j)

Seq2(j) A R T S

Seq1(i) 0 1 2 3 4

0 0

M 1

A 2

T 3

H 4

S 5

Page 67: Pairwise Sequence Comparison

The tabulation , D(i, j)

Seq2(j) A R T S

Seq1(i) 0 1 2 3 4

0 0 1

M 1

A 2

T 3

H 4

S 5

Page 68: Pairwise Sequence Comparison

The tabulation , D(i, j)

Seq2(j) A R T S

Seq1(i) 0 1 2 3 4

0 0 1 2

M 1

A 2

T 3

H 4

S 5

Page 69: Pairwise Sequence Comparison

The tabulation , D(i, j)

Seq2(j) A R T S

Seq1(i) 0 1 2 3 4

0 0 1 2 3 4

M 1 1

A 2 2

T 3 3

H 4 4

S 5 5

Page 70: Pairwise Sequence Comparison

The tabulation , D(i, j)

Seq2(j) A R T S

Seq1(i) 0 1 2 3 4

0 0 1 2 3 4

M 1 1 1

A 2 2

T 3 3

H 4 4

S 5 5

Page 71: Pairwise Sequence Comparison

The tabulation , D(i, j)

Seq2(j) A R T S

Seq1(i) 0 1 2 3 4

0 0 1 2 3 4

M 1 1 1 2

A 2 2

T 3 3

H 4 4

S 5 5

Page 72: Pairwise Sequence Comparison

The tabulation , D(i, j)

Seq2(j) A R T S

Seq1(i) 0 1 2 3 4

0 0 1 2 3 4

M 1 1 1 2 3 4

A 2 2 1 2 3 4

T 3 3

H 4 4

S 5 5

Page 73: Pairwise Sequence Comparison

The tabulation , D(i, j)

Seq2(j) A R T S

Seq1(i) 0 1 2 3 4

0 0 1 2 3 4

M 1 1 1 2 3 4

A 2 2 1 2 3 4

T 3 3 2 2 2 3

H 4 4

S 5 5

Page 74: Pairwise Sequence Comparison

The tabulation , D(i, j)

Seq2(j) A R T S

Seq1(i) 0 1 2 3 4

0 0 1 2 3 4

M 1 1 1 2 3 4

A 2 2 1 2 3 4

T 3 3 2 2 2 3

H 4 4 3 3 3 3

S 5 5 4 4 4 3

Page 75: Pairwise Sequence Comparison

The traceback

Seq2(j) A R T S

Seq1(i) 0 1 2 3 4

0 0 1 2 3 4

M 1 1 1 2 3 4

A 2 2 1 2 3 4

T 3 3 2 2 2 3

H 4 4 3 3 3 3

S 5 5 4 4 4 3

Page 76: Pairwise Sequence Comparison

The solutions - #1

Cost:        1  0  1  1  0  = 3
Transcript:  D  M  R  R  M
Seq1:        M  A  T  H  S
Seq2:        -  A  R  T  S

Page 77: Pairwise Sequence Comparison

The traceback

Seq2(j) A R T S

Seq1(i) 0 1 2 3 4

0 0 1 2 3 4

M 1 1 1 2 3 4

A 2 2 1 2 3 4

T 3 3 2 2 2 3

H 4 4 3 3 3 3

S 5 5 4 4 4 3

Page 78: Pairwise Sequence Comparison

The solutions - #2

Cost:        1  0  1  0  1  0  = 3
Transcript:  D  M  I  M  D  M
Seq1:        M  A  -  T  H  S
Seq2:        -  A  R  T  -  S

Page 79: Pairwise Sequence Comparison

The traceback

Seq2(j) A R T S

Seq1(i) 0 1 2 3 4

0 0 1 2 3 4

M 1 1 1 2 3 4

A 2 2 1 2 3 4

T 3 3 2 2 2 3

H 4 4 3 3 3 3

S 5 5 4 4 4 3

Page 80: Pairwise Sequence Comparison

The solutions - #3

Cost:        1  1  0  1  0  = 3
Transcript:  R  R  M  D  M
Seq1:        M  A  T  H  S
Seq2:        A  R  T  -  S
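All optimal edit transcripts can be enumerated by walking back from the bottom-right cell of the table and following every predecessor that attains the minimum. The sketch below is one way to do this, with a hypothetical function name; the number of optimal transcripts can grow very quickly with sequence length, so it is meant for small examples only.

```python
def optimal_transcripts(seq1, seq2):
    """Return every minimum-cost edit transcript taking seq1 to seq2."""
    m, n = len(seq1), len(seq2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            t = 0 if seq1[i - 1] == seq2[j - 1] else 1
            D[i][j] = min(1 + D[i - 1][j], 1 + D[i][j - 1], t + D[i - 1][j - 1])

    def walk(i, j):
        """Yield transcripts for seq1[:i] -> seq2[:j] that achieve D[i][j]."""
        if i == 0 and j == 0:
            yield ""
            return
        if i > 0 and j > 0:
            t = 0 if seq1[i - 1] == seq2[j - 1] else 1
            if D[i][j] == D[i - 1][j - 1] + t:
                for prefix in walk(i - 1, j - 1):
                    yield prefix + ("M" if t == 0 else "R")
        if i > 0 and D[i][j] == D[i - 1][j] + 1:
            for prefix in walk(i - 1, j):
                yield prefix + "D"      # letter of seq1 against a gap
        if j > 0 and D[i][j] == D[i][j - 1] + 1:
            for prefix in walk(i, j - 1):
                yield prefix + "I"      # gap against a letter of seq2

    return sorted(walk(m, n))


for transcript in optimal_transcripts("MATHS", "ARTS"):
    print(" ".join(transcript))
```

For MATHS and ARTS this yields exactly three transcripts, matching solutions #1 to #3.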

Page 81: Pairwise Sequence Comparison

DP, in general (well, for a discrete, deterministic, additive process, anyway)

$F(t, s) = \operatorname{Opt}\,\{\, r(t, s, x) + a\,F(t', s') : x \in X(t, s) \text{ and } s' = T(t, s, x) \,\}$

Need not be additive. When the process is stochastic, r and F are expected values; the state transform is random with a probability distribution

$P[\,T(t, s, x) = s' \mid s, x\,]$, and

$F(t', s')$ is replaced by

$\sum_{s'} F(t', s')\, P[\,T(t, s, x) = s' \mid s, x\,]$

Page 82: Pairwise Sequence Comparison

“Life must be lived forwards and understood backwards.”

Søren Kierkegaard