Expected accuracy sequence alignment Usman Roshan

Expected accuracy sequence alignment

Usman Roshan

Optimal pairwise alignment

• Sum of pairs (SP) optimization: find the alignment of two sequences that maximizes the similarity score given an arbitrary cost matrix. We can find the optimal alignment in O(mn) time and space using the Needleman-Wunsch algorithm.

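• Recursion:

$$
M(i, j) = \max
\begin{cases}
M(i-1, j-1) + s(x_i, y_j) \\
M(i, j-1) + g \\
M(i-1, j) + g
\end{cases}
$$

where M(i, j) is the score of the optimal alignment of $x_{1..i}$ and $y_{1..j}$, $s(x_i, y_j)$ is the entry of the substitution scoring matrix for $x_i$ and $y_j$, and g is the gap penalty.

• Traceback: follow the maximizing choices back from M(m, n) to M(0, 0) to recover the alignment.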
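The recursion and traceback can be sketched in Python as follows. This is a minimal illustration, not code from the slides: the function name `needleman_wunsch`, the `score` callback, and the first row and column initialization (i times g and j times g) are assumptions.

```python
def needleman_wunsch(x, y, score, gap):
    """Return (optimal score, aligned x, aligned y) under a single gap penalty."""
    m, n = len(x), len(y)
    # M[i][j] = best score for aligning x[:i] with y[:j]
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = i * gap  # assumed boundary: x[:i] against gaps only
    for j in range(1, n + 1):
        M[0][j] = j * gap  # assumed boundary: y[:j] against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(
                M[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                M[i][j - 1] + gap,
                M[i - 1][j] + gap,
            )
    # Traceback from (m, n) to (0, 0), preferring the diagonal move on ties.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and (j == 0 or M[i][j] == M[i - 1][j] + gap):
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return M[m][n], "".join(reversed(ax)), "".join(reversed(ay))


def simple_score(a, b):
    # Illustrative scores only: match = +1, mismatch = -1
    return 1 if a == b else -1


best, top, bottom = needleman_wunsch("ACGT", "AGT", simple_score, -1)
```

Both the table fill and the table itself are O(mn), matching the time and space bound stated above.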

Affine gap penalties

• The affine gap model allows for long insertions in distant proteins by charging a lower penalty for gap extensions. We define g as the gap open penalty (first gap position) and e as the gap extension penalty (each additional gap position).

• Alignment:
  – ACACCCT        ACACCCC
  – ACCT T         AC CTT
  – Score = 0      Score = 0.9

• Trivial cost matrix: match=+1, mismatch=0, gapopen=-2, gapextension=-0.1
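To make the scoring concrete, here is a small sketch that scores a given gapped alignment under this cost matrix. The function name is hypothetical, and the gap placements in the two examples (`AC--CTT` and `AC-CT-T`) are assumptions chosen to reproduce the scores 0.9 and 0, since the slide text does not preserve the exact gap positions.

```python
def affine_alignment_score(top, bottom, match=1.0, mismatch=0.0,
                           gap_open=-2.0, gap_ext=-0.1):
    """Score two equal-length gapped rows; '-' marks a gap."""
    assert len(top) == len(bottom)
    total = 0.0
    prev_gap_row = None  # row that held a gap in the previous column, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            # First position of a gap run pays gap_open, later positions gap_ext.
            total += gap_ext if row == prev_gap_row else gap_open
            prev_gap_row = row
        else:
            total += match if a == b else mismatch
            prev_gap_row = None
    return total


one_long_gap = affine_alignment_score("ACACCCC", "AC--CTT")    # 3 matches, one gap of length 2
two_short_gaps = affine_alignment_score("ACACCCT", "AC-CT-T")  # 4 matches, two gaps of length 1
```

Under these assumed placements, the single long gap costs 2.1 in total while the two separate gaps cost 4, which is the effect the affine model is designed to produce.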

Affine penalty recursion

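$$
\begin{aligned}
V(i, j) &= \max\{E(i, j),\ F(i, j),\ M(i, j)\} \\
M(i, j) &= V(i-1, j-1) + s(x_i, y_j) \\
E(i, j) &= \max\{E(i, j-1) + \mathrm{ext},\ V(i, j-1) + g\} \\
F(i, j) &= \max\{F(i-1, j) + \mathrm{ext},\ V(i-1, j) + g\}
\end{aligned}
$$

M(i, j) denotes alignments of $x_{1..i}$ and $y_{1..j}$ ending with a match/mismatch. E(i, j) denotes alignments of $x_{1..i}$ and $y_{1..j}$ such that $y_j$ is paired with a gap. F(i, j) is defined similarly, with $x_i$ paired with a gap. Here g is the gap open penalty and ext is the gap extension penalty (written e on the previous slide). The recursion takes O(mn) time, where m and n are the lengths of x and y respectively.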
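The four recursions translate almost line by line into code. The sketch below computes only the optimal score; the function name and the boundary handling (V(0, 0) = 0 and all other undefined cells set to minus infinity) are assumptions, since the slides do not state the initialization.

```python
NEG_INF = float("-inf")


def affine_optimal_score(x, y, score, g, ext):
    """Optimal global alignment score with gap open g and gap extension ext."""
    m, n = len(x), len(y)

    def table():
        return [[NEG_INF] * (n + 1) for _ in range(m + 1)]

    M, E, F, V = table(), table(), table(), table()
    V[0][0] = 0.0  # assumed boundary: the empty alignment scores 0
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                M[i][j] = V[i - 1][j - 1] + score(x[i - 1], y[j - 1])
            if j > 0:  # y_j paired with a gap
                E[i][j] = max(E[i][j - 1] + ext, V[i][j - 1] + g)
            if i > 0:  # x_i paired with a gap
                F[i][j] = max(F[i - 1][j] + ext, V[i - 1][j] + g)
            V[i][j] = max(E[i][j], F[i][j], M[i][j])
    return V[m][n]


def trivial_score(a, b):
    # The trivial cost matrix from the earlier slide: match = +1, mismatch = 0
    return 1.0 if a == b else 0.0


best_affine = affine_optimal_score("ACACCCC", "ACCTT", trivial_score, -2.0, -0.1)
```

With the trivial cost matrix, the best score for these two sequences is 0.9: at most three residues can match, and the two extra residues of x are cheapest to absorb in one gap of length 2.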

Expected accuracy alignment

• The dynamic programming formulation allows us to find the optimal alignment defined by a scoring matrix and gap penalties. This alignment is not necessarily the most “accurate” or biologically informative one.

• We now look at a different formulation of alignment that allows us to compute the most accurate one instead of the optimal one.

Posterior probability of xi aligned to yj

• Let A be the set of all alignments of sequences x and y, and define P(a|x,y) to be the probability that alignment a (of x and y) is the true alignment a*.

• We define the posterior probability of the ith residue of x ($x_i$) aligning to the jth residue of y ($y_j$) in the true alignment (a*) of x and y as

$$
P(x_i \sim y_j \in a^* \mid x, y) = \sum_{a \in A} P(a \mid x, y)\, \mathbf{1}\{x_i \sim y_j \in a\}
$$

Do et al., Genome Research, 2005

Expected accuracy of alignment

• We can define the expected accuracy of an alignment a as

$$
E_{a^*}[\mathrm{accuracy}(a, a^*)] = \frac{1}{\min(|x|, |y|)} \sum_{x_i \sim y_j \in a} P(x_i \sim y_j \in a^* \mid x, y)
$$

• The maximum expected accuracy alignment can be obtained by the same dynamic programming algorithm

$$
V(i, j) = \max
\begin{cases}
V(i-1, j-1) + P(x_i \sim y_j) \\
V(i-1, j) \\
V(i, j-1)
\end{cases}
$$

Do et al., Genome Research, 2005
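A minimal sketch of this recursion, assuming the posterior matrix is already available as a nested list `P` in which `P[i-1][j-1]` stands for $P(x_i \sim y_j)$. The function name and the example values are illustrative only.

```python
def max_expected_accuracy(P):
    """Return (sum of posteriors on the best path, list of aligned (i, j) pairs).

    Indices in the returned pairs are 1-based, as in the recursion.
    """
    m = len(P)
    n = len(P[0]) if P else 0
    V = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + P[i - 1][j - 1],
                V[i - 1][j],
                V[i][j - 1],
            )
    # Traceback: V[i][j] is exactly one of the three candidates above.
    pairs = []
    i, j = m, n
    while i > 0 and j > 0:
        if V[i][j] == V[i - 1][j]:
            i -= 1
        elif V[i][j] == V[i][j - 1]:
            j -= 1
        else:
            pairs.append((i, j))
            i -= 1
            j -= 1
    pairs.reverse()
    return V[m][n], pairs


# Hypothetical posteriors for a 4-residue x and a 5-residue y.
P_example = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.9, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]
total, mea_pairs = max_expected_accuracy(P_example)
```

Dividing `total` by min(|x|, |y|) gives the expected accuracy of the returned alignment. The normalization is constant for a given pair of sequences, so it does not change which alignment is chosen.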

Example for expected accuracy

• True alignment
  AC_CG
  ACCCA
  Expected accuracy = (1+1+0+1+1)/4 = 1

• Estimated alignment
  ACC_G
  ACCCA
  Expected accuracy = (1+1+0.1+0+1)/4 = 0.775
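The arithmetic in this example can be reproduced with a short sketch. The helper names are hypothetical, and the posterior values in `P_toy` are simply read off the slide's per-column numbers (they are toy values, not the output of any model).

```python
def aligned_pairs(top, bottom, gap="_"):
    """List the 1-based residue pairs (i, j) that share a column."""
    pairs = []
    i = j = 0
    for a, b in zip(top, bottom):
        if a != gap:
            i += 1
        if b != gap:
            j += 1
        if a != gap and b != gap:
            pairs.append((i, j))
    return pairs


def expected_accuracy(pairs, posterior, len_x, len_y):
    """Sum the posteriors of the aligned pairs, normalized by the shorter length."""
    return sum(posterior.get(p, 0.0) for p in pairs) / min(len_x, len_y)


# Toy posteriors taken from the per-column values in the example above.
P_toy = {(1, 1): 1.0, (2, 2): 1.0, (3, 4): 1.0, (3, 3): 0.1, (4, 5): 1.0}

true_pairs = aligned_pairs("AC_CG", "ACCCA")
est_pairs = aligned_pairs("ACC_G", "ACCCA")
acc_true = expected_accuracy(true_pairs, P_toy, 4, 5)
acc_est = expected_accuracy(est_pairs, P_toy, 4, 5)
```

Columns containing a gap contribute nothing to the sum, which is why each alignment has five columns but only four terms that can be non-zero.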

Estimating posterior probabilities

• If correct posterior probabilities can be computed then we can compute the correct alignment. It remains to estimate these probabilities from the data.

• PROBCONS (Do et al., Genome Research 2005): estimate probabilities from pairwise HMMs using forward and backward recursions (as defined in Durbin et al. 1998)

• Probalign (Roshan and Livesay, Bioinformatics 2006): estimate probabilities using partition function dynamic programming matrices

Partition function posterior probabilities

• Standard alignment score:

$$
S(a) = T \sum_{(i,j) \in a} \ln\!\left(\frac{M_{ij}}{f_i f_j}\right) + (\text{gap penalties})
$$

• Probability of alignment (Miyazawa, Prot. Eng. 1995)

$$
P(a) \propto e^{S(a)/T}
$$

• If we knew the alignment partition function Z(T) then

$$
P(a, T) = \frac{e^{S(a)/T}}{Z(T)}
$$
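As a small illustration of how scores turn into probabilities, the sketch below normalizes Boltzmann weights over an explicit list of alignment scores. The list is a hypothetical stand-in for the full set A (which is far too large to enumerate in practice), and the function name is an assumption.

```python
import math


def alignment_probabilities(scores, T):
    """P(a) = exp(S(a)/T) / Z, with Z summed over the given scores only."""
    weights = [math.exp(s / T) for s in scores]
    Z = sum(weights)  # partition function restricted to this toy set
    return [w / Z for w in weights]


# Two alignment scores borrowed from the affine gap example (0 and 0.9).
toy_scores = [0.0, 0.9]
probs_T1 = alignment_probabilities(toy_scores, T=1.0)
probs_T10 = alignment_probabilities(toy_scores, T=10.0)
```

Lower temperatures concentrate probability on the highest-scoring alignment, while higher temperatures flatten the distribution toward uniform.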

Partition function posterior probabilities

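• Alignment partition function (Miyazawa, Prot. Eng. 1995)

$$
Z(T) = \sum_{a \in A} e^{S(a)/T}
$$

• Subsequently

$$
Z^M_{i,j} = \sum_{a \in A_{i,j}} e^{S(a)/T} = \left( \sum_{a \in A_{i-1,j-1}} e^{S(a)/T} \right) e^{s(x_i, y_j)/T}
$$

The first sum runs over alignments of $x_{1..i}$ and $y_{1..j}$ in which $x_i$ is paired with $y_j$; the second runs over all alignments of $x_{1..i-1}$ and $y_{1..j-1}$.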

Partition function posterior probabilities

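• More generally the forward partition function matrices are calculated as

$$
\begin{aligned}
Z^M_{i,j} &= \left( Z^M_{i-1,j-1} + Z^E_{i-1,j-1} + Z^F_{i-1,j-1} \right) e^{s(x_i, y_j)/T} \\
Z^E_{i,j} &= Z^M_{i,j-1}\, e^{g/T} + Z^E_{i,j-1}\, e^{\mathrm{ext}/T} \\
Z^F_{i,j} &= Z^M_{i-1,j}\, e^{g/T} + Z^F_{i-1,j}\, e^{\mathrm{ext}/T} \\
Z_{i,j} &= Z^M_{i,j} + Z^E_{i,j} + Z^F_{i,j}
\end{aligned}
$$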
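These recursions can be sketched directly in Python. The function name is hypothetical and the boundary condition is an assumption not given on the slide: the empty alignment is assigned weight 1 in the M matrix at (0, 0), and every other cell starts at 0.

```python
import math


def forward_partition(x, y, score, g, ext, T):
    """Return (Z, ZM, ZE, ZF) for the forward partition function matrices."""
    m, n = len(x), len(y)
    ZM = [[0.0] * (n + 1) for _ in range(m + 1)]
    ZE = [[0.0] * (n + 1) for _ in range(m + 1)]
    ZF = [[0.0] * (n + 1) for _ in range(m + 1)]
    ZM[0][0] = 1.0  # assumed boundary: weight of the empty alignment
    w_open, w_ext = math.exp(g / T), math.exp(ext / T)
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                prev = ZM[i - 1][j - 1] + ZE[i - 1][j - 1] + ZF[i - 1][j - 1]
                ZM[i][j] = prev * math.exp(score(x[i - 1], y[j - 1]) / T)
            if j > 0:  # y_j paired with a gap
                ZE[i][j] = ZM[i][j - 1] * w_open + ZE[i][j - 1] * w_ext
            if i > 0:  # x_i paired with a gap
                ZF[i][j] = ZM[i - 1][j] * w_open + ZF[i - 1][j] * w_ext
    Z = ZM[m][n] + ZE[m][n] + ZF[m][n]
    return Z, ZM, ZE, ZF


def trivial_score(a, b):
    # Illustrative scores only: match = +1, mismatch = 0
    return 1.0 if a == b else 0.0


Z_single, _, _, _ = forward_partition("A", "A", trivial_score, -2.0, -0.1, 1.0)
Z_gap_only, _, _, _ = forward_partition("AC", "", trivial_score, -2.0, -0.1, 1.0)
```

As the recursions are written, a gap run is opened only from a match state, so an alignment in which a gap in one sequence is immediately followed by a gap in the other receives no weight.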

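Partition function matrices vs. standard affine recursions

$$
\begin{aligned}
Z^M_{i,j} &= \left( Z^M_{i-1,j-1} + Z^E_{i-1,j-1} + Z^F_{i-1,j-1} \right) e^{s(x_i, y_j)/T} \\
Z^E_{i,j} &= Z^M_{i,j-1}\, e^{g/T} + Z^E_{i,j-1}\, e^{\mathrm{ext}/T} \\
Z^F_{i,j} &= Z^M_{i-1,j}\, e^{g/T} + Z^F_{i-1,j}\, e^{\mathrm{ext}/T} \\
Z_{i,j} &= Z^M_{i,j} + Z^E_{i,j} + Z^F_{i,j}
\end{aligned}
$$

$$
\begin{aligned}
V(i, j) &= \max\{E(i, j),\ F(i, j),\ M(i, j)\} \\
M(i, j) &= V(i-1, j-1) + s(x_i, y_j) \\
E(i, j) &= \max\{E(i, j-1) + \mathrm{ext},\ V(i, j-1) + g\} \\
F(i, j) &= \max\{F(i-1, j) + \mathrm{ext},\ V(i-1, j) + g\}
\end{aligned}
$$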

Posterior probability calculation

• If we define Z’ as the “backward” partition function matrices then

$$
P(x_i \sim y_j) = \frac{Z^M_{i-1,j-1}\; Z'^M_{i+1,j+1}}{Z}\; e^{s(x_i, y_j)/T}
$$
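The forward and backward idea can be illustrated with a deliberately simplified sketch. Note the assumptions: it uses a single gap penalty for every gap position (no separate open and extension penalties) and one total matrix per direction instead of the M, E and F matrices, so it is a simplified relative of the formula above rather than an implementation of it. All names are hypothetical.

```python
import math


def posterior_linear_gap(x, y, score, gap, T):
    """P[i][j] = posterior that x[i] pairs with y[j] (0-based), linear gap model."""
    m, n = len(x), len(y)
    wg = math.exp(gap / T)

    def w(i, j):
        return math.exp(score(x[i], y[j]) / T)

    # Forward: F[i][j] = total weight of all alignments of x[:i] and y[:j].
    F = [[0.0] * (n + 1) for _ in range(m + 1)]
    F[0][0] = 1.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                F[i][j] += F[i - 1][j - 1] * w(i - 1, j - 1)
            if i > 0:
                F[i][j] += F[i - 1][j] * wg
            if j > 0:
                F[i][j] += F[i][j - 1] * wg
    # Backward: B[i][j] = total weight of all alignments of x[i:] and y[j:].
    B = [[0.0] * (n + 1) for _ in range(m + 1)]
    B[m][n] = 1.0
    for i in range(m, -1, -1):
        for j in range(n, -1, -1):
            if i < m and j < n:
                B[i][j] += B[i + 1][j + 1] * w(i, j)
            if i < m:
                B[i][j] += B[i + 1][j] * wg
            if j < n:
                B[i][j] += B[i][j + 1] * wg
    Z = F[m][n]
    # Weight of everything before the pair, the pair itself, and everything after.
    return [[F[i][j] * w(i, j) * B[i + 1][j + 1] / Z for j in range(n)]
            for i in range(m)]


def simple_score(a, b):
    # Illustrative scores only: match = +1, mismatch = -1
    return 1.0 if a == b else -1.0


P_single = posterior_linear_gap("A", "A", simple_score, -1.0, 1.0)
P_small = posterior_linear_gap("ACGT", "AGT", simple_score, -1.0, 1.0)
```

For two single residues the only alternatives to pairing them are the two orderings of a gap in each sequence, which is why the posterior in that case equals e / (e + 2e^{-2}) with these example parameters.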

Posterior probabilities using alignment ensembles

• By generating an ensemble A(n,x,y) of n alignments of x and y we can estimate $P(x_i \sim y_j)$ by counting the number of times $x_i$ is aligned to $y_j$ and dividing by n. Note that this means we are assigning equal weights to all alignments in the ensemble.

$$
P(x_i \sim y_j \in a^* \mid x, y) = \sum_{a \in A} P(a \mid x, y)\, \mathbf{1}\{x_i \sim y_j \in a\}
$$
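A minimal sketch of the counting estimate, assuming each alignment in the ensemble is represented as a list of 1-based residue pairs (i, j). The representation, the function name and the four-alignment ensemble are illustrative assumptions.

```python
from collections import Counter


def posterior_from_ensemble(ensemble):
    """Estimate P(x_i ~ y_j) as the fraction of alignments containing the pair."""
    counts = Counter()
    for pairs in ensemble:
        counts.update(set(pairs))  # each alignment counts a pair at most once
    n = len(ensemble)
    return {pair: c / n for pair, c in counts.items()}


# Hypothetical ensemble of n = 4 alignments of a 4-residue x and a 5-residue y.
toy_ensemble = [
    [(1, 1), (2, 2), (3, 4), (4, 5)],
    [(1, 1), (2, 2), (3, 4), (4, 5)],
    [(1, 1), (2, 2), (3, 3), (4, 5)],
    [(1, 1), (2, 2), (3, 4), (4, 5)],
]
P_hat = posterior_from_ensemble(toy_ensemble)
```

Setting P(a | x, y) = 1/n for every alignment in the ensemble turns the sum in the formula into exactly this fraction.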

Generating ensemble of alignments

• We can use stochastic backtracking (Muckstein et al., Bioinformatics, 2002) to generate a given number of optimal and suboptimal alignments.

• At every step in the traceback we assign a probability to each of the three possible positions.

• This allows us to “sample” alignments from their partition function probability distribution.

• Posterior probabilities estimated from the sampled ensemble turn out to be the same as those calculated using forward and backward partition function matrices.
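The sampling step can be sketched as follows. To keep the example short it assumes a single gap penalty per gap position and one forward matrix, rather than the affine M, E and F matrices used elsewhere in these slides; the function name and parameters are hypothetical.

```python
import math
import random


def sample_alignment(x, y, score, gap, T, rng):
    """Sample one alignment (as 1-based pairs) with probability exp(S(a)/T) / Z."""
    m, n = len(x), len(y)
    wg = math.exp(gap / T)

    def w(i, j):
        return math.exp(score(x[i - 1], y[j - 1]) / T)

    # Forward matrix: F[i][j] = total weight of all alignments of x[:i], y[:j].
    F = [[0.0] * (n + 1) for _ in range(m + 1)]
    F[0][0] = 1.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                F[i][j] += F[i - 1][j - 1] * w(i, j)
            if i > 0:
                F[i][j] += F[i - 1][j] * wg
            if j > 0:
                F[i][j] += F[i][j - 1] * wg
    # Stochastic traceback: choose each move in proportion to its contribution.
    pairs = []
    i, j = m, n
    while i > 0 or j > 0:
        moves, weights = [], []
        if i > 0 and j > 0:
            moves.append("match")
            weights.append(F[i - 1][j - 1] * w(i, j))
        if i > 0:
            moves.append("gap_in_y")
            weights.append(F[i - 1][j] * wg)
        if j > 0:
            moves.append("gap_in_x")
            weights.append(F[i][j - 1] * wg)
        move = rng.choices(moves, weights=weights)[0]
        if move == "match":
            pairs.append((i, j))
            i -= 1
            j -= 1
        elif move == "gap_in_y":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


def simple_score(a, b):
    # Illustrative scores only: match = +1, mismatch = -1
    return 1.0 if a == b else -1.0


rng = random.Random(0)
cold_sample = sample_alignment("ACGT", "AGT", simple_score, -1.0, 0.02, rng)
warm_samples = [sample_alignment("ACGT", "AGT", simple_score, -1.0, 1.0, rng)
                for _ in range(20)]
```

At a very low temperature the sampler returns the optimal alignment almost surely, while at higher temperatures suboptimal alignments appear in the ensemble in proportion to their Boltzmann weights.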

Probalign

1. For each pair of sequences (x,y) in the input set
  – a. Compute partition function matrices Z(T)
  – b. Estimate posterior probability matrix $P(x_i \sim y_j)$ for (x,y) by

$$
P(x_i \sim y_j) = \frac{Z^M_{i-1,j-1}\; Z'^M_{i+1,j+1}}{Z}\; e^{s(x_i, y_j)/T}
$$

2. Perform the probabilistic consistency transformation and compute a maximal expected accuracy multiple alignment: align sequence profiles along a guide-tree and follow by iterative refinement (Do et al.).

$$
V(i, j) = \max
\begin{cases}
V(i-1, j-1) + P(x_i \sim y_j) \\
V(i-1, j) \\
V(i, j-1)
\end{cases}
$$

Multiple protein alignment

• Protein sequence alignment: hard problem for multiple distantly related proteins

• Several standard protein alignment benchmarks available: BAliBASE, HOMSTRAD, OXBENCH, PREFAB, and SABMARK

• Benchmark alignments are based on manual and computational structural alignment of proteins with known structure.

Measure of accuracy

• Sum-of-pairs score: number of correctly aligned pairs divided by number of pairs in true alignment.

• Column score: number of correctly aligned columns

• Statistical significance using Friedman rank test

AACAGT        AACAGT
AAGT__        AA__GT

Blue: correct
Red: incorrect
Acc: 2/4 = 50%
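Both scores can be computed from gapped rows with a short sketch. Two assumptions are made here: the column score is normalized by the number of columns in the true alignment, and in the usage example the first alignment is arbitrarily treated as the true one, since the colors that marked this on the slide are not preserved. Function names are hypothetical.

```python
from itertools import combinations


def columns(rows, gap="_"):
    """Convert gapped rows into columns of 1-based residue indices (None = gap)."""
    counters = [0] * len(rows)
    cols = []
    for chars in zip(*rows):
        col = []
        for s, ch in enumerate(chars):
            if ch == gap:
                col.append(None)
            else:
                counters[s] += 1
                col.append(counters[s])
        cols.append(tuple(col))
    return cols


def residue_pairs(rows, gap="_"):
    """Set of aligned residue pairs, over every pair of sequences."""
    pairs = set()
    for col in columns(rows, gap):
        for s, t in combinations(range(len(col)), 2):
            if col[s] is not None and col[t] is not None:
                pairs.add((s, col[s], t, col[t]))
    return pairs


def sum_of_pairs_score(true_rows, est_rows, gap="_"):
    """Correctly aligned pairs divided by the number of pairs in the true alignment."""
    true_pairs = residue_pairs(true_rows, gap)
    return len(true_pairs & residue_pairs(est_rows, gap)) / len(true_pairs)


def column_score(true_rows, est_rows, gap="_"):
    """Fraction of true columns reproduced exactly (normalization is an assumption)."""
    true_cols = columns(true_rows, gap)
    est_cols = set(columns(est_rows, gap))
    return sum(c in est_cols for c in true_cols) / len(true_cols)


true_aln = ["AACAGT", "AAGT__"]  # assumed to be the true alignment
est_aln = ["AACAGT", "AA__GT"]
sp = sum_of_pairs_score(true_aln, est_aln)
cs = column_score(true_aln, est_aln)
```

For this pair of alignments the sum-of-pairs score is 2/4, matching the 50% on the slide whichever of the two alignments is taken as the reference.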

Experimental design

• Methods compared:
  – Probalign
  – PROBCONS
  – MUSCLE
  – MAFFT

• Probalign temperature parameter trained on RV11 subset of BAliBASE 3.0.

• Default (optimized) parameters for remaining programs

• All experiments performed on CIPRES cluster at SDSC

BAliBASE 3.0

Sum-of-pairs and column score accuracies (sum-of-pairs / column score)

Data   Probalign     MAFFT         Probcons      MUSCLE
RV11   69.3 / 45.3   67.1 / 44.6   67.0 / 41.7   59.3 / 35.9
RV12   94.6 / 86.2   93.6 / 83.8   94.1 / 85.5   91.7 / 80.4
RV20   92.6 / 43.9   92.7 / 45.3   91.7 / 40.6   89.2 / 35.1
RV30   85.2 / 56.4   85.6 / 56.9   84.5 / 54.4   80.3 / 38.3
RV40   92.2 / 60.3   92.0 / 59.7   90.3 / 53.2   86.7 / 47.1
RV50   89.3 / 55.2   90.0 / 56.2   89.4 / 57.3   85.7 / 48.7
All    87.6 / 58.9   87.1 / 58.6   86.4 / 55.8   82.5 / 48.5

Friedman rank test P-values

Method     RV11      RV12      RV20    RV30      RV40      RV50   All
MAFFT      NS        < 0.005   NS      NS        < 0.005   NS     < 0.005
Probcons   0.049     0.0233    NS      NS        < 0.005   NS     < 0.005
MUSCLE     < 0.005   < 0.005   0.008   < 0.005   < 0.005   NS     < 0.005

Heterogeneous length data I

BAliBASE datasets with maximum length and minimum deviation

Max length / Standard dev.   Probalign     MAFFT         Probcons      MUSCLE
500 / 100                    88.4 / 56.6   88.0 / 58.0   86.7 / 51.6   81.5 / 42.5
500 / 200                    88.5 / 54.6   87.0 / 51.9   87.2 / 48.9   81.9 / 42.4
1000 / 100                   91.4 / 58.1   90.4 / 55.7   89.7 / 51.6   84.3 / 44.1
1000 / 200                   90.7 / 55.0   89.3 / 51.4   89.2 / 48.7   83.2 / 42.5

BAliBASE datasets with long extensions (RV40)

Max length / Standard dev.   Probalign     MAFFT         Probcons
1000 / 100 (25)              92.7 / 59.3   91.0 / 54.8   89.9 / 48.2
1000 / 200 (20)              93.0 / 57.3   90.8 / 52.1   90.6 / 47.6

Heterogeneous length data II

BAliBASE 2.0 reference 6 datasets with max length and minimum deviation

Max length / Standard dev.   Probalign     MAFFT         Probcons
500 / 100 (40)               89.1 / 44.9   87.3 / 49.0   87.4 / 38.6
500 / 200 (21)               88.3 / 43.8   85.0 / 46.4   86.7 / 40.0
500 / 300 (9)                95.3 / 61.0   82.6 / 51.3   87.3 / 46.6
500 / 400 (5)                94.6 / 55.0   72.0 / 38.2   79.8 / 38.0
1000 / 100 (15)              90.2 / 43.3   82.4 / 36.9   85.4 / 27.6
1000 / 200 (12)              89.2 / 38.2   79.7 / 32.4   83.6 / 27.7
1000 / 300 (7)               94.5 / 52.8   78.3 / 42.4   83.9 / 34.6
1000 / 400 (5)               94.6 / 55.0   72.0 / 38.2   79.8 / 38.0