Lecture 12 Sequence Alignment 2

BINF6201/8201

Sequence alignment algorithms 2

10-15-2013

Global alignment vs. local alignment

The Needleman-Wunsch algorithm gives the optimal alignment of two sequences using the entire length of both sequences; the resulting alignment is therefore a global alignment.

We compute the global alignment of two sequences when we believe that the domain arrangements of the two sequences are similar.

[Figure: global alignment of sequences a and b]

However, very often we are more interested in aligning sub-regions/domains of two sequences. In such cases, we want to find the optimal local alignment between the two sequences.

[Figure: local alignment of sequences a and b, flanked by non-alignable regions]

Smith-Waterman local alignment algorithm

The Smith-Waterman algorithm (1981) uses dynamic programming to find the optimal local alignment between two sequences. The algorithm is modified from the Needleman-Wunsch algorithm.

To identify the optimal local alignment, the algorithm terminates an ongoing alignment between the first i letters of a and the first j letters of b if the alignment is not promising, i.e., when H(i,j) is negative, and restarts a new alignment by assigning H(i,j) = 0.

Therefore, H(i,j) is the score of the alignment running from the last terminating point to the i-th and j-th positions in a and b, respectively.

[Figure: alignment path. The alignment is terminated wherever H(i,j) < 0, and H(i,j) is reset to 0. The stretch ending at the maximal H(i,j) contains the optimal local alignment.]

Smith-Waterman local alignment algorithm

If we use the linear gap penalty function W(l) = -gl with g > 0, then the recursion relation of the Smith-Waterman algorithm is

$$
H(i,j) = \max
\begin{cases}
H(i-1,j-1) + S(a_i,b_j) & \text{diagonal} \\
H(i-1,j) - g & \text{vertical} \\
H(i,j-1) - g & \text{horizontal} \\
0 & \text{restart}
\end{cases}
$$

with the initial condition H(i,0) = 0 and H(0,j) = 0.

The optimal local alignments are identified by the cells in the alignment matrix that have the maximal score.

The alignment is recovered by backtracking from such a cell until a zero is encountered. Of course, the track of how each H(i,j) was computed needs to be stored in another matrix.

As in the case of global alignment, this algorithm has a time complexity of O(mn), where m and n are the lengths of a and b.
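The recursion and backtracking described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `smith_waterman` and `toy_score` are made up for this example, and `toy_score` is a simple match/mismatch score standing in for a real substitution matrix such as PAM250.

```python
def toy_score(x, y):
    # Hypothetical scoring for illustration only: +2 match, -1 mismatch (not PAM250).
    return 2 if x == y else -1


def smith_waterman(a, b, score, g):
    """Local alignment with a linear gap penalty g > 0, i.e. W(l) = -g*l."""
    m, n = len(a), len(b)
    # Row 0 and column 0 stay at zero: H(i,0) = H(0,j) = 0.
    H = [[0] * (n + 1) for _ in range(m + 1)]
    # trace[i][j] stores the move that produced H[i][j]:
    # "D" diagonal, "V" vertical, "H" horizontal, None for a restart.
    trace = [[None] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            candidates = [
                (0, None),  # restart; listed first so ties with 0 restart
                (H[i - 1][j - 1] + score(a[i - 1], b[j - 1]), "D"),
                (H[i - 1][j] - g, "V"),
                (H[i][j - 1] - g, "H"),
            ]
            H[i][j], trace[i][j] = max(candidates, key=lambda c: c[0])
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Backtrack from the maximal cell until a restart (zero) is reached.
    i, j = best_pos
    top, bottom = [], []
    while trace[i][j] is not None:
        move = trace[i][j]
        if move == "D":
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif move == "V":
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("SHAKE", "SPEARE", toy_score, 6))
```

With this toy scoring the result differs from what PAM250 would give; only the mechanics of filling the matrix, storing the trace, and backtracking are the point here.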

Smith-Waterman local alignment algorithm

Initialize the first row and column, and then compute each cell from the upper left corner to the bottom right corner.

[Figure: alignment matrix with b_1 ... b_n across the top and a_1 ... a_m down the side. The first row and column are filled with 0. Cell H(i,j) is computed from H(i-1,j-1) + S(a_i,b_j), H(i-1,j) - g, H(i,j-1) - g, and 0.]

Smith-Waterman local alignment algorithm

Using our toy sequences,

a: SHAKE and b: SPEARE,

we can compute the following alignment matrix using a linear gap penalty function, W(l) = -6l. Amino acid alignment scores are taken from PAM250.

[Figure: alignment matrix for SHAKE and SPEARE]

Through backtracking, we obtain the following local alignment:

[Figure: resulting local alignment]

Alignment algorithms with a general gap penalty

If the gap penalty function is in the general form W(l), then when filling a cell in the alignment matrix by a horizontal or vertical move, we need to determine the optimal value of l, i.e., how many spaces should have been inserted before the current one.

Here W(l) is the (non-positive) score contribution of a gap of length l, as in the linear case W(l) = -6l. The recursion relation for the global alignment is given by

$$
H(i,j) = \max
\begin{cases}
H(i-1,j-1) + S(a_i,b_j) & \text{diagonal} \\
\max_{1 \le l \le i} \, [H(i-l,j) + W(l)] & \text{vertical} \\
\max_{1 \le l \le j} \, [H(i,j-l) + W(l)] & \text{horizontal}
\end{cases}
$$

with the initial condition H(0,0) = 0, H(i,0) = W(i) and H(0,j) = W(j).

To compute a cell, up to i+j+1 calculations have to be made, which has a time complexity of O(n+m), or O(n) for sequences of similar length.

Since there are O(mn) cells, the entire algorithm runs in O(n^3), which is considered to be too slow, though it is still a polynomial-time algorithm.
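A minimal sketch of the general-gap global recursion, written to make the extra inner loop over l visible. It assumes W(l) returns a non-positive score (for example W(l) = -6l), and the names `global_score_general_gap` and `toy_score` are invented for this illustration; the match/mismatch scoring is a stand-in for a real substitution matrix.

```python
def toy_score(x, y):
    # Hypothetical scoring for illustration only: +2 match, -1 mismatch.
    return 2 if x == y else -1


def global_score_general_gap(a, b, score, W):
    """Score of the best global alignment under a general gap function W(l) <= 0."""
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    # Initial condition: H(0,0) = 0, H(i,0) = W(i), H(0,j) = W(j).
    for i in range(1, m + 1):
        H[i][0] = W(i)
    for j in range(1, n + 1):
        H[0][j] = W(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diagonal = H[i - 1][j - 1] + score(a[i - 1], b[j - 1])
            # Try every gap length l that could end at this cell:
            # up to i vertical and j horizontal candidates per cell.
            vertical = max(H[i - l][j] + W(l) for l in range(1, i + 1))
            horizontal = max(H[i][j - l] + W(l) for l in range(1, j + 1))
            H[i][j] = max(diagonal, vertical, horizontal)
    return H[m][n]


# Linear gap W(l) = -6l, and a gap function with a cheaper extension.
print(global_score_general_gap("A", "AC", toy_score, lambda l: -6 * l))
print(global_score_general_gap("AA", "ACCA", toy_score, lambda l: -(4 + l)))
```

The two `max(...)` generators are what turn the O(mn) linear-gap algorithm into an O(n^3) one.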

Alignment algorithms with a general gap penalty

The recursion relation for the local alignment version is

$$
H(i,j) = \max
\begin{cases}
H(i-1,j-1) + S(a_i,b_j) & \text{diagonal} \\
\max_{1 \le l \le i} \, [H(i-l,j) + W(l)] & \text{vertical} \\
\max_{1 \le l \le j} \, [H(i,j-l) + W(l)] & \text{horizontal} \\
0 & \text{restart}
\end{cases}
$$

with the initial condition H(i,0) = H(0,j) = 0.

To avoid too many adjacent short gaps separated by very short aligned stretches, the following relation must hold, with W(l) being the non-positive score of a gap of length l:

$$
W(l_1 + l_2) \ge W(l_1) + W(l_2).
$$

[Figure: a favorable alignment with one gap of length l1+l2, and an unfavorable alignment with two gaps of lengths l1 and l2 separated by a short aligned stretch]

That is, the penalty for a long gap should not be greater than the penalty for two short ones that add up to the same length. In general, this relation holds if the gap open penalty is larger than any mismatch penalty and the gap extension penalty.
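The relation between one long gap and two short gaps can be checked numerically for a candidate gap function. The sketch below assumes W(l) is the non-positive score of a gap of length l (as in W(l) = -6l), so the requirement reads W(l1 + l2) >= W(l1) + W(l2); the function name `satisfies_gap_relation` and the example gap functions are made up for this illustration.

```python
def satisfies_gap_relation(W, max_len):
    """Check W(l1 + l2) >= W(l1) + W(l2) for all 1 <= l1, l2 <= max_len."""
    return all(
        W(l1 + l2) >= W(l1) + W(l2)
        for l1 in range(1, max_len + 1)
        for l2 in range(1, max_len + 1)
    )


def linear(l):
    # Linear gap function from the toy example: W(l) = -6l.
    return -6 * l


def affine(l):
    # Hypothetical affine gap function: open penalty 10, extension penalty 1.
    return -(10 + (l - 1))


def quadratic(l):
    # Hypothetical gap function whose penalty grows faster than linearly.
    return -(l * l)


print(satisfies_gap_relation(linear, 20))     # equality holds for every pair
print(satisfies_gap_relation(affine, 20))     # one long gap is cheaper than two
print(satisfies_gap_relation(quadratic, 20))  # long gaps are over-penalized
```

A gap function that fails this check would make the algorithm prefer splitting one long gap into several short ones.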

Alignment algorithms with affine gap penalty

Because using a general form of gap penalty function slows down the algorithm, an affine gap penalty function is preferred.

When using an affine gap penalty function in dynamic programming, we only need to differentiate between the case in which the gap is first being introduced and the case in which the gap is being extended.

Let M(i,j) be the score of the best alignment up to the i-th letter in a and the j-th letter in b, in which a_i is aligned with b_j.

Let I(i,j) be the score of the best alignment up to the i-th letter in a and the j-th letter in b, in which a_i is aligned with a space.

Alignment algorithms with affine gap penalty

Let J(i,j) be the score of the best alignment up to the i-th letter in a and the j-th letter in b, in which b_j is aligned with a space.

The score of the best alignment to this point is the best of the three cases,

$$
H(i,j) = \max\big(M(i,j),\, I(i,j),\, J(i,j)\big).
$$

Therefore, we need to fill four, or at least three, separate matrices.

To compute M(i,j), we consider the following three possibilities:

1. a_{i-1} aligns with b_{j-1}, so we extend an alignment: M(i,j) = M(i-1,j-1) + S(a_i,b_j).

2. a_{i-1} aligns with a space, so we end a gap: M(i,j) = I(i-1,j-1) + S(a_i,b_j).

3. b_{j-1} aligns with a space, so we end a gap: M(i,j) = J(i-1,j-1) + S(a_i,b_j).

Alignment algorithms with affine gap penalty

Therefore, we have the following recursion to compute M(i,j),

$$
M(i,j) = \max
\begin{cases}
M(i-1,j-1) + S(a_i,b_j) \\
I(i-1,j-1) + S(a_i,b_j) \\
J(i-1,j-1) + S(a_i,b_j)
\end{cases}
$$

Initialization: M(0,0) = 0, M(0,j) = -∞ and M(i,0) = -∞.

To compute I(i,j), we consider the following two possibilities:

1. a_{i-1} aligns with b_j, so we open a gap: I(i,j) = M(i-1,j) - g_open.

2. a_{i-1} aligns with a space, so we extend the gap: I(i,j) = I(i-1,j) - g_ext.

We do not need to consider the case in which b_j aligns with a space, as it never happens by design (a high cost for opening a gap).

Therefore, we have the following recursion to compute I(i,j),

$$
I(i,j) = \max
\begin{cases}
M(i-1,j) - g_{open} \\
I(i-1,j) - g_{ext}
\end{cases}
$$

Initialization: I(0,0) = -∞, I(0,j) = -∞, I(1,j) = -g_open, I(i,0) = -g_open - g_ext (i-1).

Alignment algorithms with affine gap penalty

Similarly, to compute J(i,j), we consider the following two possibilities:

1. b_{j-1} aligns with a_i, so we open a gap: J(i,j) = M(i,j-1) - g_open.

2. b_{j-1} aligns with a space, so we extend the gap: J(i,j) = J(i,j-1) - g_ext.

Therefore, we have the following recursion to compute J(i,j),

$$
J(i,j) = \max
\begin{cases}
M(i,j-1) - g_{open} \\
J(i,j-1) - g_{ext}
\end{cases}
$$

Initialization: J(0,0) = -∞, J(i,0) = -∞, J(i,1) = -g_open, J(0,j) = -g_open - g_ext (j-1).

In this algorithm design, I- or J-type alignments can only be followed by the same type of alignment or by an M-type alignment. Thus we prevent alignments in which a space in one sequence is immediately followed by a space in the other sequence, such as

S--HAKE
SPE-ARE
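The three-matrix scheme (M for a_i aligned with b_j, I for a_i aligned with a space, J for b_j aligned with a space) can be sketched as follows. This is a score-only illustration, not the lecture's exact procedure: it assumes only the row-0 and column-0 boundary values are needed, so the special values I(1,j) and J(i,1) from the initialization are not set separately, and the names `affine_global_score` and `toy_score` are hypothetical.

```python
NEG_INF = float("-inf")


def toy_score(x, y):
    # Hypothetical scoring for illustration only: +2 match, -1 mismatch.
    return 2 if x == y else -1


def affine_global_score(a, b, score, g_open, g_ext):
    """Best global alignment score with affine gaps (g_open, g_ext > 0)."""
    m, n = len(a), len(b)
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    J = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0
    # Leading gaps: a_1..a_i or b_1..b_j aligned entirely with spaces.
    for i in range(1, m + 1):
        I[i][0] = -g_open - g_ext * (i - 1)
    for j in range(1, n + 1):
        J[0][j] = -g_open - g_ext * (j - 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = score(a[i - 1], b[j - 1])
            # M: extend an alignment, or end an I-type or J-type gap.
            M[i][j] = max(M[i - 1][j - 1], I[i - 1][j - 1], J[i - 1][j - 1]) + s
            # I: open a gap after a match column, or extend an existing one.
            I[i][j] = max(M[i - 1][j] - g_open, I[i - 1][j] - g_ext)
            # J: the same, with the roles of a and b exchanged.
            J[i][j] = max(M[i][j - 1] - g_open, J[i][j - 1] - g_ext)
    # I and J never read from each other, so a gap in one sequence
    # cannot be followed directly by a gap in the other.
    return max(M[m][n], I[m][n], J[m][n])


print(affine_global_score("AA", "ACCA", toy_score, 5, 1))
```

Each cell needs only a constant number of comparisons, so the running time stays O(mn) even though three matrices are filled.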

Alignment algorithms with affine gap penalty

To see this, consider setting g_open to a value large enough that

S(a,b) > -g_open

for any pair of letters a and b. Then alignment (I) always scores higher than alignment (II), because S(H,E) > -g_open.

(I)
S-HAKE
SPEARE

(II)
S--HAKE
SPE-ARE

As before, by adding the restart option and changing the initialization conditions of this global alignment algorithm, we can produce an algorithm for local alignment,

$$
H(i,j) = \max
\begin{cases}
M(i,j) & \text{diagonal} \\
I(i,j) & \text{vertical} \\
J(i,j) & \text{horizontal} \\
0 & \text{restart}
\end{cases}
$$

with the initial condition H(i,0) = H(0,j) = 0.

Effect of scoring parameters on the alignment

Although the dynamic programming algorithm guarantees the optimal alignment between two sequences under a given scoring system, changing the scoring system will produce different results.

[Figure: alignments between human and yeast hexokinase proteins using different gap open penalties]

Online pairwise alignment programs

The Needleman-Wunsch algorithm with an affine gap penalty function has been implemented by a few groups, and the programs are freely available both as standalone and as web-based applications.

One example is the GGSEARCH program at EBI: http://www.ebi.ac.uk/Tools/fasta33/index.html?program=GGSEARCH

Online pairwise alignment programs

The Smith-Waterman algorithm with an affine gap penalty function has also been implemented by a few groups, and the programs are freely available both as standalone and as web-based applications.

One example is the SSEARCH program. Web-based application: http://www.ebi.ac.uk/Tools/fasta33/index.html?program=SSEARCH

Online pairwise alignment programs

The MPsrch program is another implementation of the Smith-Waterman algorithm, with parallelization.

Web-based application: http://www.ebi.ac.uk/Tools/MPsrch/index.html

Standalone program: http://www.ebi.ac.uk/Tools/webservices/clients/mpsrch

Multiple sequence alignment

Multiple sequence alignment (MSA): an alignment of three or more sequences such that the alignment has the maximal score given a scoring matrix and a gap penalty function.

Theoretically, MSA can be solved by multidimensional dynamic programming. However, it has a time complexity of O(N^S) for aligning S sequences of length N, so it can only be applied to a few sequences.

In fact, it has been shown that MSA is an NP-hard problem; therefore, there is no known efficient algorithm to solve it.

Various heuristic algorithms have been proposed to align multiple sequences. They generally perform well when the sequences to be aligned are not too distantly related to one another.

Most of these heuristic algorithms, such as the ClustalX/W algorithm, use a progressive alignment method to align multiple sequences.

Progressive alignment algorithm

This method starts with the most confident pairwise alignment, and then gradually adds each sequence or group of sequences to the already aligned MSA using a guide tree.

For example, the Clustal algorithm first constructs a phylogenetic tree of the sequences to be aligned using the pairwise alignments of the sequences.

The evolutionary distance between two sequences can be estimated by the Kimura estimator,

$$
d = -\ln(1 - D - 0.2 D^2),
$$

where D is the fraction of aligned sites at which the two sequences differ.

Evolutionary distance can also be calculated from the alignment score using the Feng and Doolittle formula,

$$
d = -100 \ln\!\left[\frac{S - S_{rand}}{S_{ident} - S_{rand}}\right],
$$

where S is the alignment score of the two sequences, S_rand is the average score to align two random sequences, and S_ident is the average score to align two identical sequences.
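The two distance estimates can be written as small Python functions. This is a sketch under assumptions: D is taken to be the fraction of aligned sites that differ, the Feng and Doolittle formula is taken in its usual form d = -100 ln[(S - S_rand) / (S_ident - S_rand)], and the function names and the numbers in the usage lines are invented for illustration.

```python
import math


def kimura_distance(D):
    """Kimura estimate d = -ln(1 - D - 0.2 D^2), with D the fraction of differing sites."""
    return -math.log(1.0 - D - 0.2 * D * D)


def feng_doolittle_distance(S, S_rand, S_ident):
    """Distance from alignment scores: d = -100 ln[(S - S_rand) / (S_ident - S_rand)]."""
    return -100.0 * math.log((S - S_rand) / (S_ident - S_rand))


# Hypothetical inputs, chosen only to show how the functions are called.
print(kimura_distance(0.5))
print(feng_doolittle_distance(S=60.0, S_rand=10.0, S_ident=110.0))
```

Both estimates are zero for identical sequences and grow without bound as the sequences approach the similarity expected for random sequences, so they are only defined while the argument of the logarithm stays positive.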

Progressive alignment algorithm

Clustal uses the neighbor-joining method to construct the tree from the computed evolutionary distance matrix.

Using the tree as a guide, Clustal first aligns the two pairs of sequences that have the shortest evolutionary distances, i.e., HXK2_RAT with HXK2_HUMAN, and HXK1_RAT with HXK1_HUMAN, using the Needleman-Wunsch global alignment algorithm.

These two pairwise alignments are then aligned to form a cluster of four sequences.

This process is repeated until all clusters are joined to form a single cluster.

Alignment algorithm of two clusters of sequences

The algorithm for aligning two clusters of sequences is essentially the same as that for aligning two sequences, but all the sequences in a cluster are treated as if they were a single sequence.

If a space is introduced in a cluster, it is inserted at the same position in all sequences in the cluster.

Alignment algorithm of two clusters of sequences

To score the aligned sites i and j in two clusters, we can use the average of the individual scores for the amino acid pairs that can be formed between the clusters,

$$
S[\mathrm{cluster}_1(i), \mathrm{cluster}_2(j)] = \frac{1}{n_1 n_2} \sum_{k=1}^{n_1} \sum_{t=1}^{n_2} S[a_k^1(i), a_t^2(j)],
$$

where n_1 and n_2 are the numbers of sequences in clusters 1 and 2, and a_k^1(i) and a_t^2(j) are the amino acids at sites i and j in the k-th and t-th sequences of clusters 1 and 2, respectively.

For example, the score for aligning the following sites from two clusters,

Position i in cluster 1: P, A
Position j in cluster 2: I, R

would be

$$
S[\mathrm{cluster}_1(i), \mathrm{cluster}_2(j)] = \frac{1}{2 \times 2} \, [S(P,I) + S(P,R) + S(A,I) + S(A,R)].
$$
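The averaging over all cross-cluster pairs is a short double loop. The sketch below assumes each cluster site is given as a string of residues, one per sequence; `cluster_site_score` and `toy_score` are hypothetical names, and the match/mismatch scoring stands in for a real substitution matrix.

```python
def toy_score(x, y):
    # Hypothetical scoring for illustration only: +2 match, -1 mismatch.
    return 2 if x == y else -1


def cluster_site_score(column1, column2, score):
    """Average score over all residue pairs formed between two cluster sites."""
    total = sum(score(x, y) for x in column1 for y in column2)
    return total / (len(column1) * len(column2))


# Site i of cluster 1 holds P and A; site j of cluster 2 holds I and R.
print(cluster_site_score("PA", "IR", toy_score))
# A site that shares a residue with the other cluster scores higher.
print(cluster_site_score("AA", "AR", toy_score))
```

How columns containing spaces should be scored is not covered by this sketch.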

Problems of progressive alignment algorithms

Because of the heuristic nature of the progressive alignment algorithm, a globally optimal alignment is not guaranteed.

In particular, the errors made in an earlier step cannot be corrected by the later steps.

To avoid the bias caused by aligning very closely related sequences in the earlier steps, Clustal uses a weighted scoring system, i.e., a smaller weight is given to closely related sequences when computing the alignment scores,

$$
S[\mathrm{cluster}_1(i), \mathrm{cluster}_2(j)] = \frac{1}{n_1 n_2} \sum_{k=1}^{n_1} \sum_{t=1}^{n_2} w_{k,t}^{1,2} \, S[a_k^1(i), a_t^2(j)],
$$

where w_{k,t}^{1,2} is the weight for the pair formed by the k-th sequence in cluster 1 and the t-th sequence in cluster 2.

Sometimes, manual adjustment is needed in the regions that are not aligned very well.

Online multiple sequence alignment programs

The popular multiple sequence alignment programs include:

1. ClustalX/W: http://www.clustal.org/

2. T-Coffee: http://www.tcoffee.org/Projects_home_page/t_coffee_home_page.html

3. MUSCLE: http://www.drive5.com/muscle/

Since MSA algorithms are still an active research area, new algorithms and programs are expected in the future.

A recent development is the SATe algorithm, which iteratively constructs the guide tree and the alignment until a convergence criterion is met: http://people.ku.edu/~jyu/sate/sate.html
