BINF6201/8201 Sequence alignment algorithms 2 10-15-2013
Lecture 12 Sequence Alignment 2

Nov 23, 2015

Barar Ciprian

Transcript
  • BINF6201/8201

    Sequence alignment algorithms 2

    10-15-2013

  • Global alignment vs. local alignment

    The Needleman-Wunsch algorithm gives the optimal alignment of two sequences using the entire length of both sequences; the resulting alignment is therefore a global alignment.

    We compute the global alignment of two sequences when we believe that the domain arrangements of the two sequences are similar.

    [Figure: global alignment spanning the full lengths of sequences a and b]

    However, very often we are more interested in aligning sub-regions/domains of two sequences. In such cases, we want to find the optimal local alignment between the two sequences.

    [Figure: local alignment between sub-regions of sequences a and b, flanked by non-alignable regions]

  • Smith-Waterman local alignment algorithm

    The Smith-Waterman algorithm (1981) uses dynamic programming to find the optimal local alignment between two sequences. The algorithm is modified from the Needleman-Wunsch algorithm.

    To identify the optimal local alignment, the algorithm terminates an ongoing alignment between the first i letters of a and the first j letters of b if the alignment is not promising, i.e., when H(i,j) is negative, and restarts a new alignment by assigning H(i,j) = 0.

    Therefore, H(i,j) is the score of the alignment running from the last terminating point to the i-th and j-th positions in a and b, respectively.

    [Figure: path through the alignment matrix. The alignment is terminated wherever H(i,j) < 0, and H(i,j) = 0 is assigned; the segment ending at the maximal H(i,j) contains the optimal local alignment.]

  • Smith-Waterman local alignment algorithm

    If we use a linear gap penalty function with a penalty of g per space, then the recursion relation of the Smith-Waterman algorithm is

    $$
    H(i,j) = \max \begin{cases}
    H(i-1,j-1) + S(a_i,b_j) & \text{diagonal} \\
    H(i-1,j) - g & \text{vertical} \\
    H(i,j-1) - g & \text{horizontal} \\
    0 & \text{restart}
    \end{cases}
    $$

    with the initial conditions H(i,0) = 0 and H(0,j) = 0.

    The optimal local alignments are identified by the cells in the alignment matrix that have the maximal score. An alignment is recovered by backtracking from such a cell until a zero is encountered. Of course, the track of how each H(i,j) was computed needs to be stored in another matrix.

    As in the case of global alignment, this algorithm has a time complexity of O(mn).
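
    The recursion and the backtracking step can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: the function names and the match/mismatch scorer `simple_score` are assumptions made for the example and stand in for a real substitution matrix such as PAM250.

```python
def simple_score(x, y):
    # Hypothetical stand-in for a substitution matrix: +2 match, -1 mismatch.
    return 2 if x == y else -1


def smith_waterman(a, b, score, g):
    """Local alignment with a linear gap penalty g (a positive number).

    Returns (best score, aligned part of a, aligned part of b).
    """
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]        # H(i,0) = H(0,j) = 0
    move = [[None] * (n + 1) for _ in range(m + 1)]  # how each cell was reached
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            options = [
                (0, None),                                             # restart
                (H[i - 1][j - 1] + score(a[i - 1], b[j - 1]), "diag"),  # diagonal
                (H[i - 1][j] - g, "up"),                               # vertical
                (H[i][j - 1] - g, "left"),                             # horizontal
            ]
            H[i][j], move[i][j] = max(options, key=lambda t: t[0])
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Backtrack from the maximal cell until a zero is encountered.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if move[i][j] == "diag":
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif move[i][j] == "up":
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("XXHELLOYY", "ZZHELLOWW", simple_score, 2))
# prints (10, 'HELLO', 'HELLO')
```

    Only the shared core is reported; the flanking letters would lower the score, so the restart option cuts them off.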

  • Smith-Waterman local alignment algorithm

    Initialize the first row and column, and then compute each cell from the upper left corner to the bottom right corner.

    [Figure: alignment matrix with rows a_1 ... a_m and columns b_1 ... b_n. The first row and the first column are filled with 0. Each cell H(i,j) is computed from H(i-1,j-1) + S(a_i,b_j), H(i-1,j) - g, H(i,j-1) - g, and 0.]

  • Smith-Waterman local alignment algorithm

    Using our toy sequences,

    a: SHAKE and b: SPEARE,

    we can compute the alignment matrix using a linear gap penalty function, W(l) = -6l, with amino acid alignment scores taken from PAM250.

    [Figure: Smith-Waterman alignment matrix for SHAKE and SPEARE]

    Through backtracking, we obtain the local alignment.

    [Figure: the resulting local alignment]

  • Alignment algorithms with a general gap penalty

    If the gap penalty function has the general form W(l), then when filling a cell in the alignment matrix by a horizontal or vertical move, we need to determine the optimal value of l, i.e., how many spaces should have been inserted before the current one.

    The recursion relation for the global alignment is given by

    $$
    H(i,j) = \max \begin{cases}
    H(i-1,j-1) + S(a_i,b_j) & \text{diagonal} \\
    \max_{1 \le l \le i} \, [H(i-l,j) + W(l)] & \text{vertical} \\
    \max_{1 \le l \le j} \, [H(i,j-l) + W(l)] & \text{horizontal}
    \end{cases}
    $$

    with the initial conditions H(0,0) = 0, H(i,0) = W(i) and H(0,j) = W(j).

    To compute a cell, up to i+j+1 calculations have to be made, which takes O(n+m) time, or O(n) for sequences of comparable length.

    Therefore the entire algorithm runs in O(n^3), which is considered to be too slow, though it is a polynomial-time algorithm.
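
    A short Python sketch of this recursion makes the extra inner loops visible. It assumes that W(l) returns a non-positive score contribution (as in W(l) = -6l); the function names and the match/mismatch scorer are hypothetical and chosen only for illustration.

```python
def simple_score(x, y):
    # Hypothetical stand-in for a substitution matrix: +2 match, -1 mismatch.
    return 2 if x == y else -1


def global_score_general_gap(a, b, score, W):
    """Global alignment score for an arbitrary gap function W(l) <= 0."""
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        H[i][0] = W(i)                      # H(i,0) = W(i)
    for j in range(1, n + 1):
        H[0][j] = W(j)                      # H(0,j) = W(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diagonal = H[i - 1][j - 1] + score(a[i - 1], b[j - 1])
            # Try every possible gap length l ending at this cell: these two
            # scans are what raise the cost per cell from O(1) to O(i + j).
            vertical = max(H[i - l][j] + W(l) for l in range(1, i + 1))
            horizontal = max(H[i][j - l] + W(l) for l in range(1, j + 1))
            H[i][j] = max(diagonal, vertical, horizontal)
    return H[m][n]


# One gap of length 2 is cheaper than two gaps of length 1 when W is not linear.
print(global_score_general_gap("AAAA", "AA", simple_score, lambda l: -3 - (l - 1)))  # prints 0
print(global_score_general_gap("AAAA", "AA", simple_score, lambda l: -3 * l))        # prints -2
```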

  • Alignment algorithms with a general gap penalty

    The recursion relation for the local alignment version is

    $$
    H(i,j) = \max \begin{cases}
    H(i-1,j-1) + S(a_i,b_j) & \text{diagonal} \\
    \max_{1 \le l \le i} \, [H(i-l,j) + W(l)] & \text{vertical} \\
    \max_{1 \le l \le j} \, [H(i,j-l) + W(l)] & \text{horizontal} \\
    0 & \text{restart}
    \end{cases}
    $$

    with the initial condition H(i,0) = H(0,j) = 0.

    To avoid too many adjacent short gaps separated by too short alignments, the following relation must hold:

    $$
    W(l_1 + l_2) \ge W(l_1) + W(l_2).
    $$

    [Figure: a favorable alignment with a single gap of length l_1 + l_2, and an unfavorable alignment with two gaps of lengths l_1 and l_2 separated by a short aligned segment]

    That is, the penalty for a long gap should not be greater than the penalty for two short ones that add up to the same length (W is negative here, so a smaller penalty means a larger W). In general, this relation holds if the gap opening penalty is larger than any mismatch and gap extension penalty.
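
    As a worked check of the relation W(l_1 + l_2) >= W(l_1) + W(l_2), assume an affine penalty W(l) = -g_open - g_ext (l - 1) with positive g_open and g_ext. This specific form is an assumption made for the example, written with the same sign convention as W(l) = -6l.

```latex
\begin{align*}
W(l_1 + l_2) - \bigl[ W(l_1) + W(l_2) \bigr]
  &= -g_{open} - g_{ext}(l_1 + l_2 - 1) + 2 g_{open} + g_{ext}(l_1 + l_2 - 2) \\
  &= g_{open} - g_{ext}
\end{align*}
```

    Under this assumption the relation holds exactly when g_open >= g_ext, i.e., when opening a gap costs at least as much as extending one.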

  • Alignment algorithms with affine gap penalty

    Because using a general form of the gap penalty function slows down the algorithm, an affine gap penalty function is preferred.

    When using an affine gap penalty function in dynamic programming, we only need to differentiate between the case in which a gap is first being introduced and the case in which a gap is being extended.

    Let M(i,j) be the score of the best alignment up to the i-th letter in a and the j-th letter in b, in which a_i is aligned with b_j.

    [Figure: alignment ending with a_i aligned with b_j, scored by M(i,j)]

    Let I(i,j) be the score of the best alignment up to the i-th letter in a and the j-th letter in b, in which a_i is aligned with a space.

    [Figure: alignment ending with a_i aligned with a space, scored by I(i,j)]

  • Alignment algorithms with affine gap penalty

    Let J(i,j) be the score of the best alignment up to the i-th letter in a and the j-th letter in b, in which b_j is aligned with a space.

    [Figure: alignment ending with b_j aligned with a space, scored by J(i,j)]

    The score of the best alignment to this point is the best of the three cases,

    $$
    H(i,j) = \max(M(i,j), I(i,j), J(i,j)).
    $$

    Therefore, we need to fill four, or at least three, separate matrices.

    To compute M(i,j), we consider the following three possibilities:

    $$
    M(i,j) = M(i-1,j-1) + S(a_i,b_j)
    $$

    a_{i-1} aligns with b_{j-1}, so we extend an alignment.

    $$
    M(i,j) = I(i-1,j-1) + S(a_i,b_j)
    $$

    a_{i-1} aligns with a space, so we end a gap.

    $$
    M(i,j) = J(i-1,j-1) + S(a_i,b_j)
    $$

    b_{j-1} aligns with a space, so we end a gap.

  • Alignment algorithms with affine gap penalty

    Therefore, we have the following recursion to compute M(i,j):

    $$
    M(i,j) = \max \begin{cases}
    M(i-1,j-1) + S(a_i,b_j) \\
    I(i-1,j-1) + S(a_i,b_j) \\
    J(i-1,j-1) + S(a_i,b_j)
    \end{cases}
    $$

    Initialization: M(0,0) = 0, M(0,j) = -∞ and M(i,0) = -∞.

    To compute I(i,j), we consider the following two possibilities:

    $$
    I(i,j) = M(i-1,j) - g_{open}
    $$

    a_{i-1} aligns with b_j, so we open a gap.

    $$
    I(i,j) = I(i-1,j) - g_{ext}
    $$

    a_{i-1} aligns with a space, so we extend the gap.

    We do not need to consider the case in which b_j aligns with a space immediately before a_i does, as it never happens by design (a high cost for opening a gap).

    Therefore, we have the following recursion to compute I(i,j):

    $$
    I(i,j) = \max \begin{cases}
    M(i-1,j) - g_{open} \\
    I(i-1,j) - g_{ext}
    \end{cases}
    $$

    Initialization: I(0,0) = -∞, I(0,j) = -∞, I(1,j) = -g_open, I(i,0) = -g_open - g_ext (i-1).

  • Alignment algorithms with affine gap penalty

    Similarly, to compute J(i,j), we consider the following two possibilities:

    $$
    J(i,j) = M(i,j-1) - g_{open}
    $$

    b_{j-1} aligns with a_i, so we open a gap.

    $$
    J(i,j) = J(i,j-1) - g_{ext}
    $$

    b_{j-1} aligns with a space, so we extend the gap.

    Therefore, we have the following recursion to compute J(i,j):

    $$
    J(i,j) = \max \begin{cases}
    M(i,j-1) - g_{open} \\
    J(i,j-1) - g_{ext}
    \end{cases}
    $$

    Initialization: J(0,0) = -∞, J(i,0) = -∞, J(i,1) = -g_open, J(0,j) = -g_open - g_ext (j-1).

    In this algorithm design, I or J type alignments can only be followed by the same type of alignment or by an M type alignment. Thus we prevent alignments in which a space in one sequence is immediately followed by a space in the other sequence, such as

    S--HAKE
    SPE-ARE
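
    The three recursions for M, I and J can be combined into a short Python sketch that returns only the optimal global score. The function names and the match/mismatch scorer are hypothetical. For the borders, this sketch assumes M(0,0) = 0, I(i,0) = -g_open - g_ext (i-1), J(0,j) = -g_open - g_ext (j-1), and -∞ for every other border cell.

```python
NEG_INF = float("-inf")


def simple_score(x, y):
    # Hypothetical stand-in for a substitution matrix: +2 match, -1 mismatch.
    return 2 if x == y else -1


def affine_global_score(a, b, score, g_open, g_ext):
    """Global alignment score with affine gaps, using the M, I and J matrices."""
    m, n = len(a), len(b)
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # a_i aligned with b_j
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # a_i aligned with a space
    J = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # b_j aligned with a space
    M[0][0] = 0
    for i in range(1, m + 1):
        I[i][0] = -g_open - g_ext * (i - 1)
    for j in range(1, n + 1):
        J[0][j] = -g_open - g_ext * (j - 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = score(a[i - 1], b[j - 1])
            # An aligned pair may follow an aligned pair or close either gap type.
            M[i][j] = max(M[i - 1][j - 1], I[i - 1][j - 1], J[i - 1][j - 1]) + s
            # A gap is either opened after an aligned pair or extended;
            # an I gap never directly follows a J gap, and vice versa.
            I[i][j] = max(M[i - 1][j] - g_open, I[i - 1][j] - g_ext)
            J[i][j] = max(M[i][j - 1] - g_open, J[i][j - 1] - g_ext)
    return max(M[m][n], I[m][n], J[m][n])


print(affine_global_score("AAAA", "AA", simple_score, 3, 1))  # prints 0
```

    In the usage line, the best alignment pairs two A's and then pays for one gap of length two: 2 + 2 - 3 - 1 = 0.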

  • Alignment algorithms with affine gap penalty

    To see this, consider setting g_open to a value large enough that

    $$
    S(a,b) > -g_{open}
    $$

    for any pair of letters a and b. Then alignment (II) always scores higher than alignment (I), because S(H,E) > -g_open.

    (I)   S--HAKE
          SPE-ARE

    (II)  S-HAKE
          SPEARE

    As before, by adding the restart option and changing the initialization conditions of this global alignment algorithm, we can produce an algorithm for local alignment:

    $$
    H(i,j) = \max \begin{cases}
    M(i,j) & \text{diagonal} \\
    I(i,j) & \text{vertical} \\
    J(i,j) & \text{horizontal} \\
    0 & \text{restart}
    \end{cases}
    $$

    with the initial condition H(i,0) = H(0,j) = 0.
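
    A minimal Python sketch of the local variant follows. The slide does not spell out where the restart enters the M, I and J recursions, so this sketch makes an assumption: the restart value 0 is offered as a predecessor inside the M recursion, which is one common way to realize H(i,j) = max(M, I, J, 0). The function names and the match/mismatch scorer are hypothetical.

```python
NEG_INF = float("-inf")


def simple_score(x, y):
    # Hypothetical stand-in for a substitution matrix: +2 match, -1 mismatch.
    return 2 if x == y else -1


def affine_local_score(a, b, score, g_open, g_ext):
    """Best local alignment score with affine gap penalties."""
    m, n = len(a), len(b)
    M = [[0] * (n + 1) for _ in range(m + 1)]        # borders act as restarts
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    J = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = score(a[i - 1], b[j - 1])
            # The 0 is the restart option: start a fresh alignment at this pair.
            M[i][j] = max(0, M[i - 1][j - 1], I[i - 1][j - 1], J[i - 1][j - 1]) + s
            I[i][j] = max(M[i - 1][j] - g_open, I[i - 1][j] - g_ext)
            J[i][j] = max(M[i][j - 1] - g_open, J[i][j - 1] - g_ext)
            # An optimal local alignment never ends in a gap, so tracking the
            # maximum over the M cells is enough.
            best = max(best, M[i][j])
    return best


print(affine_local_score("AAAACCCC", "AAAAGGCCCC", simple_score, 3, 1))  # prints 12
```

    In the usage line, eight matches (16) minus one gap of length two (3 + 1) beats every ungapped alternative, which reaches at most 10.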

  • Effect of scoring parameters on the alignment

    Although a dynamic programming algorithm guarantees the optimal alignment between two sequences under a given scoring system, changing the scoring system produces different results.

    [Figure: alignments between human and yeast hexokinase proteins computed with different gap open penalties]

  • Online pairwise alignment programs

    The Needleman-Wunsch algorithm with an affine gap penalty function has been implemented by a few groups, and the programs are freely available as both standalone and web-based applications.

    One example is the GGSEARCH program at EBI: http://www.ebi.ac.uk/Tools/fasta33/index.html?program=GGSEARCH

  • Online pairwise alignment programs

    The Smith-Waterman algorithm with an affine gap penalty function has also been implemented by a few groups, and the programs are freely available as both standalone and web-based applications.

    One example is the SSEARCH program. Web-based application: http://www.ebi.ac.uk/Tools/fasta33/index.html?program=SSEARCH

  • Online pairwise alignment programs

    The MPsrch program is another implementation of the Smith-Waterman algorithm, with parallelization.

    Web-based application: http://www.ebi.ac.uk/Tools/MPsrch/index.html

    Standalone program: http://www.ebi.ac.uk/Tools/webservices/clients/mpsrch

  • Multiple sequence alignment

    Multiple sequence alignment (MSA): alignment of three or more sequences, such that the alignment has the maximal score given a scoring matrix and a gap penalty function.

    Theoretically, MSA can be solved by multidimensional dynamic programming; however, this has a time complexity of O(N^S) for aligning S sequences of length N, so it can only be applied to a few sequences.

    In fact, it has been shown that MSA is an NP-hard problem; therefore, there is no known efficient algorithm to solve it.

    Various heuristic algorithms have been proposed to align multiple sequences. They generally perform well when the sequences to be aligned are not too distantly related to one another.

    Most of these heuristic algorithms, such as the ClustalX/W algorithm, use a progressive alignment method to align multiple sequences.
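
    To get a feel for how quickly the multidimensional dynamic programming matrix grows with the number of sequences S, the arithmetic below uses N = 300 residues. That value is chosen purely for illustration and does not come from the lecture.

```latex
\begin{align*}
S = 2:&\quad N^S = 300^2 = 9 \times 10^{4} \\
S = 3:&\quad N^S = 300^3 = 2.7 \times 10^{7} \\
S = 5:&\quad N^S = 300^5 = 2.43 \times 10^{12}
\end{align*}
```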

  • Progressive alignment algorithm

    This method starts with the most confident pairwise alignment, and then gradually adds each sequence or group of sequences to the already aligned MSA using a guide tree.

    For example, the Clustal algorithm first constructs a phylogenetic tree of the sequences to be aligned using the pairwise alignments of the sequences.

    The evolutionary distance between two sequences can be estimated by the Kimura estimator,

    $$
    d = -\ln(1 - D - 0.2 D^2).
    $$

    Evolutionary distance can also be calculated from the alignment score S using the Feng and Doolittle formula,

    $$
    d = -100 \, [\ln(S - S_{rand}) - \ln(S_{ident} - S_{rand})],
    $$

    where S_rand is the average score for aligning two random sequences, and S_ident is the average score for aligning two identical sequences.
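
    Both distance estimates are one-line formulas, sketched here in Python. The function names are hypothetical. D is taken to be the fraction of aligned sites at which the two sequences differ (the usual reading of Kimura's formula; the slide does not define it), and S is the score of the pairwise alignment of the two sequences.

```python
import math


def kimura_distance(D):
    """Kimura estimator, d = -ln(1 - D - 0.2 D^2).

    D is assumed to be the fraction of differing sites.
    """
    return -math.log(1.0 - D - 0.2 * D * D)


def feng_doolittle_distance(S, S_rand, S_ident):
    """Feng and Doolittle distance computed from alignment scores."""
    return -100.0 * (math.log(S - S_rand) - math.log(S_ident - S_rand))


# The estimated distance exceeds the observed fraction of differences,
# because the formula corrects for multiple substitutions at the same site.
print(kimura_distance(0.5) > 0.5)  # prints True
```

    `math.log` raises `ValueError` for non-positive arguments, so both functions fail for very divergent inputs (for example, S at or below S_rand) instead of returning a distance.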

  • Progressive alignment algorithm

    Clustal uses the neighbor-joining method to construct the tree from the computed evolutionary distance matrix.

    Using the tree as a guide, Clustal aligns the two pairs of sequences that have the shortest evolutionary distances, i.e., HXK2 RAT and HXK2 HUMAN, and HXK1 RAT and HXK1 HUMAN, using the Needleman-Wunsch global alignment algorithm.

    These two pairwise alignments are then aligned to form a cluster of four sequences.

    This process is repeated until all clusters are joined to form a single cluster.

  • Alignment algorithm of two clusters of sequences

    The algorithm for aligning two clusters of sequences is essentially the same as that for aligning two sequences, but all the sequences in a cluster are treated as if they were a single sequence.

    If a space is introduced in the cluster, it will be inserted at the same position in all sequences in the cluster.
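
    The rule that a space goes into every sequence of a cluster at the same position can be sketched in a few lines of Python. The representation of a cluster as a list of equal-length, already aligned strings, and the function name, are assumptions made for this example.

```python
def insert_gap_column(cluster, position):
    """Insert a space ('-') at the same column in every sequence of a cluster.

    `cluster` is assumed to be a list of equal-length aligned strings.
    """
    return [seq[:position] + "-" + seq[position:] for seq in cluster]


print(insert_gap_column(["SHAKE", "SH-KE"], 2))
# prints ['SH-AKE', 'SH--KE']
```

    Because every row receives the space at the same column, the alignment that already exists inside the cluster is left unchanged.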

  • Alignment algorithm of two clusters of sequences

    To score the aligned sites i and j in two clusters, we can use the average of the individual scores for the amino acid pairs that can be formed between the clusters,

    $$
    S[\mathrm{cluster}_1(i), \mathrm{cluster}_2(j)] = \frac{1}{n_1 n_2} \sum_{k=1}^{n_1} \sum_{t=1}^{n_2} S[a_k^1(i), a_t^2(j)],
    $$

    where n_1 and n_2 are the numbers of sequences in clusters 1 and 2, and a_k^1(i) and a_t^2(j) are the amino acids at the sites i and j in the k-th and t-th sequences in clusters 1 and 2, respectively.

    For example, the score for aligning the following sites from two clusters,

    Position i in cluster 1: P, A

    Position j in cluster 2: I, R

    would be

    $$
    S[\mathrm{cluster}_1(i), \mathrm{cluster}_2(j)] = \frac{1}{2 \times 2} [S(P,I) + S(P,R) + S(A,I) + S(A,R)].
    $$
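
    The averaging over all pairs between the two clusters can be written directly as a double loop. In this Python sketch the function name is hypothetical, and the numbers in `toy_scores` are made up for illustration; they are not claimed to be entries of PAM250 or any other published matrix.

```python
def cluster_site_score(column1, column2, score):
    """Average pair score between site i of cluster 1 and site j of cluster 2.

    column1 and column2 hold the residues found at those sites, one entry
    per sequence in the respective cluster.
    """
    total = sum(score(x, y) for x in column1 for y in column2)
    return total / (len(column1) * len(column2))


# Made-up pair scores, for illustration only.
toy_scores = {("P", "I"): -2, ("P", "R"): 0, ("A", "I"): -1, ("A", "R"): -2}

print(cluster_site_score(["P", "A"], ["I", "R"], lambda x, y: toy_scores[(x, y)]))
# prints -1.25
```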

  • Problems of progressive alignment algorithms

    Because of the heuristic nature of the progressive multiple alignment algorithm, the global optimum is not guaranteed.

    In particular, errors made in an earlier step cannot be corrected by later steps.

    To avoid the bias caused by aligning very closely related sequences in the earlier steps, Clustal uses a weighted scoring system, i.e., a smaller weight is given to closely related sequences when computing the alignment scores:

    $$
    S[\mathrm{cluster}_1(i), \mathrm{cluster}_2(j)] = \frac{1}{n_1 n_2} \sum_{k=1}^{n_1} \sum_{t=1}^{n_2} w_{k,t} \, S[a_k^1(i), a_t^2(j)],
    $$

    where w_{k,t} is the weight applied to the pair formed by the k-th sequence in cluster 1 and the t-th sequence in cluster 2.

    Sometimes, manual adjustment is needed in the regions that are not aligned very well.

  • Online multiple sequence alignment programs

    The popular multiple sequence alignment programs include:

    1. ClustalX/W: http://www.clustal.org/

    2. T-Coffee: http://www.tcoffee.org/Projects_home_page/t_coffee_home_page.html

    3. MUSCLE: http://www.drive5.com/muscle/

    Since MSA algorithms are still an active research area, new algorithms and programs are expected in the future.

    A recent development is the SATe algorithm, which iteratively constructs the guide tree and the alignment until a convergence criterion is met: http://people.ku.edu/~jyu/sate/sate.html