Page 1
Multiple sequence alignment
Why?
• It is the most important means to assess the relatedness of a set of sequences
• Gain information about the structure/function of a query sequence (conservation patterns)
• Construct a phylogenetic tree
• Put together a set of sequenced fragments (fragment assembly)
• Recognise alternative splice sites
• Many bioinformatics methods depend on it (e.g. secondary/tertiary structure prediction)
Page 2
Multiple sequence alignment (MSA) of 12 flavodoxin sequences + cheY
Page 3
Pairwise alignment
Now we know how to do it. How do we get a multiple alignment (three or more sequences)?
Multiple alignment: much greater combinatorial explosion than with pairwise alignment.
Page 4
Multi-dimensional dynamic programming (Murata et al. 1985)
Page 5
Simultaneous multiple alignment
Multi-dimensional dynamic programming
• MSA (Lipman et al., 1989, PNAS 86, 4412): extremely slow and memory intensive; up to 8-9 sequences of ~250 residues
• DCA (Stoye et al., 1997, CABIOS 13, 625): still very slow
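The combinatorial explosion can be made concrete by counting the cells of the full dynamic programming lattice. The sketch below is only an illustration: the function names are hypothetical, and it counts the naive lattice without the pruning that programs such as MSA or DCA apply.

```python
def dp_cells(lengths):
    """Number of cells in the full N-dimensional DP lattice.

    Each sequence of length n contributes n + 1 layers
    (one extra layer for the leading gap position).
    """
    cells = 1
    for n in lengths:
        cells *= n + 1
    return cells


def predecessors_per_cell(n_sequences):
    """Each cell is reached by advancing any non-empty subset of the sequences."""
    return 2 ** n_sequences - 1


# Lattice size and look-back moves for 2, 3 and 8 sequences of 250 residues.
for n in (2, 3, 8):
    print(n, dp_cells([250] * n), predecessors_per_cell(n))
```

Both factors grow exponentially with the number of sequences, which is why exact simultaneous alignment is limited to a handful of sequences.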
Page 6
Alternative multiple alignment methods
• Biopat (Hogeweg & Hesper 1984, first method ever)
• MULTAL (Taylor 1987)
• DIALIGN (Morgenstern 1996)
• PRRP (Gotoh 1996)
• Clustal (Thompson, Higgins & Gibson 1994)
• Praline (Heringa 1999)
• T-Coffee (Notredame, Higgins & Heringa 2000)
• HMMER (Eddy 1998) [Hidden Markov Model]
• SAGA (Notredame & Higgins 1996) [Genetic algorithm]
Page 7
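Progressive multiple alignment: general principles
[Figure: sequences 1-5 are compared pairwise (Score 1-2, Score 1-3, ..., Score 4-5); the scores fill a 5×5 similarity matrix; scores are converted to distances; the distances give a guide tree; the guide tree drives the multiple alignment. Iteration possibilities are indicated.]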
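The route from pairwise scores to a guide tree can be sketched as follows. This is a minimal illustration under stated assumptions: similarities are taken to lie in [0, 1], the conversion `1 - s` is one simple choice of distance, and the tree is built with average-linkage (UPGMA-style) clustering, which the slide does not prescribe. The scores in the example are invented.

```python
from itertools import combinations


def scores_to_distances(sim):
    """Turn similarities in [0, 1] into distances (assumed conversion: 1 - s)."""
    return {pair: 1.0 - s for pair, s in sim.items()}


def guide_tree(names, dist):
    """Average-linkage clustering; returns the tree as nested tuples."""
    clusters = {name: [name] for name in names}  # subtree -> member leaves

    def average_distance(a, b):
        pairs = [(min(x, y), max(x, y)) for x in clusters[a] for y in clusters[b]]
        return sum(dist[p] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two closest clusters and record the join as a new subtree.
        a, b = min(combinations(clusters, 2), key=lambda ab: average_distance(*ab))
        members = clusters.pop(a) + clusters.pop(b)
        clusters[(a, b)] = members
    return next(iter(clusters))


# Invented similarity scores for five sequences.
similarity = {
    (1, 2): 0.4, (1, 3): 0.9, (1, 4): 0.3, (1, 5): 0.4, (2, 3): 0.4,
    (2, 4): 0.3, (2, 5): 0.8, (3, 4): 0.3, (3, 5): 0.4, (4, 5): 0.3,
}
tree = guide_tree([1, 2, 3, 4, 5], scores_to_distances(similarity))
print(tree)
```

With these scores, sequences 1 and 3 are joined first, then 2 and 5, and the most distant sequence 4 joins last.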
Page 8
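General progressive multiple alignment technique (follow generated tree)
[Figure: guide tree with distance axis d and root; pairs such as 1-3 and 2-5 are aligned first and the growing blocks are merged in tree order until all five sequences (1, 3, 2, 5, 4) are aligned.]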
Page 9
Progressive multiple alignment
Problem:
• Accuracy is very important
• Errors are propagated into the progressive steps
“Once a gap, always a gap” (Feng & Doolittle, 1987)
Page 10
Pair-wise alignment quality versus sequence identity (Vogt et al., JMB 249, 816-831, 1995)
Page 11
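Multiple alignment profiles (Gribskov et al. 1987)
[Figure: profile with one row per amino acid type (A, C, D, ..., W, Y) and one column per alignment position i; the example column holds the values 0.3, 0.1, 0, 0.3, 0.3; an extra row holds the gap penalties (e.g. 0.5, 1.0).]
Position dependent gap penalties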
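A profile can be sketched as a per-column table of residue frequencies. The code below is a simplified illustration and not the Gribskov et al. formulation, which additionally weights the columns with an exchange matrix; the function name and the toy alignment are assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, alphabet=AMINO_ACIDS):
    """Return one {residue: frequency} dict per alignment column.

    Gaps ('-') are ignored when computing the frequencies.
    """
    profile = []
    for col in range(len(alignment[0])):
        residues = [seq[col] for seq in alignment if seq[col] != "-"]
        counts = Counter(residues)
        total = len(residues)
        profile.append(
            {aa: (counts[aa] / total if total else 0.0) for aa in alphabet}
        )
    return profile


# Toy alignment of three short sequences.
profile = build_profile(["ACD", "ACE", "A-D"])
print(profile[2]["D"], profile[2]["E"])
```

A position-dependent gap penalty could be stored alongside each column in the same way, for example lowered where the column already contains gaps.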
Page 12
Profile-sequence alignment
[Figure: a sequence aligned against a profile (rows A, C, D, ..., V, W, Y).]
Page 13
Profile-profile alignment
[Figure: a profile (rows A, C, D, ..., Y) aligned against a second profile (rows A, C, D, ..., V, W, Y).]
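In both cases the dynamic programming cell score compares a profile column with either a residue or another column. A minimal sketch, assuming frequency profiles and a toy exchange function (`toy_exchange` is a hypothetical stand-in for a real matrix such as BLOSUM62):

```python
def toy_exchange(a, b):
    """Hypothetical exchange score: +1 for identity, -1 otherwise."""
    return 1.0 if a == b else -1.0


def score_residue_vs_column(residue, column, exchange=toy_exchange):
    """Profile-sequence: frequency-weighted average exchange score."""
    return sum(freq * exchange(residue, aa) for aa, freq in column.items())


def score_column_vs_column(col1, col2, exchange=toy_exchange):
    """Profile-profile: average exchange score over all residue pairs."""
    return sum(
        f1 * f2 * exchange(a, b)
        for a, f1 in col1.items()
        for b, f2 in col2.items()
    )


conserved = {"A": 1.0}
mixed = {"A": 0.5, "C": 0.5}
print(score_residue_vs_column("A", conserved))  # fully conserved column
print(score_residue_vs_column("A", mixed))      # half of the column agrees
print(score_column_vs_column(conserved, mixed))
```

These cell scores then replace the plain residue-residue exchange score in an otherwise ordinary pairwise dynamic programming run.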
Page 14
Clustal, ClustalW, ClustalX
• CLUSTAL W/X (Thompson et al., 1994) uses the Neighbour Joining (NJ) algorithm (Saitou and Nei, 1987), widely used in phylogenetic analysis, to construct the guide tree.
• Sequence blocks are represented by profiles, in which the individual sequences are additionally weighted according to the branch lengths in the NJ tree.
• Further carefully crafted heuristics include:
– (i) local gap penalties
– (ii) automatic selection of the amino acid substitution matrix
– (iii) automatic gap penalty adjustment
– (iv) a mechanism to delay the alignment of sequences that appear to be distant at the time they are considered
• CLUSTAL (W/X) does not allow iteration (unlike Hogeweg and Hesper, 1984; Corpet, 1988; Gotoh, 1996; Heringa, 1999, 2002)
Page 15
Strategies for multiple sequence alignment
• Profile pre-processing
• Secondary structure-induced alignment
• Globalised local alignment
• Matrix extension
Objective: try to avoid (early) errors
Page 16
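Pre-profile generation
[Figure: sequences 1-5 are compared pairwise (Score 1-2, Score 1-3, ..., Score 4-5). For each sequence, the other sequences scoring above a cut-off are stacked onto it to form a pre-alignment (e.g. master sequence 1 with 2, 3, 4, 5; master sequence 2 with 1, 3, 4, 5; master sequence 5 with 1, 2, 3, 4), and each pre-alignment is converted into a pre-profile (rows A, C, D, ..., Y).]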
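The selection step can be sketched as below: for every master sequence, keep the partners whose pairwise score reaches the cut-off. This is an illustration only; the function name, the score scale and the example values are assumptions, and the actual stacking of the partners onto the master sequence is left out.

```python
def pre_alignment_members(names, pair_scores, cutoff):
    """For each master sequence, list the sequences that enter its pre-alignment.

    pair_scores maps (i, j) with i < j to a pairwise alignment score.
    Partners are ordered from highest to lowest score.
    """
    members = {}
    for master in names:
        partners = []
        for other in names:
            if other == master:
                continue
            score = pair_scores[(min(master, other), max(master, other))]
            if score >= cutoff:
                partners.append((score, other))
        partners.sort(reverse=True)
        members[master] = [master] + [other for _, other in partners]
    return members


# Invented scores for three sequences.
scores = {(1, 2): 0.9, (1, 3): 0.2, (2, 3): 0.5}
groups = pre_alignment_members([1, 2, 3], scores, cutoff=0.4)
print(groups)
```

A high cut-off keeps distant sequences out of each pre-profile, while a cut-off of zero lets every sequence contribute to every pre-profile.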
Page 17
Pre-profile alignment
[Figure: the five pre-profiles (one per sequence 1-5, each with rows A, C, D, ..., Y) are aligned progressively to give the final alignment of sequences 1-5.]
Page 18
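Pre-profile alignment
[Figure: pre-alignments, each headed by one master sequence (1-5) with the other sequences stacked beneath it, are combined into the final alignment of sequences 1-5.]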
Page 19
Strategies for multiple sequence alignment
• Profile pre-processing
• Secondary structure-induced alignment
• Globalised local alignment
• Matrix extension
Objective: try to avoid (early) errors
Page 20
Protein structure hierarchical levels
• PRIMARY STRUCTURE (amino acid sequence)
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
• SECONDARY STRUCTURE (helices, strands)
• TERTIARY STRUCTURE (fold)
• QUATERNARY STRUCTURE (oligomers)
Page 21
One of the dogmas of molecular biology
“Structure is more conserved than sequence”
Page 22
Secondary structure-induced alignment
Page 23
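Using secondary structure for alignment
[Figure: dynamic programming search matrix for two sequences annotated with their secondary structure:
MDAGSTVILCFV
HHHCCCEEEEEE
and
MDAASTILCGS
HHHHCCEEECC
Amino acid exchange weights matrices: separate H (helix), C (coil) and E (strand) matrices, plus a Default matrix.]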
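One way to read the figure is that each cell of the search matrix takes its exchange weights from the matrix that matches the secondary structure of both residues, falling back to the default matrix when the states differ. The sketch below encodes that reading as an assumption; it returns only the matrix label per cell, and the function names are hypothetical.

```python
def pick_matrix(ss_a, ss_b):
    """Label of the exchange matrix for one cell (assumed rule).

    Matching states H, E or C select the state-specific matrix,
    anything else selects the default matrix.
    """
    if ss_a == ss_b and ss_a in ("H", "E", "C"):
        return ss_a
    return "default"


def matrix_choice_grid(ss1, ss2):
    """Matrix label for every cell of the DP search matrix."""
    return [[pick_matrix(a, b) for b in ss2] for a in ss1]


# Secondary structure strings of the two sequences in the figure.
grid = matrix_choice_grid("HHHCCCEEEEEE", "HHHHCCEEECC")
for row in grid:
    print(" ".join(label[0] for label in row))  # H, E, C or d(efault)
```

The printed grid shows diagonal blocks where helix meets helix, coil meets coil and strand meets strand, with the default matrix everywhere else.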
Page 24
Flavodoxin-cheY: using predicted secondary structure
[Figure: multiple alignment, in two blocks, of the flavodoxin sequences 1fx1, FLAV_DESVH, FLAV_DESGI, FLAV_DESSA, FLAV_DESDE, 2fcr, FLAV_ANASP, FLAV_ECOLI, FLAV_AZOVI, FLAV_ENTAG, 4fxn, FLAV_MEGEL and FLAV_CLOAB with cheY (3chy). Each sequence is shown with a secondary structure string beneath it (h = helix, e = strand; the PDB entries 1fx1, 2fcr, 4fxn and 3chy carry additional states such as t, s, b and g).]
Page 25
Strategies for multiple sequence alignment
• Profile pre-processing
• Secondary structure-induced alignment
• Globalised local alignment
• Matrix extension
Objective: try to avoid (early) errors
Page 26
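Globalised local alignment
Double dynamic programming:
1. Local (Smith-Waterman, SW) alignment, using the exchange matrix M and the gap penalties Po (opening) and Pe (extension)
2. Global (Needleman-Wunsch, NW) alignment over the result of step 1, using no M and no Po,e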
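The two passes can be sketched as follows. This is one reading of the two steps, not the published implementation: the first pass fills a Smith-Waterman search matrix (with a single linear gap penalty instead of separate Po and Pe, for brevity), and the second pass runs a global alignment that uses the cell values of that matrix as match scores and charges nothing for gaps. All names and scoring values are assumptions.

```python
def local_matrix(a, b, score, gap):
    """Pass 1: Smith-Waterman search matrix (linear gap penalty for brevity)."""
    H = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            H[i][j] = max(
                0.0,
                H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                H[i - 1][j] - gap,
                H[i][j - 1] - gap,
            )
    return H


def global_over_matrix(H):
    """Pass 2: global alignment over the values in H, without gap penalties."""
    n, m = len(H) - 1, len(H[0]) - 1
    G = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            G[i][j] = max(G[i - 1][j - 1] + H[i][j], G[i - 1][j], G[i][j - 1])
    # Trace back from the corner to recover the matched residue pairs.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        if G[i][j] == G[i - 1][j - 1] + H[i][j]:
            path.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif G[i][j] == G[i - 1][j]:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return G[n][m], path


def identity_score(x, y):
    """Hypothetical stand-in for an exchange matrix such as BLOSUM62."""
    return 2.0 if x == y else -1.0


H = local_matrix("ACD", "ACD", identity_score, gap=2.0)
total, path = global_over_matrix(H)
print(total, path)
```

The gap penalties only shape the local signals in the first pass; the second pass merely connects those signals into one global path.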
Page 27
M = BLOSUM62, Po = 0, Pe = 0
Page 28
M = BLOSUM62, Po = 12, Pe = 1
Page 29
M = BLOSUM62, Po = 60, Pe = 5
Page 30
Strategies for multiple sequence alignment
• Profile pre-processing
• Secondary structure-induced alignment
• Globalised local alignment
• Matrix extension
Objective: try to avoid (early) errors
Page 31
Matrix extension
T-Coffee: Tree-based Consistency Objective Function For alignmEnt Evaluation
Cedric Notredame
Des Higgins
Jaap Heringa
J. Mol. Biol., 302, 205-217; 2000
Page 32
Matrix extension – T-COFFEE
[Figure: all pairwise alignments of four sequences: 1-2, 1-3, 1-4, 2-3, 2-4, 3-4.]
Page 33
Integrating alignment methods and alignment information with T-Coffee
• Integrating different pair-wise alignment techniques (NW, SW, ..)
• Combining different multiple alignment methods (consensus multiple alignment)
• Combining sequence alignment methods with structural alignment techniques
• Plug in user knowledge
Page 34
Using different sources of alignment information
[Figure: alignment sources (Clustal, Dialign, Clustal, Lalign, structure alignments, manual alignments) feeding into T-Coffee.]
Page 35
Search matrix extension
Page 36
T-Coffee
• Combine different alignment techniques by adding scores:
$$W(A(x), B(y)) = \sum S(A(x), B(y))$$
– A(x) is residue x in sequence A
– summation is over the scores S of the global and local alignments containing the residue pair (A(x), B(y))
– S is the sequence identity percentage of the associated alignment
• Combine the direct alignment seqA-seqB with each seqA-seqI-seqB:
$$W'(A(x), B(y)) = W(A(x), B(y)) + \sum_{I \neq A,B} \min\big(W(A(x), I(z)),\ W(I(z), B(y))\big)$$
– Summation over all third sequences I other than A or B
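The two formulas can be sketched directly in code. This is a minimal illustration, not the T-Coffee implementation: residues are represented as hypothetical `(sequence name, position)` tuples, each input alignment is reduced to its identity percentage plus its list of aligned residue pairs, and the example values are invented.

```python
from collections import defaultdict


def primary_weights(alignments):
    """W: sum the identity S of every alignment containing a residue pair.

    alignments is a list of (identity_percent, [(residue, residue), ...]).
    """
    W = defaultdict(float)
    for identity, pairs in alignments:
        for p, q in pairs:
            W[frozenset((p, q))] += identity
    return W


def extend(W):
    """W': add min(W(A(x), I(z)), W(I(z), B(y))) for every intermediate I(z)."""
    neighbours = defaultdict(dict)
    for pair, weight in W.items():
        p, q = tuple(pair)
        neighbours[p][q] = weight
        neighbours[q][p] = weight
    extended = dict(W)
    for linked in neighbours.values():
        residues = list(linked)
        for i, p in enumerate(residues):
            for q in residues[i + 1:]:
                if p[0] == q[0]:
                    continue  # both residues lie in the same sequence
                key = frozenset((p, q))
                extended[key] = extended.get(key, 0.0) + min(linked[p], linked[q])
    return extended


# Three pairwise alignments that all pair up residue 0 of sequences A, B and C.
a0, b0, c0 = ("A", 0), ("B", 0), ("C", 0)
W = primary_weights([(88, [(a0, b0)]), (77, [(a0, c0)]), (100, [(c0, b0)])])
W_ext = extend(W)
print(W_ext[frozenset((a0, b0))])
```

Here the direct weight of 88 for the pair A(0)-B(0) gains min(77, 100) = 77 from the path through C, so a pair supported by a third sequence ends up with a higher weight than the direct alignment alone provides.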
Page 37
T-Coffee
Direct alignment
Other sequences
Page 38
Search matrix extension
Page 39
Evaluating multiple alignments
• Conflicting standards of truth: evolution, structure, function
• With orphan sequences no additional information
• Benchmarks depending on reference alignments
• Quality issue of available reference alignment databases
• Different ways to quantify agreement with reference alignment (sum-of-pairs, column score)
• “Charlie Chaplin” problem
Page 40
Evaluating multiple alignments
As a standard of truth, often a reference alignment based on structural superpositioning is taken
Page 41
Evaluation measures
[Figure: a query alignment compared with a reference alignment.]
• Column score
• Sum-of-Pairs score
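Both measures can be sketched as set overlaps between the query and the reference alignment. The definitions below are the commonly used ones (fraction of reference residue pairs, respectively reference columns, reproduced by the query) and are assumed here, since the slide gives no formulas; the function names and the example are hypothetical.

```python
def aligned_pairs(alignment):
    """Residue pairs ((seq, pos), (seq, pos)) that share a column."""
    counters = [0] * len(alignment)
    pairs = set()
    for col in range(len(alignment[0])):
        present = []
        for s, row in enumerate(alignment):
            if row[col] != "-":
                present.append((s, counters[s]))
                counters[s] += 1
        for i in range(len(present)):
            for j in range(i + 1, len(present)):
                pairs.add((present[i], present[j]))
    return pairs


def columns(alignment):
    """Each column as a tuple of residue positions (None for a gap)."""
    counters = [0] * len(alignment)
    cols = set()
    for col in range(len(alignment[0])):
        column = []
        for s, row in enumerate(alignment):
            if row[col] != "-":
                column.append(counters[s])
                counters[s] += 1
            else:
                column.append(None)
        cols.add(tuple(column))
    return cols


def sum_of_pairs_score(query, reference):
    """Fraction of reference residue pairs that the query also aligns."""
    ref = aligned_pairs(reference)
    return len(aligned_pairs(query) & ref) / len(ref)


def column_score(query, reference):
    """Fraction of reference columns reproduced exactly by the query."""
    ref = columns(reference)
    return len(columns(query) & ref) / len(ref)


reference = ["ACD", "ACD"]
query = ["ACD-", "A-CD"]
print(sum_of_pairs_score(query, reference), column_score(query, reference))
```

The column score is the stricter of the two on larger alignments, because one misplaced residue spoils the whole column while costing only some of its residue pairs.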
Page 42
Evaluating multiple alignments
[Plot: axes labelled SP and BAliBASE alignment nseq * len.]
Page 43
Summary
• Weighting schemes simulating simultaneous multiple alignment
– Profile pre-processing (global/local)
– Matrix extension (well balanced scheme)
• Smoothing alignment signals
– Globalised local alignment
• Using additional information
– Secondary structure driven alignment
• Schemes strike a balance between speed and sensitivity
Page 44
References
• Heringa, J. (1999) Two strategies for sequence comparison: profile-preprocessed and secondary structure-induced multiple alignment. Comput. Chem., 23, 341-364.
• Notredame, C., Higgins, D.G., Heringa, J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205-217.
• Heringa, J. (2002) Local weighting schemes for protein multiple sequence alignment. Comput. Chem., 26(5), 459-477.
Page 45
Where to find this: http://www.ibivu.cs.vu.nl/teaching