Multiple Sequence Alignment (MSA)
1. Uses of MSA
2. Technical difficulties
1. Select sequences
2. Select objective function
3. Optimize the objective function
1. Exact algorithms
2. Progressive algorithms
3. Iterative algorithms
1. Stochastic
2. Non-stochastic
4. Consistency-based algorithms
3. Tools to view alignments
1. MEGA
2. JALVIEW
(PSI-BLAST)
Function prediction
Fig. from Boris Steipe, U. of Toronto
Sequence relationships
If the MSA is incorrect, the
above inferences are incorrect!
Chapter 12
Multiple sequence alignment: definition
• a collection of three or more protein (or nucleic acid) sequences that are partially or completely aligned
• homologous residues are aligned in columns across the length of the sequences (see the sketch below)
• residues are homologous in an evolutionary sense
• residues are homologous in a structural sense
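As a minimal illustration (the sequences below are invented, not from the slides), an aligned block can be treated as a list of equal-length gapped strings, and each column read off to see which residues the alignment places as homologous:

```python
# Toy alignment: three invented protein fragments, gaps written as '-'.
msa = [
    "MKV-LSA",
    "MKVQLSA",
    "MRV-LTA",
]

# Residues that share a column are the ones the alignment asserts to be
# homologous; a column is conserved if all non-gap residues agree.
for i, column in enumerate(zip(*msa)):
    residues = set(column) - {"-"}
    status = "conserved" if len(residues) == 1 else ""
    print(i, "".join(column), status)
```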
Fig.: the same region aligned by ClustalW, Praline, MUSCLE, ProbCons, and T-Coffee. Note how the region of a conserved histidine (▼) varies depending on which algorithm is used.
Multiple sequence alignment: properties
• not necessarily one “correct” alignment of a protein family
• Individual weights are assigned to sequences; very closely related sequences are given less weight, while distantly related sequences are given more weight
• Scoring matrices are varied depending on the presence of conserved or divergent sequences, e.g. (see the sketch after this list):
  PAM20: 80-100% identity
  PAM60: 60-80% identity
  PAM120: 40-60% identity
  PAM350: 0-40% identity
• Residue-specific gap penalties are applied
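A hedged sketch of the matrix-switching idea above, using the identity ranges from the slide; the function name and hard thresholds are illustrative, not taken from any particular program:

```python
def pick_pam_matrix(percent_identity: float) -> str:
    """Choose a PAM matrix from the identity ranges listed on the slide."""
    if percent_identity >= 80:
        return "PAM20"    # 80-100% identity: very closely related
    if percent_identity >= 60:
        return "PAM60"    # 60-80% identity
    if percent_identity >= 40:
        return "PAM120"   # 40-60% identity
    return "PAM350"       # 0-40% identity: distantly related

print(pick_pam_matrix(72))  # -> PAM60
```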
Iterative algorithms
Recurrent modifications of suboptimal solutions
Stochastic Iterative Algorithms
SAGA
• Uses a ‘Genetic Algorithm’
• Can use different objective functions (e.g. Coffee)
• Mutations randomly insert or shift gaps
• Sequences can recombine
• Sequences evolve, higher OF scores survive
GAs and HMMs have proved rather disappointing for ab initio alignments.
Better: pre-compute the MSA with another program and then use these methods for optimization.
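A minimal, self-contained sketch of the genetic-algorithm loop described in the SAGA bullets above (random gap insertion as mutation, survival by objective-function score); the scoring function, parameters, and the omission of recombination are simplifications for illustration, not SAGA's actual implementation:

```python
import random

def sum_of_pairs(aln):
    """Toy objective function: count identical residue pairs per column."""
    score = 0
    for col in zip(*aln):
        residues = [c for c in col if c != "-"]
        score += sum(a == b for i, a in enumerate(residues)
                     for b in residues[i + 1:])
    return score

def mutate(aln):
    """Insert a gap at a random position of one sequence, pad the rest."""
    aln = list(aln)
    i = random.randrange(len(aln))
    pos = random.randrange(len(aln[i]) + 1)
    aln[i] = aln[i][:pos] + "-" + aln[i][pos:]
    width = max(map(len, aln))
    return [s.ljust(width, "-") for s in aln]

def evolve(seqs, generations=200, pop_size=20):
    population = [mutate(seqs) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=sum_of_pairs, reverse=True)
        survivors = population[: pop_size // 2]          # selection by OF score
        population = survivors + [mutate(random.choice(survivors))
                                  for _ in range(pop_size - len(survivors))]
    return max(population, key=sum_of_pairs)

print(*evolve(["MKVLSA", "MKVQLSA", "MRVLTA"]), sep="\n")
```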
Evolution of a seq. alignment by recombination
Compatible ends
Consistency-based Algorithms
T-Coffee (Tree-based Consistency Objective Function For alignmEnt Evaluation)
Version 2.00 and higher can mix sequences and structures.
Local and global pair-wise alignments can come from different programs and can be redundant.
The extended library (EL) is a position-specific substitution matrix where the score associated with each pair of residues depends on its compatibility with the rest of the library. This library replaces the mutation data matrix used in ClustalW.
• Pair-wise distances are computed
• A neighbor-joining tree is estimated
• Sequences are aligned progressively following the topology of the tree
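A minimal sketch of the last step, assuming the guide tree is already available; the nested-tuple tree below is invented, and the profile-profile alignment performed at each merge is only indicated, not implemented:

```python
# Invented guide tree as nested tuples; leaves are sequence names.
guide_tree = (("seqA", "seqB"), ("seqC", ("seqD", "seqE")))

def progressive_order(node, steps):
    """Walk the guide tree bottom-up and record the order in which groups
    of sequences would be merged (each merge = one alignment step)."""
    if isinstance(node, str):          # leaf: a single sequence
        return [node]
    left = progressive_order(node[0], steps)
    right = progressive_order(node[1], steps)
    steps.append((left, right))        # alignment step at this internal node
    return left + right

steps = []
progressive_order(guide_tree, steps)
for left, right in steps:
    print("align", left, "with", right)
```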
Benchmark tests (from Notredame 2002)
ClustalW performed well on reference sets 1-3, but poorly on 4-5, where long internal or terminal gaps are required.
When large gaps are required, T-Coffee and DIALIGN perform better.
Summary strategies
Alignment Editors
Jalview
• Written in Java
• Input MSF, aligned FASTA
• ClustalW alignment
• Interactive alignment editor
• Multiple color schemes
• Can divide in sub-families
• Produces UPGMA, Neighbor-
joining trees and Principal
Component Analysis
• Incorporates information from feature tables
• Incorporates structural information
Alignment Editors
Alignment Visualization
Multiple Sequence Alignment
At the end of the day ...
• Use more than one alignment method.
• Make sure you have the right sequences.
• Don't align parts of sequences that can't be aligned (because they are not homologous).
• Be aware of the problems posed by multi-domain proteins.
• Above all, use your common sense.
Multiple Sequence Alignment
From Boris Steipe, Univ. of Toronto
Hidden Markov Models
Mount, Bioinformatics: Sequence and Genome Analysis, p. 205.
The model accommodates the identities, mismatches, insertions, and deletions expected in a group of related proteins.
(A) MSA: Each column may include matches and mismatches (red positions), insertions (green positions), and deletions (purple positions).
(B) Each column in the model represents the possibility of a match, insert, or delete in each column of the alignment in A. The HMM is a probabilistic representation of the MSA. Sequences can be generated from the HMM by starting at the beginning state labeled BEG and then by following any one of many pathways from one type of sequence variation to another (states) along the state transition arrows and terminating in the ending state labeled END. Any sequence can be generated by the model and each pathway has a probability associated with it. Each square match state stores an amino acid distribution such that the probability of finding an amino acid depends on the frequency of that amino acid within that match state. Each diamond-shaped insert state produces random amino acid letters for insertions between aligned columns and each circular delete state produces a deletion in the alignment with probability 1.
One of many ways of generating the sequence N K Y L T in the above profile is the path BEG -> M1 -> I1 -> M2 -> M3 -> M4 -> END. Each transition has an associated probability, and the sum of the probabilities of transitions leaving each state is 1. The average value of a transition would thus be 0.33, since there are three transitions from most states (there are only two from M4 and D4, hence the average from them is 0.5). For example, if a match state contains a uniform distribution across the 20 amino acids, the probability of any amino acid is 0.05. Using these average values of 0.33 or 0.5 for the transition values and 0.05 for the probability of each amino acid in each state, the probability of the above sequence N K Y L T is the product of all of the transition probabilities in the path and the probability that each state will produce the corresponding amino acid in the sequence: 0.33 × 0.05 × 0.33 × 0.05 × 0.33 × 0.05 × 0.33 × 0.05 × 0.33 × 0.05 × 0.5 = 6.1 × 10^-10. Since these probabilities are very small numbers, probabilities are converted to log odds scores, and the logarithms are added to give the overall probability score.
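The arithmetic above can be checked directly; the snippet below multiplies the stated transition and emission values for the BEG -> M1 -> I1 -> M2 -> M3 -> M4 -> END path and also shows the equivalent sum of logarithms (the conversion to log odds against a background model works the same way):

```python
import math

# Path BEG -> M1 -> I1 -> M2 -> M3 -> M4 -> END for the sequence N K Y L T,
# using the averaged values from the text: 0.33 per transition (0.5 for the
# final transition out of M4) and 0.05 per emitted amino acid.
transitions = [0.33, 0.33, 0.33, 0.33, 0.33, 0.5]   # six transitions
emissions = [0.05] * 5                              # five emitted residues

p = 1.0
for x in transitions + emissions:
    p *= x
print(f"path probability ~ {p:.1e}")                # ~6.1e-10

# Working in log space avoids underflow for longer sequences.
log_p = sum(math.log(x) for x in transitions + emissions)
print(f"log probability = {log_p:.2f}")
```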
The secret of the HMM is to adjust the transition values and the distributions in each state by training the model with the sequences. The training involves finding every possible pathway through the model that can produce the sequences, counting the number of times each transition is used and which amino acids were required by each match and insert state to produce the sequences. This training procedure leaves a memory of the sequences in the model. As a consequence, the model will be able to give a better prediction of the sequences. Once the model has been adequately trained, of all the possible paths through the model that can generate the sequence N K Y L T, the most probable should be the match, insert, then three matches combination (as opposed to any other combination of matches, inserts, and deletions). Likewise, the other sequences in the alignment would also be predicted with highest probability as they appear in the alignment; i.e., the last sequence would be predicted with highest probability by the path match-match-delete-match. In this fashion, the trained HMM provides a multiple sequence alignment, such as shown in A. For each sequence, the objective is to infer the sequence of states in the model that generates it. The generated sequence is a Markov chain because the next state is dependent on the current one. Because the actual sequence information is hidden within the model, the model is described as a hidden Markov model.
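A toy sketch of the counting idea behind training, assuming (unrealistically) that the state path for each training sequence is already known; the sequences, paths, and state names below are invented, and real training sums over all possible paths (e.g. Baum-Welch), which is not shown:

```python
from collections import Counter, defaultdict

# Hypothetical training data: (state path, sequence) pairs in which the
# path through the model is assumed known. Invented for illustration.
training = [
    (["BEG", "M1", "M2", "M3", "END"], "NKL"),
    (["BEG", "M1", "I1", "M2", "M3", "END"], "NQKL"),
]

transition_counts = defaultdict(Counter)
emission_counts = defaultdict(Counter)

for path, seq in training:
    residues = iter(seq)
    for prev, state in zip(path, path[1:]):
        transition_counts[prev][state] += 1
        if state.startswith(("M", "I")):        # match/insert states emit
            emission_counts[state][next(residues)] += 1

# Normalise counts into probabilities.
transitions = {s: {t: n / sum(c.values()) for t, n in c.items()}
               for s, c in transition_counts.items()}
print(transitions["M1"])   # e.g. {'M2': 0.5, 'I1': 0.5}
```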
Multiple sequence alignment to profile HMMs
► Hidden Markov models (HMMs) consist of "states" that describe the probability of having a particular amino acid residue at a given column of a multiple sequence alignment
► HMMs are probabilistic models
► HMMs may give more sensitive alignments than traditional techniques such as progressive alignment
Simple Hidden Markov Model
Observation: YNNNYYNNNYN
(Y=goes out, N=doesn’t go out)
What is the underlying reality (the hidden state chain)?
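One standard way to answer this is the Viterbi algorithm, which recovers the most probable hidden state chain; the sketch below uses two invented hidden states ("Sunny", "Rainy") and made-up transition/emission probabilities, since the slide only gives the observations:

```python
# Minimal Viterbi decoding for the observation string from the slide.
# The hidden states and all probabilities are invented for illustration.
obs = "YNNNYYNNNYN"                  # Y = goes out, N = doesn't go out
states = ["Sunny", "Rainy"]
start = {"Sunny": 0.5, "Rainy": 0.5}
trans = {"Sunny": {"Sunny": 0.7, "Rainy": 0.3},
         "Rainy": {"Sunny": 0.4, "Rainy": 0.6}}
emit = {"Sunny": {"Y": 0.8, "N": 0.2},
        "Rainy": {"Y": 0.1, "N": 0.9}}

# viterbi[t][s] = (best probability of a path ending in state s at time t,
#                  previous state on that best path)
viterbi = [{s: (start[s] * emit[s][obs[0]], None) for s in states}]
for o in obs[1:]:
    row = {}
    for s in states:
        best_prev = max(states, key=lambda p: viterbi[-1][p][0] * trans[p][s])
        row[s] = (viterbi[-1][best_prev][0] * trans[best_prev][s] * emit[s][o],
                  best_prev)
    viterbi.append(row)

# Trace back the most probable hidden state chain.
last = max(states, key=lambda s: viterbi[-1][s][0])
path = [last]
for row in reversed(viterbi[1:]):
    path.append(row[path[-1]][1])
path.reverse()
print("".join(s[0] for s in path))   # one letter per day: S = Sunny, R = Rainy
```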