Alignment-based methods
• Needed if we have an unknown DNA or protein sequence.
• Purpose:
  To find sequences/regions of significant similarity in a sequence repository or database.
  To identify all of the homologous sequences in a database or repository.
  To identify motifs or domains with a sequence similarity that is significantly better than chance expectation.
Local alignment
Finds domains and short regions of similarity between a pair of sequences. The two sequences under comparison do not need to have high levels of similarity over their entire length in order to receive locally high similarity scores. This makes local similarity searches useful when looking for domains within proteins or for regions of genomic DNA that contain introns. Local similarity searches do not have the constraint that similarity between two sequences needs to be observed over the entire length of each gene.
Global alignment
Finds the optimal alignment over the entire length of the two sequences under comparison. Algorithms of this nature are not particularly suited to the identification of genes that have evolved by recombination or insertion of unrelated regions of DNA. In such instances, the global similarity score will be greatly reduced. Where the genes being aligned are of comparable length and are homologous (descended from a common ancestor) over their entire length, global alignment might be considered appropriate.
Terminology
• Exact (Exhaustive):
  This is a method of looking at all possibilities for a particular problem and then choosing the best one. It is the most rigorous method.
• Heuristic:
  This class of methods takes short-cuts and attempts to arrive at an optimal solution by making educated guesses.
Needleman-Wunsch
Exact global alignment method.
Not particularly good in many cases (database searches, looking for small regions of similarity, alignment of sequences with vastly differing lengths), but the most rigorous and thorough method if the task is to align sequences that have not evolved by exon shuffling, domain insertion/deletion etc. In other words, it is the best method if you have sequences that are of ‘similar’ length and have evolved from a common ancestor by point processes (point mutation, small indels).
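A minimal sketch of exhaustive global alignment by dynamic programming may help make this concrete. The function name and the scoring values (match 1, mismatch 0, gap -1) are illustrative assumptions, not a reference implementation; real tools use substitution matrices and more elaborate gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment.

    Scoring values are illustrative assumptions.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to the top-left corner.
    i, j = n, m
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Because every cell of the table is filled, the run time grows with the product of the two sequence lengths, which is one reason the method is poorly suited to database searches.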
Smith-Waterman
Exact local alignment
There is no requirement for the alignment to extend along the entirety of the sequences. This is a very good algorithm for database searching, multiple alignment and pairwise alignment.
It is exhaustive and can be very slow (compared to the heuristics described later). The difference between this and the N-W algorithm is that alignments starting at all possible positions must be considered, not just the ones that start at the beginning and end at the end.
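The difference from the global method can be sketched as follows: cell values are never allowed to fall below zero, and the traceback starts at the highest-scoring cell rather than the corner. The function name and scoring values (match 1, mismatch -1, gap -1) are assumptions chosen for illustration only.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for the best local alignment.

    Scoring values are illustrative assumptions.
    """
    n, m = len(a), len(b)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment start afresh at any position.
            h[i][j] = max(0,
                          h[i - 1][j - 1] + sub,
                          h[i - 1][j] + gap,
                          h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

For example, `smith_waterman("XXXACGTXXX", "YYACGTYY")` picks out the shared `ACGT` region and ignores the unrelated flanks.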
Commonly-used search algorithms
Algorithm          Exhaustive?   Gaps?   Local/Global   Multiple alignment   Database searches
Needleman-Wunsch   Yes           Yes     Global         √                    X
Smith-Waterman     Yes           Yes     Local          √                    √
FASTA              No            Yes     Local          √                    √
BLAST              No            No      Local          X                    √
FastA algorithm
Pearson, W. R. (1996). Effective protein sequence comparison. Academic Press Inc. 227-258.
Pearson, W. R. and Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85: 2444-2448.
• Firstly, regions of identity are identified between the query and database sequences (KTUP).
• Then the genes with the highest density of matching ‘hits’ are re-examined.
• The alignments are extended at either end of the matching regions, and mismatches and indels are incorporated according to a scoring matrix.
• The sequence alignment then gets a score (sometimes a match is 1, a mismatch is 0 and a gap is -1).
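The first two steps above can be sketched as a word lookup followed by a count of hits per diagonal. This is only an illustration under stated assumptions: the function name is hypothetical, and real FASTA goes on to rescore and join the best diagonal regions, which is not shown.

```python
from collections import Counter, defaultdict


def ktup_diagonal_hits(query, subject, ktup=2):
    """Count exact k-tuple matches on each diagonal (query_pos - subject_pos)."""
    # Index every k-tuple of the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)
    # Look up every k-tuple of the database sequence and record its diagonal.
    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            diagonals[i - j] += 1
    return diagonals


hits = ktup_diagonal_hits("MKTAYIAK", "GGMKTAYI", ktup=2)
densest_diagonal, hit_count = hits.most_common(1)[0]
```

A diagonal with many hits marks a region where the two sequences match without gaps, so database sequences whose densest diagonals carry the most hits are the ones worth re-examining.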
PAM 250 matrix
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   2
R  -2   6
N   0   0   2
D   0  -1   2   4
C  -2  -4  -4  -5  12
Q   0   1   1   2  -5   4
E   0  -1   1   3  -5   2   4
G   1  -3   0   1  -3  -1   0   5
H  -1   2   2   1  -3   3   1  -2   6
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F  -4  -4  -4  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P   1   0  -1  -1  -3   0  -1  -1   0  -2  -3  -1  -2  -5   6
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4
FastA algorithm
• Is the alignment significant?
• Could we see an alignment like this purely by chance?
• What are the statistics involved?
BLAST
• Segment pair - This is a pair of subsequences of the same length that form an ungapped alignment.
• BLAST searches for all segment pairs between the query sequence and all of the sequences in the database (above a certain threshold).
• HSPs (high-scoring segment pairs) are derived by first finding the pairs that satisfy the threshold (T) conditions. Then the alignment is extended in both directions until the quality of the alignment drops off dramatically or falls to zero.
• The HSPs are then sorted according to their score.
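The extension step can be sketched as follows, assuming a seed word has already been found at known positions. The function name, the scores (match 1, mismatch -1) and the `drop` cut-off are illustrative assumptions; BLAST itself scores columns with a substitution matrix and uses its own drop-off parameter.

```python
def extend_hit(query, subject, qi, sj, word=3, match=1, mismatch=-1, drop=2):
    """Extend a seed at query[qi:qi+word] / subject[sj:sj+word] without gaps.

    Returns (score, query_start, subject_start, length) of the segment pair.
    Extension in a direction stops once the running score has fallen
    `drop` or more below the best score seen so far.
    """
    def col_score(x, y):
        return match if x == y else mismatch

    best = sum(col_score(query[qi + k], subject[sj + k]) for k in range(word))

    def extend(step, q, s, best):
        cur, kept, k = best, 0, 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            cur += col_score(query[q], subject[s])
            k += 1
            if cur > best:
                best, kept = cur, k      # keep the columns up to this point
            elif best - cur >= drop:
                break                    # quality has dropped off: stop
            q += step
            s += step
        return best, kept

    best, right = extend(+1, qi + word, sj + word, best)
    best, left = extend(-1, qi - 1, sj - 1, best)
    return best, qi - left, sj - left, left + word + right


score, q_start, s_start, length = extend_hit(
    "XXHELLOWORLDXX", "YYYHELLOWORLDYYY", qi=5, sj=6)
```

Here the seed `LOW` grows into the full shared segment `HELLOWORLD`, and the extension is abandoned once it runs into the unrelated flanking residues.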
Gapped BLAST
• The original BLAST suffered from the limitation of not being able to introduce gaps into the alignment.
• Gapped BLAST is an effort to circumvent this shortcoming.
• Experience shows that often several ungapped non-overlapping alignments result from a match to a single database entry.
• Intuitively, we know that it probably makes sense to generate a single alignment of the query and database sequences.
• Gapped BLAST seeks only 1 (instead of all) of the significant ungapped alignments between query and database sequence.
• This speeds up the process.
Two-Hit Method
• Find 2 HSPs within a distance m of each other on the same diagonal.
• Do not attempt any HSP extension unless you find two regions that meet this criterion.
• Attempt to generate a single gapped alignment in this region.
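The trigger condition in the first two bullets can be sketched as a grouping of hits by diagonal. The function name, the `(query_pos, subject_pos)` hit representation and the rule that the two hits must not overlap are assumptions made for this illustration.

```python
from collections import defaultdict


def two_hit_triggers(hits, max_dist, word=3):
    """Return (diagonal, first_query_pos, second_query_pos) for each pair of
    neighbouring hits that lie on the same diagonal, do not overlap, and
    are at most max_dist apart.

    hits: iterable of (query_pos, subject_pos) word-hit start positions.
    """
    by_diagonal = defaultdict(list)
    for q, s in hits:
        by_diagonal[q - s].append(q)
    triggers = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        # Only neighbouring hits on a diagonal are compared in this sketch.
        for prev, cur in zip(positions, positions[1:]):
            if word <= cur - prev <= max_dist:
                triggers.append((diagonal, prev, cur))
    return triggers


triggers = two_hit_triggers([(5, 10), (20, 25), (7, 30), (100, 105)], max_dist=40)
```

Only the first two hits qualify: the third lies on a different diagonal and the fourth is too far from its neighbour, so neither would trigger an extension.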
How does this affect the process of searching a database?
1. The threshold for identifying HSPs can be lowered (finding more HSPs and therefore slowing the process).
2. Fewer extensions are triggered (speeds up the process).
PSI-BLAST
• One of a family of 'profile' searches.
• Reweights amino acids in the alignment.
• Performs an initial BLAST search.
• Select those ‘hits’ that appear to be significant (above a certain threshold).
• Use the alignment of these sequences to identify possible 'important' residues.
• Similar sequences contain almost identical information.
• Distant relatives contain more information (if an amino acid residue is conserved in a distant relative, then it must be important!?).
• PSI-BLAST takes into account the similarity of the 'hits' when identifying important residues.
• Reweigh those ‘important’ residues.
• Repeat the BLAST search, but this time giving an increased weight to the important residues.
• This process can be repeated ad infinitum, although usually 2 or 3 iterations will suffice.
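The reweighting idea can be illustrated with a very simple position-specific profile built from aligned hits. This is a sketch under strong assumptions: a uniform background, a fixed pseudocount and no sequence weighting, so it does not account for how similar the hits are to one another. The function names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, pseudocount=1.0):
    """Build one log-odds score per residue for every alignment column.

    aligned: equal-length aligned sequences; gap characters are ignored.
    Assumes a uniform background frequency (an illustrative simplification).
    """
    background = 1.0 / len(ALPHABET)
    profile = []
    for col in range(len(aligned[0])):
        counts = Counter(seq[col] for seq in aligned if seq[col] in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile


def score_with_profile(profile, seq):
    """Score an ungapped sequence of the same length against the profile."""
    return sum(column[aa] for column, aa in zip(profile, seq))


profile = build_profile(["ACDW", "ACEW", "ACDW"])
```

Residues that are conserved in a column receive a high score there, so a repeat search with the profile rewards matches at those positions more than a general-purpose matrix would.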
PSI-BLAST
• Advantages:
  Identify more distant relatives.
  Faster than more exact methods.
  Does not require a priori knowledge of the important residues.
• Disadvantages:
  Can be misleading if an unrelated sequence is involved in reweighing the residues.
  Not very reliable unless the initial BLAST search is capable of identifying homologues.
Similarity scoring matrices.
Not all Nucleotide/Amino Acid substitutions occur at equal
frequencies. (Transitions are usually more frequent than
transversions).
For DNA sequences it is usual to use a matrix that scores matches
in a positive way and mismatches in a negative way. Matching a
known base with one of the ambiguity codes (IUPAC nomenclature)
is usually treated less severely than a mismatch with a known base.
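A DNA scoring function along these lines might look like the sketch below. The IUPAC code table is standard; the function name and all score values are illustrative assumptions, as is the choice to score an incompatible ambiguity code like a transversion.

```python
# IUPAC nucleotide codes and the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
PURINES = set("AG")


def dna_score(x, y, match=5, transition=-2, transversion=-4, ambiguous=-1):
    """Score one aligned pair of nucleotide symbols (values are assumptions)."""
    bases_x, bases_y = set(IUPAC[x]), set(IUPAC[y])
    if len(bases_x) == 1 and len(bases_y) == 1:
        if x == y:
            return match
        # Purine<->purine or pyrimidine<->pyrimidine is a transition.
        same_class = (x in PURINES) == (y in PURINES)
        return transition if same_class else transversion
    # At least one ambiguity code: penalise mildly if the codes are compatible.
    return ambiguous if bases_x & bases_y else transversion
```

With these values, `A` against `G` (a transition) is penalised less than `A` against `C` (a transversion), and `A` against the compatible code `R` is penalised least of all.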
Scoring Matrices
For protein sequences, the matrix can either be a PAM, BLOSUM
or Gonnet matrix.
These have been empirically determined and have been calculated
by the direct comparison of related protein sequences.
In general, amino acid substitutions that are seen to occur very
rarely are given a negative value.
Conservative substitutions (for instance an isoleucine for a
leucine) are given a positive value. Identical matches are also
given a positive score.
Identical Residues
The values of the identical matches vary and are dependent on
whether the amino acids are common or rare. In the PAM 250
matrix, a tryptophan:tryptophan match is given a score of 17
(tryptophan being a relatively rare amino acid), whilst a
serine:serine match is only given a score of 2 (serine is usually
abundantly represented in protein sequences). Identical matches
between common amino acids score lower because common amino acids
have a higher probability of aligning together by chance.
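As a small worked example, an ungapped alignment can be scored by looking up each aligned pair in the matrix. The handful of entries below are copied from the PAM 250 table in this document; the function names are hypothetical and a full implementation would load the complete matrix.

```python
# A few PAM 250 entries taken from the table in this document.
PAM250_SUBSET = {
    ("W", "W"): 17,
    ("S", "S"): 2,
    ("I", "I"): 5,
    ("L", "L"): 6,
    ("I", "L"): 2,
    ("C", "W"): -8,
}


def pair_score(x, y):
    """Look up a pair in either order, since the matrix is symmetric."""
    if (x, y) in PAM250_SUBSET:
        return PAM250_SUBSET[(x, y)]
    return PAM250_SUBSET[(y, x)]


def ungapped_score(a, b):
    """Sum the matrix scores over the columns of an ungapped alignment."""
    return sum(pair_score(x, y) for x, y in zip(a, b))


total = ungapped_score("WIS", "WLS")  # 17 (W:W) + 2 (I:L) + 2 (S:S)
```

The single tryptophan match contributes far more to the total than the serine match, which reflects how much less likely a rare residue is to pair up by chance.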
Significance of the similarity of two sequences
• How can you know if two sequences show a higher degree of similarity than could be expected by chance?
  Similarity could be due to similar base/AA biases.
  Similarity could be due to sequence simplicity.
Randomisation test
• Align the two sequences, record their score.
• Hold one sequence in its original form and randomise the order of the residues in the other sequence, record the score.
• Repeat many (1,000) times.
• The original score should be a better score than any score from the randomised data.
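The procedure above can be sketched in a few lines. The function names are hypothetical, and the simple identity count stands in for whatever alignment score is actually being tested; any function that takes two sequences and returns a score could be passed instead.

```python
import random


def ungapped_identity(a, b):
    """Stand-in score: number of identical positions, with no gaps allowed."""
    return sum(x == y for x, y in zip(a, b))


def randomisation_test(a, b, score_fn=ungapped_identity, n=1000, seed=0):
    """Return (observed_score, fraction of shuffles scoring at least as well).

    One sequence is held fixed while the residues of the other are shuffled.
    The fraction counts the observed score itself, so it is never zero.
    """
    rng = random.Random(seed)
    observed = score_fn(a, b)
    residues = list(b)
    as_good = 0
    for _ in range(n):
        rng.shuffle(residues)
        if score_fn(a, "".join(residues)) >= observed:
            as_good += 1
    return observed, (as_good + 1) / (n + 1)


observed, fraction = randomisation_test("ACDEFGHIKLMNPQRSTVWY",
                                        "ACDEFGHIKLMNPQRSTVWY")
simple_observed, simple_fraction = randomisation_test("AAAA", "AAAA")
```

Shuffling preserves composition, so the test controls for base/AA bias. The second call shows the effect of sequence simplicity: every shuffle of `AAAA` scores as well as the original, so the apparent similarity is not significant.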