BLAST

Alignment-based methods

• Needed if we have an unknown DNA or protein sequence.

• Purpose:

To find sequences or regions of significant similarity in a sequence repository or database.

To identify all of the homologous sequences in a database or repository.

To identify motifs or domains with a sequence similarity that is significantly better than chance expectation.

Local alignment

Finds domains and short regions of similarity between a pair of sequences. The two sequences under comparison do not need to be highly similar over their entire length in order to receive locally high similarity scores. This makes local similarity searches useful when looking for domains within proteins or for regions of genomic DNA that contain introns. Local similarity searches are not constrained by the requirement that similarity be observed over the entire length of each gene.

Global alignment

Finds the optimal alignment over the entire length of the two sequences under comparison. Algorithms of this nature are not particularly suited to the identification of genes that have evolved by recombination or by insertion of unrelated regions of DNA; in such cases the global similarity score will be greatly reduced. Global alignment is appropriate when the genes being aligned are of comparable length and are homologous (descended from a common ancestor) along their entire length.

Terminology

• Exact (Exhaustive): This is a method of looking at all possibilities for a particular problem and then choosing the best one. It is the most rigorous method.

• Heuristic: This class of methods takes short-cuts and attempts to arrive at an optimal solution by making educated guesses.

Needleman-Wunsch

Exact global alignment method.

Not particularly good in many cases (database searches, looking for small regions of similarity, alignment of sequences with vastly differing lengths), but the most rigorous and thorough method if the task is to align sequences that have not evolved by exon shuffling, domain insertion/deletion etc. In other words, it is the best method if you have sequences that are of 'similar' length and have evolved from a common ancestor by point processes (point mutation, small indels).
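The exhaustive global method can be sketched as a dynamic-programming table followed by a traceback. This is a minimal illustration, not a production implementation: the function name and the scores (match 1, mismatch 0, gap -1, with a linear gap penalty) are assumptions chosen for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment.

    Scores are illustrative assumptions; the gap penalty is linear.
    """
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a

    # Trace back from the bottom-right corner to the top-left corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Because the traceback always runs from one corner of the table to the other, every residue of both sequences appears in the result, which is why the method suits sequences of similar length.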

Smith-Waterman

Exact local alignment.

There is no requirement for the alignment to extend along the entirety of the sequences. This is a very good algorithm for database searching, multiple alignment and pairwise alignment.

It is exhaustive and can be very slow (compared to the heuristics described later). The difference between this and the N-W algorithm is that alignments starting at all possible positions must be considered, not just the ones that start at the beginning and end at the end.
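The difference from the global method can be seen in a short sketch: cell values are never allowed to fall below zero, and the traceback starts at the highest-scoring cell rather than at the corner. The function name and the scores (match 2, mismatch -1, gap -2) are assumptions for illustration; a negative mismatch score is used so that a local alignment has a natural end.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one best local alignment.

    Scores are illustrative assumptions; the gap penalty is linear.
    """
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets an alignment start afresh at any position.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

The table has one cell for every pair of positions, so the cost grows with the product of the two sequence lengths; this is the slowness that the heuristics avoid.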

Commonly-used search algorithms

Algorithm         Exhaustive?  Gaps?  Local/Global  Multiple alignment  Database searches

Needleman-Wunsch  Yes          Yes    Global        Yes                 No

Smith-Waterman    Yes          Yes    Local         Yes                 Yes

FASTA             No           Yes    Local         Yes                 Yes

BLAST             No           No     Local         No                  Yes

FastA algorithm

Pearson, W. R. (1996). Effective protein sequence comparison. Academic Press Inc. 227-258.
Pearson, W. R. and Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85: 2444-2448.

• First, regions of identity between the query and database sequences are identified (the word length is set by the KTUP parameter).

• The database sequences with the highest density of matching 'hits' are then re-examined.

• The alignments are extended at either end of the matching regions, and mismatches and indels are incorporated according to a scoring matrix.

• The sequence alignment then gets a score (sometimes a match is 1, a mismatch is 0 and a gap is -1).
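The first two steps (finding identical words and measuring how densely they fall) can be sketched as follows. This is a simplified illustration, not the FastA program itself: the function name is hypothetical, and hit density is approximated by counting identical words per diagonal (query position minus database position).

```python
from collections import Counter, defaultdict


def ktup_diagonal_hits(query, subject, ktup=2):
    """Count identical words of length ktup on each diagonal.

    A diagonal is query_position - subject_position; many hits on one
    diagonal suggest an ungapped region of similarity.
    """
    # Index every word of the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)

    # Scan the database sequence and record the diagonal of every hit.
    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            diagonals[i - j] += 1
    return diagonals


hits = ktup_diagonal_hits("ACDEFG", "XXACDEFG", ktup=2)
best_diagonal, hit_count = hits.most_common(1)[0]
```

Database sequences could then be ranked by the hit count of their best diagonal before the more expensive extension step is attempted.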

PAM 250 matrix

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   2
R  -2   6
N   0   0   2
D   0  -1   2   4
C  -2  -4  -4  -5  12
Q   0   1   1   2  -5   4
E   0  -1   1   3  -5   2   4
G   1  -3   0   1  -3  -1   0   5
H  -1   2   2   1  -3   3   1  -2   6
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F  -4  -4  -4  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P   1   0  -1  -1  -3   0  -1  -1   0  -2  -3  -1  -2  -5   6
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4

FastA algorithm

• Is the alignment significant?

• Could we see an alignment like this purely by chance?

• What are the statistics involved?

[Histogram from the FastA output. For each score bin from < 20 to > 120 (in steps of 2), the z-opt column gives the number of database sequences observed and the E() column gives the number expected by chance; '=' bars plot the observed counts and '*' marks the expected counts. Observed counts peak at 1332 in bin 42 and expected counts peak at 1233 in bin 46; one sequence scores above 120, where none is expected.]

Results of a FastA search

The best scores are:                                 initn  init1  opt   z-sc   E(13127)
HP0793 polypeptide deformylase (def) {Escherichia       66     66  100  126.9   0.71
AF2215 methylmalonyl-CoA mutase, subunit alpha, N       45     45   94  113.9   1.2
AF1231 hypothetical protein                             50     50   86  104.9   4.4
MJ1169 tungsten formylmethanofuran dehydrogenase,       45     45   85  102.7   4.8
AF0267 hypothetical protein                             71     71   84  101.2   5.5
AF1486 hypothetical protein                             83     83   84  102.4   6.1
AF0262 medium-chain acyl-CoA ligase (alkK-2) {Pse       50     50   82   99.2   7.8
AF0229 conserved hypothetical protein {Methanococ       58     58   83  103.0   8.2
D09_orf125.gseg, 378 bases, 5AC53121 checksum.          50     50   85  110.0   8.5
SL251_1.UVRC 1797 residues                              40     40   81   97.5   8.9
slr2049 hypothetical protein                            83     83   83  105.5   9.9
AF0868 alkyldihydroxyacetonephosphate synthase {C       45     45   80   97.7   12
AF1320 GMP synthase (guaA-2) {Methanococcus janna       35     35   82  104.5   12
SL159_1.PKSK 13344 residues                             99     74   74   79.2   12
slr1771                                                 40     40   79   95.6   13
sll1018 dihydroorotase (pyrC)                           60     60   79   96.6   14
slr2102 cell division protein FtsY (ftsY)               77     77   78   94.7   15
AF0946 hypothetical protein                             67     67   76   88.8   16
AF1325 multidrug resistance protein {Methanococcu       55     55   77   95.0   20
SL194_2.BFMBB 1272 residues                             75     75   76   93.1   22

Original BLAST

• Segment pair - This is a pair of subsequences of the same length that form an ungapped alignment.

• BLAST searches for all segment pairs between the query sequence and all of the sequences in the database (above a certain threshold).

• HSP - High-Scoring pair.

Original BLAST

• HSPs are derived by first finding the pairs that satisfy the threshold (T) conditions. Then the alignment is extended in both directions until the quality of the alignment drops off dramatically or falls to zero.

• The HSPs are then sorted according to their score.
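The extension step can be sketched as an ungapped walk along the diagonal in each direction from a seed, stopping once the running score has dropped a fixed amount below the best score seen. This is a simplified illustration: the function names, the match/mismatch scores and the `x_drop` value are assumptions, the seed is taken to be an exact word match, and only the drop-off stopping rule is modelled.

```python
def _extend(pairs, match, mismatch, x_drop):
    """Extend over residue pairs; return (best_gain, length_at_best)."""
    best = running = 0
    best_len = 0
    for length, (x, y) in enumerate(pairs, start=1):
        running += match if x == y else mismatch
        if running > best:
            best, best_len = running, length
        elif best - running >= x_drop:
            break  # quality has dropped off too far; stop extending
    return best, best_len


def extend_seed(query, subject, qi, sj, word=3, match=1, mismatch=-1, x_drop=3):
    """Extend the seed query[qi:qi+word] / subject[sj:sj+word] without gaps.

    Returns (score, query_start, query_end, subject_start), with
    query_end exclusive. All parameter values are illustrative.
    """
    seed = sum(match if query[qi + k] == subject[sj + k] else mismatch
               for k in range(word))
    right_gain, right_len = _extend(
        zip(query[qi + word:], subject[sj + word:]), match, mismatch, x_drop)
    left_gain, left_len = _extend(
        zip(reversed(query[:qi]), reversed(subject[:sj])), match, mismatch, x_drop)
    score = seed + left_gain + right_gain
    return score, qi - left_len, qi + word + right_len, sj - left_len


# Seed "TTA" starts at position 5 in both sequences.
hsp = extend_seed("AAAGATTACAGGG", "CCCGATTACATTT", qi=5, sj=5)
```

Collecting one such tuple per seed and sorting by the first element gives the score-ordered list of HSPs described above.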

Gapped BLAST

• The original BLAST suffered from the limitation of not being able to introduce gaps into the alignment.

• Gapped BLAST is an effort to circumvent this shortcoming.

• Experience shows that often several ungapped non-overlapping alignments result from a match to a single database entry.

Gapped BLAST

• Intuitively, we know that it probably makes sense to generate a single alignment of the query and database sequences.

• Gapped BLAST seeks only one (instead of all) of the significant ungapped alignments between query and database sequence.

• This speeds up the process.

Two-Hit Method

• Find 2 HSPs within a distance m of each other on the same diagonal.

• Do not attempt any HSP extension unless you find two regions that meet this criterion.

• Attempt to generate a single gapped alignment in this region.
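The trigger condition in the first two bullets can be sketched by grouping hits by diagonal and remembering the most recent hit on each. This is an illustrative sketch only: the function name, the word length of 3 and the default distance of 40 (standing in for m) are assumptions, and hits that overlap the previous hit on the same diagonal are simply ignored.

```python
from collections import defaultdict


def two_hit_triggers(hits, word=3, max_dist=40):
    """Return (diagonal, first_query_pos, second_query_pos) for each pair
    of non-overlapping hits on one diagonal within max_dist of each other.

    hits is an iterable of (query_pos, subject_pos) word-hit start positions.
    """
    by_diagonal = defaultdict(list)
    for q, s in hits:
        by_diagonal[q - s].append(q)

    triggers = []
    for diagonal, positions in by_diagonal.items():
        last = None
        for q in sorted(positions):
            if last is not None:
                dist = q - last
                if dist < word:
                    continue  # overlaps the previous hit; keep the earlier one
                if dist <= max_dist:
                    triggers.append((diagonal, last, q))
            last = q
    return triggers


hits = [(10, 4), (20, 14), (50, 10), (11, 5), (100, 94)]
triggers = two_hit_triggers(hits)
```

Only the positions returned here would be passed on to the extension step; isolated hits, such as the one on diagonal 40 in the example, are never extended.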

How does this affect the process of searching a database?

1. The threshold for identifying HSPs can be lowered (finding more HSPs and therefore slowing the process).

2. Fewer extensions are triggered (speeds up the process).

PSI-BLAST

• Position-Specific Iterative BLAST.

• One of a family of 'profile' searches.

• Reweights amino acids in the alignment.

• Performs an initial BLAST search.

• Select those 'hits' that appear to be significant (above a certain threshold).

• Use the alignment of these sequences to identify possible 'important' residues.

PSI-BLAST

• Similar sequences contain almost identical information.

• Distant relatives contain more information (if an amino acid residue is conserved in a distant relative, then it is likely to be important).

• PSI-BLAST takes into account the similarity of the 'hits' when identifying important residues.

PSI-BLAST

• Reweight those 'important' residues.

• Repeat the BLAST search, but this time giving an increased weight to the important residues.

• This process can be repeated ad infinitum, although usually 2 or 3 iterations will suffice.
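The reweighting idea can be sketched as a position-specific score table built from the aligned hits of one round, which would then be used to score candidates in the next round. This is a much-simplified illustration and not the PSI-BLAST procedure itself: the function names, the uniform background frequency and the fixed pseudocount are assumptions, the aligned hits are assumed to be ungapped and of equal length, and the weighting of hits by their mutual similarity is omitted.

```python
import math
from collections import Counter

ALPHABET = "ARNDCQEGHILKMFPSTWYV"


def build_profile(aligned, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    aligned: equal-length, ungapped sequences taken from significant hits.
    A uniform background and a fixed pseudocount are simplifying assumptions.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    background = 1.0 / len(ALPHABET)
    total = len(aligned) + pseudocount * len(ALPHABET)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned)
        profile.append({
            r: math.log2(((counts[r] + pseudocount) / total) / background)
            for r in ALPHABET
        })
    return profile


def score_with_profile(profile, seq):
    """Sum the position-specific scores of seq against the profile."""
    return sum(column[residue] for column, residue in zip(profile, seq))


profile = build_profile(["WLS", "WIS", "WLT"])
conserved = score_with_profile(profile, "WLS")
unrelated = score_with_profile(profile, "AAA")
```

A residue that is conserved in a column (W in the first column here) receives a high score at that position only, which is what distinguishes a profile from a single substitution matrix applied uniformly.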

PSI-BLAST

• Advantages:

Identify more distant relatives.

Faster than more exact methods.

Does not require a priori knowledge of the important residues.

• Disadvantages:

Can be misleading if an unrelated sequence is involved in reweighting the residues.

Not very reliable unless the initial BLAST search is capable of identifying homologues.

Similarity scoring matrices

Not all nucleotide or amino acid substitutions occur at equal frequencies (transitions are usually more frequent than transversions).

For DNA sequences it is usual to use a matrix that scores matches positively and mismatches negatively. Matching a known base with one of the ambiguity codes (IUPAC nomenclature) is usually treated less severely than a mismatch with a known base.
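A DNA scoring scheme of this kind can be sketched with the IUPAC ambiguity codes expanded into the sets of bases they stand for. The code table is the standard IUPAC nomenclature; the function names and the scores (match 1, mismatch -1, compatible ambiguity code 0) are assumptions for illustration.

```python
# IUPAC nucleotide codes and the bases each one can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def base_score(x, y, match=1, mismatch=-1, ambiguous=0):
    """Score one pair of nucleotide symbols (scores are illustrative)."""
    bases_x, bases_y = set(IUPAC[x]), set(IUPAC[y])
    if len(bases_x) == 1 and len(bases_y) == 1:
        return match if bases_x == bases_y else mismatch
    if bases_x & bases_y:
        # An ambiguity code that could be the other base: scored between
        # a true match and a mismatch.
        return ambiguous
    return mismatch


def dna_score(a, b):
    """Score two equal-length DNA sequences position by position."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(base_score(x, y) for x, y in zip(a, b))
```

For example, `dna_score("ACGT", "ACNT")` gives three matches plus one ambiguous position, which scores higher than the same pair with a true mismatch at the third base.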

Scoring Matrices

For protein sequences, the matrix can be a PAM, BLOSUM or Gonnet matrix.

These have been determined empirically by the direct comparison of related protein sequences.

In general, amino acid substitutions that are seen to occur very rarely are given a negative value.

Conservative substitutions (for instance an isoleucine for a leucine) are given a positive value. Identical matches are also given a positive score.

Identical Residues

The values of the identical matches vary and depend on whether the amino acids are common or rare. In the PAM 250 matrix, a tryptophan:tryptophan match is given a score of 17 (tryptophan being a relatively rare amino acid), whilst a serine:serine match is only given a score of 2 (serine is usually abundantly represented in protein sequences). The scores are set this way because common amino acids have a higher probability of aligning together by chance.
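Applying such a matrix amounts to a symmetric table lookup for each aligned pair. The sketch below uses only six entries copied from the PAM 250 matrix shown earlier; the dictionary and function names are hypothetical, and a real implementation would hold all 210 entries.

```python
# A few entries from the PAM 250 matrix above; each pair is stored once.
PAM250_SUBSET = {
    ("W", "W"): 17,
    ("S", "S"): 2,
    ("I", "I"): 5,
    ("L", "L"): 6,
    ("I", "L"): 2,   # conservative substitution, positive score
    ("S", "W"): -2,  # rarely observed substitution, negative score
}


def pair_score(x, y, matrix=PAM250_SUBSET):
    """Look up a substitution score, treating the matrix as symmetric."""
    key = (x, y) if (x, y) in matrix else (y, x)
    return matrix[key]


def ungapped_score(a, b):
    """Sum the substitution scores of two equal-length peptides."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(x, y) for x, y in zip(a, b))


score = ungapped_score("WIS", "WLS")  # 17 + 2 + 2
```

The single tryptophan match contributes most of the total, which reflects how much less likely a rare residue is to align with itself by chance.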

Significance of the similarity of two sequences

• How can you know if two sequences show a higher degree of similarity than could be expected by chance?

Similarity could be due to similar base/AA biases.

Similarity could be due to sequence simplicity.

Randomisation test

• Align the two sequences, record their score.

• Hold one sequence in its original form and randomise the order of the residues in the other sequence, record the score.

• Repeat many (1,000) times.

• The original score should be a better score than any score from the randomised data.
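The steps above can be sketched as follows. The scoring function is a compact local-alignment score with assumed parameters (match 2, mismatch -1, gap -2), and the function names and the fixed random seed are choices made for the example; any alignment score could be substituted.

```python
import random


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score, keeping only one row of the table."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            s = match if x == y else mismatch
            cur.append(max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best


def randomisation_test(a, b, trials=1000, seed=0):
    """Return (observed_score, fraction of shuffles scoring at least as well).

    Sequence a is held fixed; the residues of b are shuffled each trial,
    which preserves the composition of b but destroys its order.
    """
    rng = random.Random(seed)
    observed = local_score(a, b)
    residues = list(b)
    as_good = 0
    for _ in range(trials):
        rng.shuffle(residues)
        if local_score(a, "".join(residues)) >= observed:
            as_good += 1
    return observed, as_good / trials
```

A returned fraction of 0.0 corresponds to the last bullet: no randomised sequence scored as well as the original. Because shuffling keeps the residue composition, similarity that is due only to base or amino acid bias would tend to survive the shuffle and give a large fraction instead.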
