BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha

Jun 07, 2020


BLAST: Basic Local Alignment Search Tool

Altschul et al. J. Mol Bio. 1990.

CS 466
Saurabh Sinha


Motivation

• Sequence homology to a known protein suggests the function of a newly sequenced protein

• The bioinformatics task is to find homologous sequences in a database of sequences

• Databases of sequences growing fast


Alignment

• The natural approach to check if the “query sequence” is homologous to a sequence in the database is to compute the alignment score of the two sequences

• Alignment score counts gaps (insertions, deletions) and replacements

• Minimizing the evolutionary distance


Alignment

• Global alignment: optimize the overall similarity of the two sequences

• Local alignment: find only relatively conserved subsequences

• Local similarity measures preferred for database searches

– Distantly related proteins may only share isolated regions of similarity


Alignment

• Dynamic programming is the standard approach to sequence alignment

• Algorithm is quadratic: running time grows with the product of the lengths of the two sequences

• Not practical for searches against a very large database of sequences (e.g., whole genome)
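
To make the quadratic cost concrete, here is a minimal sketch of a dynamic-programming local alignment score (the Smith-Waterman recurrence with a linear gap penalty). The function name and the gap penalty of -6 are assumptions chosen for illustration; the match and mismatch values are the DNA scores quoted later in these slides.

```python
def local_alignment_score(a, b, match=5, mismatch=-4, gap=-6):
    """Best local alignment score of a and b (Smith-Waterman, linear gaps).

    The gap penalty is an illustrative assumption, not a value from the slides.
    The two nested loops visit len(a) * len(b) cells, hence the quadratic cost.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                 # start a new local alignment here
                prev[j - 1] + s,   # align a[i-1] with b[j-1]
                prev[j] + gap,     # gap in b
                curr[j - 1] + gap, # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

Running this once per database sequence is what becomes impractical as the database grows.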


Scoring alignments

• Scoring matrix: 4 x 4 matrix (DNA) or 20 x 20 matrix (protein)

• Amino acid sequences: “PAM” matrix

– Consider amino acid sequence alignments for very closely related proteins, extract replacement frequencies (probabilities), extrapolate to greater evolutionary distances

• DNA sequences: match = +5, mismatch = -4


BLAST: the MSP

• Given two sequences of the same length, the similarity score of their alignment (without gaps) is the sum of similarity values for each pair of aligned residues

• Maximal segment pair (MSP): highest scoring pair of identical length segments from the two sequences

• The similarity score of an MSP is called the MSP score

• BLAST heuristically aims to find the MSP
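
As a reference point for what BLAST approximates, the MSP score can be computed exactly by brute force. The sketch below is not BLAST; it assumes the DNA scores from the earlier slide and uses a hypothetical function name. Every gap-free segment pair lies on one diagonal of the comparison matrix, so the code takes the best-scoring run on each diagonal.

```python
def msp_score(a, b, match=5, mismatch=-4):
    """Exact MSP score: best score over all equal-length, gap-free segment pairs.

    Takes time proportional to len(a) * len(b). Returns 0 if no segment
    pair scores above zero.
    """
    best = 0
    # offset = (position in b) - (position in a) identifies one diagonal.
    for offset in range(-(len(a) - 1), len(b)):
        i = max(0, -offset)
        j = i + offset
        running = 0
        while i < len(a) and j < len(b):
            running += match if a[i] == b[j] else mismatch
            running = max(running, 0)  # drop a prefix that only hurts the score
            best = max(best, running)
            i += 1
            j += 1
    return best
```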


Locally maximal segment pair

• A molecular biologist may be interested in all conserved regions shared by two proteins, not just their highest scoring pair

• A segment pair (segments of identical lengths) is locally maximal if its score cannot be improved by extending or shortening in either direction

• BLAST attempts to find all locally maximal segment pairs above some score cutoff.


Rapid approximation of MSP score

• Goal is to report those database sequences that have MSP score above some threshold S.

• Statistics tells us what is the highest threshold S at which “chance similarities” are likely to appear

• Tractability to statistical analysis is one of the attractive features of the MSP score


Rapid approximation of MSP score

• BLAST minimizes time spent on database sequences whose similarity with the query has little chance of exceeding this cutoff S.

• Main strategy: seek only segment pairs (one from database, one from query) that contain a word pair with score >= T

• Intuition: If the sequence pair has to score above S, its most well matched word (of some predetermined small length) must score at least T

• Lower T => Fewer false negatives

• Lower T => More pairs to analyze


Implementation

1. Compile a list of high scoring words

2. Scan database for hits to this word list

3. Extend hits


Step 1: Compiling list of words from query sequence

• For proteins: List of all w-length words that score at least T when compared to some word in query sequence

• Question: Does every word in the query sequence make it to the list?

• For DNA: list of all w-length words in the query sequence, often with w=12
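
Both kinds of word list can be sketched in a few lines. The helper names below are hypothetical, and the neighborhood search simply enumerates every possible w-length word, which is only workable for tiny w and small alphabets; it is meant to show the definition, not an efficient construction. The scoring function is passed in, so a PAM-style matrix lookup could be supplied in its place.

```python
from itertools import product


def word_score(u, v, score):
    """Sum of pairwise scores for two words of equal length."""
    return sum(score(x, y) for x, y in zip(u, v))


def neighborhood_words(query, w, T, alphabet, score):
    """Protein-style list: every w-length word scoring >= T against some
    query word, mapped to the query positions it was generated from."""
    hits = {}
    for pos in range(len(query) - w + 1):
        qword = query[pos:pos + w]
        for letters in product(alphabet, repeat=w):  # brute-force enumeration
            cand = "".join(letters)
            if word_score(qword, cand, score) >= T:
                hits.setdefault(cand, []).append(pos)
    return hits


def exact_words(query, w):
    """DNA-style list: just the w-length words occurring in the query."""
    hits = {}
    for pos in range(len(query) - w + 1):
        hits.setdefault(query[pos:pos + w], []).append(pos)
    return hits
```

With a toy score of +5 for a match and -4 for a mismatch and w=3, a query word scores 15 against itself, so T=6 admits every word with at most one mismatch, while any T above 15 leaves the list empty. That suggests an answer to the question on the slide: a query word enters the list only if its score against itself (or against another query word) reaches T.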


Step 2: Scanning the database for hits

• Find exact matches to list words

• Can be done in linear time

– two methods (next slides)

• Each word in list points to all occurrences of the word in word list from previous step


Scanning the database for hits

• Method 1: Let w=4, so 20^4 possible words

• Each integer in 0 … 20^4-1 is an index for an array

• Array element points to list of all occurrences of that word in query

• Not all 20^4 elements of array are populated

– only the ones in word list from previous step
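
Method 1 amounts to reading each word as a base-20 number. The following sketch assumes an arbitrary alphabetical ordering of the 20 standard amino acid letters and hypothetical function names; any fixed ordering would do.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues
CODE = {aa: k for k, aa in enumerate(AMINO_ACIDS)}


def word_index(word):
    """Interpret a word as a base-20 integer in 0 .. 20**len(word) - 1."""
    idx = 0
    for ch in word:
        idx = idx * 20 + CODE[ch]
    return idx


def build_table(word_list, w=4):
    """Array of 20**w slots; only slots for words in the list are populated."""
    table = [None] * (20 ** w)
    for word, query_positions in word_list.items():
        table[word_index(word)] = query_positions
    return table


def scan(database_seq, table, w=4):
    """Return (query_pos, database_pos) for every hit, in one left-to-right pass."""
    hits = []
    for dpos in range(len(database_seq) - w + 1):
        query_positions = table[word_index(database_seq[dpos:dpos + w])]
        if query_positions:
            hits.extend((qpos, dpos) for qpos in query_positions)
    return hits
```

Each database position costs one array lookup, which is what makes the scan linear in the database length for fixed w.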


Scanning the database for hits

• Method 2: use “deterministic finite automaton” or “finite state machine”.

• Similar to the keyword trees seen in course.

• Build the finite state machine out of all words in word list from previous step
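
One way to realize such a machine is a keyword tree with failure links in the style of Aho-Corasick. This is an assumption about the construction, since the slides only say the automaton resembles a keyword tree; a true DFA would precompute every transition instead of following failure links at scan time. Names are hypothetical.

```python
from collections import deque


def build_automaton(words):
    """Keyword tree over the word list plus failure links (Aho-Corasick style)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # longest proper suffix of the state that is also a tree prefix
    out = [[]]    # words recognized on entering each state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    queue = deque(goto[0].values())  # depth-1 states keep fail = 0
    while queue:                     # breadth-first, so shallower states are done first
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def scan(text, automaton):
    """Return (start_position, word) for every list word occurring in text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((pos - len(word) + 1, word))
    return hits
```

The database is read once, character by character, regardless of how many words are in the list.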


Step 3: Extending hits

• Once a word pair with score >= T has been found, extend it in each direction.

• Extend to see whether a segment pair with score >= S is obtained

• During extension, score may go up, and then down, and then up again

• Terminate if it goes down too much (a certain distance below the best score found for shorter extensions)

• One implementation allows gaps during extension
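
The gap-free version of this extension can be sketched as follows. The function name, the return shape, and the default drop-off of 20 are assumptions for illustration; the slides only say that extension stops once the score falls a certain distance below the best seen so far.

```python
def extend_hit(query, subject, qpos, spos, w, score, drop=20):
    """Extend a w-length word hit in both directions without gaps.

    Returns (best_score, query_start, query_end) with query_end exclusive.
    `drop` is the allowed fall below the best score before giving up.
    """
    seed = sum(score(query[qpos + k], subject[spos + k]) for k in range(w))

    def one_direction(qi, si, step):
        best = running = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            running += score(query[qi], subject[si])
            length += 1
            if running > best:
                best, best_len = running, length  # remember the best extension
            elif best - running > drop:
                break                             # fell too far below the best
            qi += step
            si += step
        return best, best_len

    right, right_len = one_direction(qpos + w, spos + w, +1)
    left, left_len = one_direction(qpos - 1, spos - 1, -1)
    return seed + left + right, qpos - left_len, qpos + w + right_len
```

A caller would then compare the returned score with S to decide whether to report the segment pair. A small drop-off stops early and can miss a segment that recovers after a mismatch, while a large one spends more time on hopeless extensions.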


BLAST: approximating the MSP

• BLAST may not find all segment pairs above threshold S

• Trying to approximate the MSP

• Bounds on the error: not hard bounds, but statistical bounds

– “Highly likely” to find the MSP


Statistics

• Suppose the MSP has been calculated by BLAST (and suppose this is the true MSP)

• Suppose this observed MSP scores S.

• What are the chances that the MSP score for two unrelated sequences would be >= S?

• If the chances are very low, then we can be confident that the two sequences must not have been unrelated


Statistics

• Given two random sequences of lengths m and n

• Probability that they will produce an MSP score of >= x ?


Statistics

• Number of separate SPs with score >= x is Poisson distributed with mean $y(x) = Kmn\,e^{-\lambda x}$, where

• $\lambda$ is the positive solution of $\sum_{i,j} p_i p_j e^{\lambda s(i,j)} = 1$

• K is a constant

• s(i,j) is the scoring matrix, $p_i$ is the frequency of i in random sequences
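
The equation for λ has no closed form in general, but it is easy to solve numerically. The sketch below uses bisection and assumes the expected score per aligned pair is negative and at least one score is positive, which is what guarantees a positive root. Function names are hypothetical, and K is left as a caller-supplied parameter because the slides give no value for it.

```python
import math


def solve_lambda(freqs, score, iterations=100):
    """Positive root of sum_{i,j} p_i p_j exp(lambda * s(i,j)) = 1 by bisection.

    Assumes a negative expected score and at least one positive score;
    lambda = 0 is always a trivial root and is deliberately avoided.
    """
    letters = list(freqs)

    def f(lam):
        total = sum(freqs[a] * freqs[b] * math.exp(lam * score(a, b))
                    for a in letters for b in letters)
        return total - 1.0

    hi = 1.0
    while f(hi) <= 0.0:   # grow until the function is positive
        hi *= 2.0
    lo = hi
    while f(lo) >= 0.0:   # shrink until it is negative (just right of zero)
        lo /= 2.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def expected_count(x, m, n, lam, K):
    """y(x) = K * m * n * exp(-lambda * x): expected number of SPs scoring >= x."""
    return K * m * n * math.exp(-lam * x)
```

For uniform base frequencies and the +5/-4 DNA scores, this gives a λ of roughly 0.19.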


Statistics

• Poisson distribution with mean $y$: $\Pr(\#\mathrm{SPs} = k) = \dfrac{e^{-y}\, y^{k}}{k!}$

• $\Pr(\#\mathrm{SPs} \ge \alpha) = 1 - \Pr(\#\mathrm{SPs} \le \alpha - 1) = 1 - e^{-y} \sum_{i=0}^{\alpha-1} \dfrac{y^{i}}{i!}$
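
The Poisson tail probability is a short finite sum, so it can be evaluated directly. A minimal sketch, with a hypothetical function name, where y stands for the mean y(x) defined on the previous slide:

```python
import math


def prob_at_least(alpha, y):
    """Pr(#SPs >= alpha) when #SPs is Poisson distributed with mean y."""
    term = math.exp(-y)  # the i = 0 term, e^{-y} * y^0 / 0!
    cdf = 0.0
    for i in range(alpha):
        cdf += term
        term *= y / (i + 1)  # turn the i-th term into the (i+1)-th
    return 1.0 - cdf
```

For alpha = 1 the sum has a single term, and the result reduces to 1 - exp(-y).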


Statistics

• For α=1, $\Pr(\#\mathrm{SPs} \ge 1) = 1 - e^{-y(x)}$

• Choose S such that $1 - e^{-y(S)}$ is small

• Suppose the probability of having at least 1 SP with score >= S is 0.001.

• This seems reasonably small

• However, if you test 10000 random sequences, you expect 10 to cross the threshold

• Therefore, require “E-value” to be small.

• That is, expected number of random sequence pairs with score >= S should be small.
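
Both points can be checked with a few lines of arithmetic. The sketch below inverts y(S) = K m n exp(-λS) to get the score needed for a target expected count, and multiplies a per-comparison probability by the number of comparisons. The function names are hypothetical, and the parameter values used when calling them are up to the reader, since the slides give no values for K or λ.

```python
import math


def score_cutoff(evalue, m, n, lam, K):
    """Smallest S with K * m * n * exp(-lam * S) <= evalue."""
    return math.log(K * m * n / evalue) / lam


def expected_false_hits(p_single, num_sequences):
    """Expected number of random sequences crossing the threshold."""
    return p_single * num_sequences
```

The second function reproduces the example on the slide: a per-sequence probability of 0.001 over 10000 random sequences gives an expected 10 chance hits.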


More statistics

• We just saw how to choose threshold S

• How to choose T ?

• BLAST is trying to find segment pairs (SPs) scoring above S

• If an SP scores S, what is the probability that it will have a w-word match of score T or more?

• We want this probability to be high


More statistics: Choosing T

• Given a segment pair (from two random sequences) that scores S, what is the probability q that it will have no w-word match scoring T or more?

• Want this q to be low

• Obtained from simulations

• Found to decrease exponentially as S increases


BLAST is the universally used bioinformatics tool


http://flybase.org/blast/