Biology 4900 Biocomputing
Jan 19, 2016
Chapter 4: BLAST
BLAST
BLAST allows a user to search a sequence (the query) against millions of sequences in the NCBI database (the target).
Global alignments (e.g., Needleman-Wunsch) would be time-consuming and computationally intensive for this amount of data.
BLAST is designed for local alignment, not global alignment. This allows faster searches and can match subsets of proteins (e.g., domains).
[Figure: C-terminal domain of CaM (from 3cln.pdb), with the Ca2+ sites and the F helix labeled]
Other BLAST Programs
Blastx: Compares a nucleotide query sequence translated in all reading frames (3 possible proteins for each DNA strand) against a protein sequence DB.
Tblastn: Compares a protein query sequence against a nucleotide sequence DB translated in all reading frames.
Tblastx: Compares the 6-frame translations of a nucleotide query sequence against the 6-frame translations of a nucleotide sequence database.
Pevsner, Bioinformatics and Functional Genomics, 2009
Top strand, three reading frames:
5’ CAT CAA
5’ ATC AAC
5’ TCA ACT
Bottom strand, three reading frames:
5’ GTG GGT
5’ TGG GTA
5’ GGG TAG
Double-stranded sequence:
5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
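The six reading frames can be reproduced with a short Python sketch. It only splits each strand into codons and does not translate them; the function names are made up for this illustration.

```python
def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(dna))


def reading_frames(dna):
    """Split one strand into complete codons for each of its three frames."""
    frames = []
    for offset in range(3):
        codons = [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]
        frames.append(codons)
    return frames


top = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
bottom = reverse_complement(top)  # bottom strand, read 5' to 3'

for strand in (top, bottom):
    for codons in reading_frames(strand):
        # Show only the first two codons of each frame, as on the slide.
        print("5'", " ".join(codons[:2]))
```

Three frames on each of the two strands give the six translations that blastx, tblastn and tblastx work with.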
Choose the BLAST program
Program   Input     Database   Comparisons
blastn    DNA       DNA        1
blastp    protein   protein    1
blastx    DNA       protein    6
tblastn   protein   DNA        6
tblastx   DNA       DNA        36
BLAST (Altschul 1990)
BLAST uses a pre-indexed database of ‘words’ for all proteins in the database (similar to FASTA).
A word is defined as a short sequence of letters.
• For Blastp, the default word (W) size is 3 letters.
• For Blastn, the default word (W) size is 11 letters.
• For MegaBLAST (nucleotide), the default word (W) size is 28 letters.
When you run a query, BLAST breaks your query sequence into a series of words, and generates neighborhood words, as in the following example:
http://www.incogen.com/bioinfo_tutorials/Bioinfo-Lecture_2-pairwise-align.html
For sequence …FSGTWYA…, a list of words (w=3) is:
Words:              FSG SGT GTW TWY WYA
Neighborhood words: YSG TGT ATW SWY WFA
                    FTG SVT GSW TWF WYS
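Breaking a query into overlapping words takes one line of Python. This is a minimal sketch of that first step only; the function name is made up here, and generating the neighborhood words would additionally need a substitution matrix.

```python
def query_words(sequence, w=3):
    """Return every overlapping word of length w, in order of position."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


print(query_words("FSGTWYA"))
# ['FSG', 'SGT', 'GTW', 'TWY', 'WYA']
```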
Why use BLAST?
• BLAST searching is fundamental to understanding the relatedness of any favorite query sequence to other known proteins or DNA sequences.
Applications include
• identifying orthologs and paralogs
• discovering new genes or proteins
• discovering variants of genes or proteins
• investigating expressed sequence tags (ESTs)
• exploring protein structure and function
Four steps to becoming a Master BLASTer
http://mestadelsbilder.wordpress.com/2011/10/23/master-blaster/
(1) Choose the sequence (query)
(2) Select the BLAST program
(3) Choose the database to search
(4) Choose optional parameters (may leave as default params the first time)
Then click “BLAST”
Step 1: Choose your sequence
The sequence can be input in FASTA format as text or by file upload, or as an accession number.
Example of the FASTA format for a BLAST query
Note link here
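A FASTA record is a header line starting with ">" followed by one or more lines of sequence. The sketch below parses such text in Python. The header is hypothetical, and the sequence is the short RBP fragment used later in these notes, wrapped over two lines to show that sequence lines are joined.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical header; the sequence is the RBP fragment from these notes.
example = """>query1 RBP fragment (hypothetical header)
KENFDKARFSG
TWYAMAKKDPEG
"""

for header, sequence in parse_fasta(example):
    print(header, len(sequence), sequence)
```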
Step 2: Choose the BLAST program
Blastn and blastp are the main programs you will want to use
Step 3: Choose the database to search
nr = non-redundant (most general database)
dbest = database of expressed sequence tags
dbsts = database of sequence tag sites
gss = genomic survey sequences
protein databases
nucleotide databases
Step 4a: Select optional search parameters
Entrez!
algorithm
organism
Step 4a: optional blastp search parameters
Filter, mask
Scoring matrix
Word size
Expect
Right. So, what are these?
Step 4a: optional blastn search parameters
Filter, mask
Match/mismatch scores
Word size
Expect
Algorithm Parameters: Expect
• This setting specifies the statistical significance threshold for reporting matches against database sequences.
• The default value (10) means that 10 such matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990).
• If the statistical significance ascribed to a match is greater than the EXPECT threshold, the match will not be reported.
• Lower EXPECT thresholds (e.g., set expect to 6) are more stringent, leading to fewer chance matches being reported.
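The effect of the threshold can be illustrated with a small Python filter. This is a sketch of the reporting rule only, not of how BLAST computes E values. The hit names are hypothetical; three of the E values are taken from the alignments later in these notes and the fourth (8.2) is made up.

```python
def report_hits(hits, expect=10.0):
    """Keep hits whose E value does not exceed the threshold, best first."""
    kept = [hit for hit in hits if hit[1] <= expect]
    return sorted(kept, key=lambda hit: hit[1])


# (name, E value); names are hypothetical, 8.2 is an invented chance-level hit.
hits = [("hit_a", 3e-97), ("hit_b", 1e-06), ("hit_c", 3e-05), ("hit_d", 8.2)]

print(report_hits(hits))               # default threshold of 10 keeps all four
print(report_hits(hits, expect=1e-5))  # a stricter threshold keeps fewer hits
```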
http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#Matrix/
Algorithm Parameters: Word Size
• BLAST is a heuristic algorithm (makes approximations) that works by finding word-matches between the query and database sequences. This process finds "hot-spots" that BLAST can then potentially extend into full-blown alignments.
• For nucleotide-nucleotide searches (i.e., "blastn") an exact match of the entire word is required before an extension is initiated, so that one normally regulates the sensitivity and speed of the search by increasing or decreasing the word-size.
• For other BLAST searches non-exact word matches are taken into account based upon the similarity between words. The amount of similarity can be varied so one normally uses just the word-sizes 2 and 3 for these searches.
http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#Matrix/
KENFDKARFSGTWYAMAKKDPEG 50 RBP (query)
MKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin (hit)
Hit! Extend left, extend right.
Algorithm Parameters: Filters
• The Low-complexity filter option masks parts of the query sequence that may represent very common, non-complex subsets of sequence.
• May not be very useful.
• The Species-specific repeats filter option is designed to ignore species-specific genomic repeats in very long sequences.
http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#Matrix/
Algorithm Parameters: Masks
• The Mask for lookup table only option masks only for purposes of constructing the lookup table used by BLAST, so that no hits are found based upon low-complexity sequence or repeats (if repeat filter is checked).
• The BLAST extensions are performed without masking and so they can be extended through low-complexity sequence.
• The Mask lower case letters option lets you cut and paste a FASTA sequence in upper case characters and denote areas you would like filtered with lower case. This allows you to customize what is filtered from the sequence during the comparison to the BLAST databases.
http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#Matrix/
Ex. agvgpADEEWGYilmaagDDEEE
The parts of the sequence in lowercase (LC) letters are masked, or ignored.
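The lowercase masking convention can be mimicked in a few lines of Python. This sketch replaces every lowercase residue with a mask character; using "X" is an assumption for illustration, and the function is not part of BLAST.

```python
def mask_lowercase(sequence, mask_char="X"):
    """Replace every lowercase letter with mask_char; keep the rest as is."""
    return "".join(mask_char if ch.islower() else ch for ch in sequence)


print(mask_lowercase("agvgpADEEWGYilmaagDDEEE"))
# XXXXXADEEWGYXXXXXXDDEEE
```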
Algorithm Parameters: Match/Mismatch Scores
• Many nucleotide searches use a simple scoring system that consists of a "reward" for a match and a "penalty" for a mismatch.
• The (absolute) reward/penalty ratio should be increased as one looks at more divergent sequences.
  – A ratio of 0.33 (1/-3) is appropriate for sequences that are about 99% conserved
  – A ratio of 0.5 (1/-2) is best for sequences that are 95% conserved
  – A ratio of about one (1/-1) is best for sequences that are 75% conserved
States DJ, Gish W, and Altschul SF (1991)
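To see how the reward/penalty choice changes a score, the following Python sketch scores an ungapped nucleotide alignment. The two sequences are made up for illustration (ten bases with a single mismatch), and the function name is hypothetical.

```python
def nucleotide_score(seq_a, seq_b, reward=1, penalty=-3):
    """Score two aligned, equal-length sequences with match/mismatch values."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(reward if a == b else penalty for a, b in zip(seq_a, seq_b))


a = "CATCAACTAC"
b = "CATCTACTAC"  # differs from a at one position

for reward, penalty in ((1, -3), (1, -2), (1, -1)):
    print(f"{reward}/{penalty}:", nucleotide_score(a, b, reward, penalty))
```

The same mismatch costs less as the ratio approaches one, which is why the milder penalties tolerate more divergent sequences.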
Algorithm Parameters: Matrices
• A key element in evaluating the quality of a pairwise sequence alignment is the "substitution matrix", which assigns a score for aligning any possible pair of residues.
• Some matrices are good for comparing sequences that diverge very little, while other matrices are good for comparing sequences that diverge a lot.
• The BLOSUM-62 matrix is among the best for detecting most weak protein similarities.
• The BLOSUM-45 matrix may be better for particularly long and weak alignments.
• The older PAM matrices may be better for short alignments, as these need to have a higher percentage of matching residues to exceed background noise (be detectable beyond random chance).
http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#Matrix/
Matrices and Gap Costs
Query Length   Substitution Matrix   Gap Costs
<35            PAM-30                (9,1)
35-50          PAM-70                (10,1)
50-85          BLOSUM-80             (10,1)
>85            BLOSUM-62             (10,1)
The raw score of an alignment is the sum of the scores for aligning pairs of residues and the scores for gaps. Gapped BLAST and PSI-BLAST use "affine gap costs" which charge the score -a for the existence of a gap, and the score -b for each residue in the gap. Thus a gap of k residues receives a total score of -(a+bk); specifically, a gap of length 1 receives the score -(a+b).
Your total raw score for the alignment is reduced when you introduce gaps into the query sequence.
Calculate the score in BLOSUM-62 for a gap with 7 residues…
http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#Matrix/
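The affine gap formula -(a+bk) is easy to check in Python. The sketch below defaults to the gap costs (10,1) that the table above lists for BLOSUM-62; the function name is made up for this example.

```python
def gap_score(k, existence=10, extension=1):
    """Score of a gap of k residues under affine costs: -(a + b*k)."""
    if k < 1:
        raise ValueError("a gap must contain at least one residue")
    return -(existence + extension * k)


print(gap_score(7))  # gap of 7 residues with costs (10,1): -(10 + 1*7) = -17
print(gap_score(1))  # gap of length 1: -(a + b) = -11
```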
BLAST (Altschul 1990)
Word   Letter scores   Total score
Neighborhood word hit > threshold (T=11):
GTW    6,5,11          22
GSW    6,1,11          18
ATW    0,5,11          16
NTW    0,5,11          16
GTY    6,5,2           13
Neighborhood word hit < threshold (T):
ANT    1,0,-5          -4
Neighborhood words are similar to constructed words from query, with one or more mismatched symbols.
These are given scores based on the matrix that you are using (for BLAST, the default matrix is BLOSUM62).
Neighborhood words that score above a user-defined threshold are also searched.
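The threshold test can be reproduced from the letter scores listed in the table above. This Python sketch only sums those per-letter scores and compares the total with T=11; it does not look anything up in a substitution matrix, and the function name is made up here.

```python
T = 11  # threshold from the slide

# Per-letter scores copied from the table above (query word GTW).
letter_scores = {
    "GTW": (6, 5, 11),
    "GSW": (6, 1, 11),
    "ATW": (0, 5, 11),
    "NTW": (0, 5, 11),
    "GTY": (6, 5, 2),
    "ANT": (1, 0, -5),
}


def split_by_threshold(scores_by_word, threshold):
    """Return (kept, dropped) dicts of word -> total score."""
    kept, dropped = {}, {}
    for word, scores in scores_by_word.items():
        total = sum(scores)
        # The slide describes kept words as scoring above the threshold.
        if total > threshold:
            kept[word] = total
        else:
            dropped[word] = total
    return kept, dropped


kept, dropped = split_by_threshold(letter_scores, T)
print("searched:", kept)
print("ignored: ", dropped)
```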
BLAST (Altschul 1990)
BLAST then searches the entire database for the search words and neighborhood words.
Once a match is found, BLAST then extends the search in both directions of the sequence, scoring each subsequent match, until the score drops below some cutoff value.
KENFDKARFSGTWYAMAKKDPEG 50 RBP (query)
MKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin (hit)
Hit! Extend left, extend right.
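The extension step can be sketched in Python for the RBP/lactoglobulin pair shown above. This is a simplified illustration, not BLAST's implementation: it scores +1 for a match and -1 for a mismatch instead of using a substitution matrix, and the drop-off of 2 is an arbitrary value chosen for this example.

```python
def extend(query, subject, q_pos, s_pos, step,
           drop_off=2, match=1, mismatch=-1):
    """Extend without gaps from (q_pos, s_pos) in direction step (+1 or -1).

    Returns (best_gain, best_length): the largest score gain seen and the
    number of residues added to reach it. Extension stops once the running
    gain has fallen more than drop_off below the best gain.
    """
    best_gain, best_length = 0, 0
    gain, length = 0, 0
    while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
        gain += match if query[q_pos] == subject[s_pos] else mismatch
        length += 1
        if gain > best_gain:
            best_gain, best_length = gain, length
        elif best_gain - gain > drop_off:
            break
        q_pos += step
        s_pos += step
    return best_gain, best_length


query = "KENFDKARFSGTWYAMAKKDPEG"   # RBP fragment from the slide
subject = "MKGLDIQKVAGTWYSLAMAASD"  # lactoglobulin fragment from the slide
seed_start, seed_len = 10, 3         # 0-based position of the word GTW in both

right = extend(query, subject, seed_start + seed_len, seed_start + seed_len, +1)
left = extend(query, subject, seed_start - 1, seed_start - 1, -1)

start = seed_start - left[1]
end = seed_start + seed_len + right[1]
print("extended match:", query[start:end])
```

With these toy scores the word hit GTW grows by one residue to the right (Y) and not at all to the left.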
BLAST (1997)
In a 1997 refinement of BLAST, two independent hits are required.
The hits must occur in close proximity to each other.
With this modification, only 1/7 as many extensions occur, greatly speeding the time required for a search.
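The two-hit requirement can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual BLAST code: "close proximity" is taken to mean two word hits on the same diagonal (same subject-minus-query offset) no more than `window` positions apart, and the window of 40 is an illustrative value.

```python
def two_hit_triggers(hits, window=40):
    """Find pairs of word hits that would trigger an extension.

    hits: iterable of (query_pos, subject_pos) word-hit positions.
    Returns a list of (diagonal, first_query_pos, second_query_pos).
    """
    last_on_diagonal = {}
    triggers = []
    for q_pos, s_pos in sorted(hits):
        diagonal = s_pos - q_pos
        previous = last_on_diagonal.get(diagonal)
        # A second hit close enough to the previous hit on this diagonal.
        if previous is not None and 0 < q_pos - previous <= window:
            triggers.append((diagonal, previous, q_pos))
        last_on_diagonal[diagonal] = q_pos
    return triggers


# Hypothetical word hits: two close hits on diagonal 0, one isolated hit on
# another diagonal, and one hit on diagonal 0 that is too far away.
hits = [(10, 10), (25, 25), (30, 90), (200, 200)]
print(two_hit_triggers(hits))
# [(0, 10, 25)]
```

Isolated hits never trigger an extension, which is where the reduction in the number of extensions comes from.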
Changing BLAST Input Parameters
Increasing W or T will increase speed, but will result in loss of sensitivity (i.e., you will miss some matches)
The expect value (E-value) can be changed in order to limit the number of hits to the most significant ones. Lower E-value = better hit. The E-value depends on the length of the query sequence and the size of the database. Example: an alignment with an E-value of 0.05 means that 0.05 alignments scoring at least this well are expected by chance alone, i.e., roughly a 5 in 100 chance of such a hit occurring by chance.
BLAST Output from DB Search
Graphic Summary includes conserved domains, when applicable.
[Figure: Ca2+ sites and F helix labeled]
BLAST Output from DB Search
Graphic Summary includes the distribution of BLAST hits, color coded by bit score. A higher score is related to higher sequence identity.
High scores, low E values
BLAST search output: tabular output
BLAST search output: alignment output
BLAST output includes evolutionary tree view
Run 3cln to observe tree view options
Pairwise Alignment with Dot Plots
>lcl|24241 3CLN:A|PDBID|CHAIN|SEQUENCE
Length=148
Score = 268 bits (684), Expect = 3e-97, Method: Compositional matrix adjust.
Identities = 130/148 (88%), Positives = 143/148 (97%), Gaps = 0/148 (0%)
Query  1    AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN  60
            A+QLTEEQIAEFKEAF+LFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN
Sbjct  1    ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN  60
Query  61   GTIDFPEFLSLMARKMKEQDSEEELIEAFKVFDRDGNGLISAAELRHVMTNLGEKLTDDE  120
            GTIDFPEFL++MARKMK+ DSEEE+ EAF+VFD+DGNG ISAAELRHVMTNLGEKLTD+E
Sbjct  61   GTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEE  120
Query  121  VDEMIREADIDGDGHINYEEFVRMMVSK  148
            VDEMIREA+IDGDG +NYEEFV+MM +K
Sbjct  121  VDEMIREANIDGDGQVNYEEFVQMMTAK  148
[Dot plot: 3CLN vs. 1EXR]
Pairwise Alignment with Dot Plots
Score = 30.0 bits (66), Expect = 1e-06, Method: Compositional matrix adjust.
Identities = 14/51 (27%), Positives = 26/51 (51%), Gaps = 3/51 (6%)
Query  62   TIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNL  112
            + D  +F  M+  K K  D   ++++ F + DKD +G+I   EL  ++
Sbjct  23   SFDHKKFFQMVGLKKKSAD---DVKKVFHILDKDKSGFIEEDELGSILKGF  70
Score = 25.8 bits (55), Expect = 3e-05, Method: Compositional matrix adjust.
Identities = 11/40 (28%), Positives = 21/40 (53%), Gaps = 0/40 (0%)
Query  4    LTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNP  43
            L ++   + K+ F + DKD  G I   ELG++++    +
Sbjct  35   LKKKSADDVKKVFHILDKDKSGFIEEDELGSILKGFSSDA  74
[Dot plot: 3CLN vs. 1RTP]
Statistics of Local Alignments
• For local pairwise alignments, the best approach to determining statistical significance is to estimate an expect value (E value).
• The expect value E is the number of alignments with scores greater than or equal to score S (your score) that are expected to occur by chance in a database search.
• A score with an associated E value of $10^{-3}$ means that about 0.001 alignments with at least this score are expected by chance, i.e., roughly a 1 in 1000 chance of seeing such a score by chance alone.
• An E value is related to a probability value p.
• The key equation describing an E value is:
• $E = Kmn\,e^{-\lambda S}$
Pevsner, Bioinformatics and Functional Genomics, 2009
$E = Kmn\,e^{-\lambda S}$
• This equation is derived from a description of the extreme value distribution
• S = the score
• E = the expect value = the number of high-scoring segment pairs (HSPs) expected to occur with a score of at least S
• m, n = the length of two sequences
• $\lambda$, $K$ = Karlin-Altschul statistics
Some properties of the equation $E = Kmn\,e^{-\lambda S}$
• The value of E decreases exponentially with increasing S (higher S values correspond to better alignments). Very high scores correspond to very low E values.
• The expected score for aligning a pair of random residues must be negative! Otherwise, long random alignments would acquire great scores.
• Parameter K describes the search space (database).
• For E=1, one match with a similar score is expected to occur by chance. For a very much larger or smaller database, you would expect E to vary accordingly
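These properties can be checked numerically with a short Python sketch of the E value equation. The values of K and λ below are assumptions chosen for illustration (they are not given in these notes), as are the query and database lengths.

```python
import math


def expect_value(score, m, n, K, lam):
    """E = K * m * n * exp(-lam * S) for raw score S."""
    return K * m * n * math.exp(-lam * score)


# Assumed, illustrative parameters (not taken from the slides).
K, LAM = 0.041, 0.267
m, n = 148, 1_000_000  # hypothetical query length and database length

for raw in (40, 60, 80):
    print(raw, expect_value(raw, m, n, K, LAM))

# Doubling the database length doubles E for the same score.
print(expect_value(60, m, 2 * n, K, LAM) / expect_value(60, m, n, K, LAM))
```

Each 20-point increase in S multiplies E by the same factor, exp(-20λ), which is the exponential decrease described above.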
From raw scores to bit scores
• There are two kinds of scores: raw scores (calculated from a substitution matrix) and bit scores (normalized scores)
• Bit scores are comparable between different searches because they are normalized to account for the use of different scoring matrices and different database sizes
$S' = \text{bit score} = (\lambda S - \ln K) / \ln 2$
The E value corresponding to a given bit score is: $E = mn\,2^{-S'}$
Bit scores allow you to compare results between different database searches, even using different scoring matrices.
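The conversion from raw score to bit score can be tried in Python. The values λ = 0.267 and K = 0.041 are assumptions (they are not stated in these notes); with them, the raw scores 684, 66 and 55 from the alignments shown earlier come out at about 268, 30.0 and 25.8 bits, matching the reported bit scores. The sequence lengths m and n are hypothetical.

```python
import math


def bit_score(raw_score, K, lam):
    """S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def expect_from_bits(bits, m, n):
    """E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


# Assumed Karlin-Altschul parameters (not given in the slides).
K, LAM = 0.041, 0.267

for raw in (684, 66, 55):
    print(raw, round(bit_score(raw, K, LAM), 1))

# Both routes to E agree: K*m*n*exp(-lam*S) equals m*n*2^(-S').
m, n = 148, 1_000_000  # hypothetical lengths
direct = K * m * n * math.exp(-LAM * 66)
via_bits = expect_from_bits(bit_score(66, K, LAM), m, n)
print(direct, via_bits)
```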
The expect value E is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search. A p value is a different way of representing the significance of an alignment.
$p = 1 - e^{-E}$
How to interpret BLAST: E values and p values
Very small E values are very similar to p values. E values of about 1 to 10 are far easier to interpret than corresponding p values.
E        p
10       0.99995460
5        0.99326205
2        0.86466472
1        0.63212056
0.1      0.09516258 (about 0.1)
0.05     0.04877058 (about 0.05)
0.001    0.00099950 (about 0.001)
0.0001   0.0001000
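The table values follow from p = 1 - e^(-E), which a few lines of Python can recompute. The function name is made up for this sketch; `math.expm1` is used only to keep precision for very small E.

```python
import math


def p_value(expect):
    """p = 1 - exp(-E), computed as -expm1(-E) for accuracy at small E."""
    return -math.expm1(-expect)


for e in (10, 5, 2, 1, 0.1, 0.05, 0.001, 0.0001):
    print(f"{e:<8g}{p_value(e):.8f}")
```

For small E the two numbers are nearly identical, while for E between 1 and 10 the p value saturates close to 1 and the E value carries more information.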
How to interpret BLAST: E values and p values