Biology 4900: Biocomputing

Chapter 4: BLAST

BLAST

BLAST allows a user to search a sequence (the query) against millions of sequences in the NCBI database (the target).

Global alignments (e.g., Needleman-Wunsch) would be time-consuming and computationally intensive for this amount of data.

BLAST is designed for local alignment, not global alignment. This allows for faster searches and lets BLAST match subsets of proteins (e.g., domains).

[Figure: C-terminal domain of CaM (from 3cln.pdb), with Ca2+, the F helix, and numbered residue positions labeled]

Other BLAST Programs

Blastx: Compares a nucleotide query sequence translated in all six reading frames (three per DNA strand) against a protein sequence DB.

Tblastn: Compares a protein query sequence against a nucleotide sequence DB translated in all six reading frames.

Tblastx: Compares the 6-frame translations of a nucleotide query sequence against the 6-frame translations of a nucleotide sequence database.

Pevsner, Bioinformatics and Functional Genomics, 2009

Top-strand reading frames:      5’ CAT CAA     5’ ATC AAC     5’ TCA ACT

5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’

Bottom-strand reading frames:   5’ GTG GGT     5’ TGG GTA     5’ GGG TAG
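The six reading frames shown above can be reproduced with a short sketch. This is only an illustration of how the frames arise, not how BLAST is implemented; the helper names (`reverse_complement`, `codons`, `six_frames`) and the frame labels are assumptions made for this example.

```python
# Illustrative sketch: split a DNA sequence into codons for all six reading frames.
# Helper names and the "+1".."-3" frame labels are hypothetical, chosen for this example.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def codons(dna, offset):
    """Split dna into complete codons, starting at the given offset (0, 1, or 2)."""
    usable = len(dna) - (len(dna) - offset) % 3  # drop a trailing partial codon
    return [dna[i:i + 3] for i in range(offset, usable, 3)]


def six_frames(dna):
    """Return the codons of the three top-strand and three bottom-strand frames."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = codons(dna, offset)
        frames[f"-{offset + 1}"] = codons(rc, offset)
    return frames


TOP = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"

for name, frame in sorted(six_frames(TOP).items()):
    # Print the first two codons of each frame, as on the slide.
    print(name, " ".join(frame[:2]))
```

Each frame's codons would then be translated to amino acids before comparison against a protein database (blastx) or against translated database sequences (tblastx).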

Choose the BLAST program

Program    Input      Database   Comparisons
blastn     DNA        DNA        1
blastp     protein    protein    1
blastx     DNA        protein    6
tblastn    protein    DNA        6
tblastx    DNA        DNA        36

BLAST (Altschul 1990)

BLAST uses a pre-indexed database of ‘words’ for all proteins in the database (similar to FASTA).

A word is defined as a short sequence of letters.
For blastp, the default word size (W) is 3 letters.
For blastn, the default word size (W) is 11 letters.
For MegaBLAST (nucleotide), the default word size (W) is 28 letters.

When you run a query, BLAST breaks your query sequence into a series of words, and generates neighborhood words, as in the following example:

http://www.incogen.com/bioinfo_tutorials/Bioinfo-Lecture_2-pairwise-align.html

For sequence …FSGTWYA…

A list of words (w=3) is:

Words:                FSG  SGT  GTW  TWY  WYA

Neighborhood words:   YSG  TGT  ATW  SWY  WFA
                      FTG  SVT  GSW  TWF  WYS
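Breaking a query into overlapping words is a sliding window of width W. The sketch below is a minimal illustration; the function name `query_words` is an assumption for this example and is not part of any BLAST tool.

```python
# Illustrative sketch: list every overlapping word of length w in a query sequence.
# The function name is hypothetical.

def query_words(sequence, w=3):
    """Return all overlapping words of length w, in order of position."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


print(query_words("FSGTWYA", w=3))
```

A sequence of length L yields L - W + 1 words, so the 7-letter fragment above gives the 5 words listed on the slide.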

Why use BLAST?

• BLAST searching is fundamental to understanding the relatedness of any favorite query sequence to other known proteins or DNA sequences.

Applications include
• identifying orthologs and paralogs
• discovering new genes or proteins
• discovering variants of genes or proteins
• investigating expressed sequence tags (ESTs)
• exploring protein structure and function

Four steps to becoming a Master BLASTer

http://mestadelsbilder.wordpress.com/2011/10/23/master-blaster/

(1) Choose the sequence (query)

(2) Select the BLAST program

(3) Choose the database to search

(4) Choose optional parameters (may leave as default params the first time)

Then click “BLAST”

Step 1: Choose your sequence

Sequence can be input in FASTA format as text or by file upload, or as an accession number

Example of the FASTA format for a BLAST query

Note link here
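A FASTA record is a header line beginning with ">" followed by one or more lines of sequence. As a hedged illustration, the sketch below parses such text; the record uses the 3CLN chain A sequence that appears in the alignment output later in this chapter, with a shortened header, and `parse_fasta` is a hypothetical helper written for this example.

```python
# Illustrative sketch: a minimal FASTA reader.
# parse_fasta is a hypothetical helper, not part of BLAST or any library.

def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


EXAMPLE = """>3CLN:A|PDBID|CHAIN|SEQUENCE
ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN
GTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEE
VDEMIREANIDGDGQVNYEEFVQMMTAK
"""

for name, seq in parse_fasta(EXAMPLE):
    print(name, len(seq))
```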

Step 2: Choose the BLAST program

Blastn and blastp are the main programs you will want to use

Step 3: Choose the database to search

nr = non-redundant (most general database)

dbest = database of expressed sequence tags

dbsts = database of sequence tag sites

gss = genomic survey sequences

protein databases

nucleotide databases

Step 4a: Select optional search parameters

Entrez!

algorithm

organism

Step 4a: optional blastp search parameters

Filter, mask

Scoring matrix

Word size

Expect

Right. So, what are these?

Step 4a: optional blastn search parameters

Filter, mask

Match/mismatch scores

Word size

Expect

Algorithm Parameters: Expect

• This setting specifies the statistical significance threshold for reporting matches against database sequences.

• The default value (10) means that 10 such matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990).

• If the statistical significance ascribed to a match is greater than the EXPECT threshold, the match will not be reported.

• Lower EXPECT thresholds (e.g., set expect to 6) are more stringent, leading to fewer chance matches being reported.

http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#Matrix/

Algorithm Parameters: Word Size

• BLAST is a heuristic algorithm (makes approximations) that works by finding word-matches between the query and database sequences. This process finds "hot-spots" that BLAST can then potentially extend into full-blown alignments.

• For nucleotide-nucleotide searches (i.e., "blastn") an exact match of the entire word is required before an extension is initiated, so that one normally regulates the sensitivity and speed of the search by increasing or decreasing the word-size.

• For other BLAST searches non-exact word matches are taken into account based upon the similarity between words. The amount of similarity can be varied so one normally uses just the word-sizes 2 and 3 for these searches.


KENFDKARFSGTWYAMAKKDPEG 50 RBP (query)

MKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin (hit)

Hit! Extend in both directions.

Algorithm Parameters: Filters

• The Low-complexity filter option masks parts of the query sequence that may represent very common, non-complex subsets of sequence.

• May not be very useful.

• The "Species-specific repeats for:" filter option is designed to ignore species-specific genomic repeats in very long sequences.


Algorithm Parameters: Masks

• The Mask for lookup table only option masks only for purposes of constructing the lookup table used by BLAST so that no hits are found based upon low-complexity sequence or repeats (if repeat filter is checked).

• The BLAST extensions are performed without masking and so they can be extended through low-complexity sequence.

• The Mask lower case letters option lets you cut and paste a FASTA sequence in upper case characters and denote areas you would like filtered with lower case. This allows you to customize what is filtered from the sequence during the comparison to the BLAST databases.


Ex. agvgpADEEWGYilmaagDDEEE

The parts of the sequence in lower-case (LC) letters are masked, or ignored.
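The effect of lower-case masking on the example above can be sketched as follows. This is an illustration only: the function name is hypothetical, and using "X" as the replacement character for masked protein residues is an assumption of this sketch.

```python
# Illustrative sketch: hide lower-case (user-masked) residues from the comparison.
# The function name and the choice of "X" as the mask character are assumptions.

def mask_lowercase(sequence, mask_char="X"):
    """Replace every lower-case letter with mask_char; keep upper-case residues."""
    return "".join(mask_char if residue.islower() else residue for residue in sequence)


print(mask_lowercase("agvgpADEEWGYilmaagDDEEE"))
```

Only the upper-case stretches (ADEEWGY and DDEEE) remain available for matching.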

Algorithm Parameters: Match/Mismatch Scores

• Many nucleotide searches use a simple scoring system that consists of a "reward" for a match and a "penalty" for a mismatch.

• The (absolute) reward/penalty ratio should be increased as one looks at more divergent sequences.
– A ratio of 0.33 (1/-3) is appropriate for sequences that are about 99% conserved
– A ratio of 0.5 (1/-2) is best for sequences that are 95% conserved
– A ratio of about one (1/-1) is best for sequences that are 75% conserved

States DJ, Gish W, and Altschul SF (1991)
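To see how the reward/penalty choice changes a score, the sketch below scores two equal-length, ungapped nucleotide sequences. The function name and the short test sequences are invented for illustration; the (1, -3), (1, -2), and (1, -1) pairs are the ones quoted above.

```python
# Illustrative sketch: score an ungapped nucleotide alignment with a match reward
# and a mismatch penalty. Function name and example sequences are hypothetical.

def nucleotide_score(seq_a, seq_b, reward=1, penalty=-3):
    """Sum reward for each matching position and penalty for each mismatch."""
    if len(seq_a) != len(seq_b):
        raise ValueError("ungapped scoring needs sequences of equal length")
    return sum(reward if a == b else penalty for a, b in zip(seq_a, seq_b))


a, b = "ACGTACGTAC", "ACGTTCGTAC"  # one mismatch in ten positions
for reward, penalty in [(1, -3), (1, -2), (1, -1)]:
    print(reward, penalty, nucleotide_score(a, b, reward, penalty))
```

With a harsh penalty a single mismatch cancels several matches, which suits highly conserved sequences; a milder penalty tolerates the more frequent mismatches of divergent sequences.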

Algorithm Parameters: Matrices

• A key element in evaluating the quality of a pairwise sequence alignment is the "substitution matrix", which assigns a score for aligning any possible pair of residues.

• Some matrices are good for comparing sequences that diverge very little, while other matrices are good for comparing sequences that diverge a lot.

• The BLOSUM-62 matrix is among the best for detecting most weak protein similarities.

• The BLOSUM-45 matrix may be better for particularly long and weak alignments.

• The older PAM matrices may be better for short alignments, as these need to have a higher percentage of matching residues to exceed background noise (be detectable beyond random chance).


Matrices and Gap Costs

Query Length    Substitution Matrix    Gap Costs
<35             PAM-30                 (9,1)
35-50           PAM-70                 (10,1)
50-85           BLOSUM-80              (10,1)
>85             BLOSUM-62              (10,1)

The raw score of an alignment is the sum of the scores for aligning pairs of residues and the scores for gaps. Gapped BLAST and PSI-BLAST use "affine gap costs" which charge the score -a for the existence of a gap, and the score -b for each residue in the gap. Thus a gap of k residues receives a total score of -(a+bk); specifically, a gap of length 1 receives the score -(a+b).

Your total raw score for the alignment is reduced when you introduce gaps into the query sequence.

Calculate the score in BLOSUM-62 for a gap with 7 residues…
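One way to work the exercise, assuming the (10,1) gap costs listed for BLOSUM-62 in the table above (existence cost a = 10, per-residue cost b = 1), is sketched below. The function name is hypothetical.

```python
# Illustrative sketch of the affine gap formula -(a + b*k).
# Assumes gap costs (a, b) = (10, 1), the values listed for BLOSUM-62 in the table.

def affine_gap_score(k, a, b):
    """Score contribution of one gap of k residues: -(a + b*k)."""
    return -(a + b * k)


print(affine_gap_score(7, a=10, b=1))  # gap of 7 residues
print(affine_gap_score(1, a=10, b=1))  # gap of length 1 scores -(a + b)
```

A 7-residue gap therefore costs 17 under these settings, compared with 11 for a single-residue gap: opening a gap is expensive, lengthening it is cheap.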


Threshold T = 11

Word    Letter scores    Total score

Neighborhood word hit > threshold (T)
GTW     6, 5, 11         22
GSW     6, 1, 11         18
ATW     0, 5, 11         16
NTW     0, 5, 11         16
GTY     6, 5, 2          13

Neighborhood word hit < threshold (T)
ANT     1, 0, -5         -4

Neighborhood words are similar to the words constructed from the query, with one or more mismatched symbols.

These are given scores based on the matrix that you are using (for BLAST, the default matrix is BLOSUM62).

Neighborhood words that score above a user-defined threshold are also searched.
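The thresholding step can be sketched with the per-letter scores from the table above (query word GTW, T = 11). The sketch does not look anything up in a substitution matrix; it only reuses the letter scores printed on the slide, and the names are hypothetical. It keeps words scoring at least T.

```python
# Illustrative sketch: keep neighborhood words whose total score reaches T.
# Letter scores are copied from the slide's table; names are hypothetical.

T = 11

LETTER_SCORES = {
    "GTW": (6, 5, 11),
    "GSW": (6, 1, 11),
    "ATW": (0, 5, 11),
    "NTW": (0, 5, 11),
    "GTY": (6, 5, 2),
    "ANT": (1, 0, -5),
}


def total_scores(letter_scores):
    """Sum the per-letter scores of each candidate word."""
    return {word: sum(scores) for word, scores in letter_scores.items()}


def words_at_or_above(letter_scores, threshold):
    """Return the words whose total score is at least the threshold."""
    return [w for w, total in total_scores(letter_scores).items() if total >= threshold]


print(total_scores(LETTER_SCORES))
print(words_at_or_above(LETTER_SCORES, T))
```

Raising T shortens the list of words searched, which is why a higher T is faster but less sensitive.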


BLAST then searches the entire database for the search words and neighborhood words.

Once a match is found, BLAST then extends the search in both directions of the sequence, scoring each subsequent match, until the score drops below some cutoff value.

KENFDKARFSGTWYAMAKKDPEG 50 RBP (query)

MKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin (hit)

Hit! Extend in both directions.
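The extension step can be sketched as an ungapped extension that stops once the running score has fallen too far below the best score seen. This is a simplified illustration under stated assumptions: it scores +1 for identity and -1 otherwise instead of using a substitution matrix, the drop-off value of 2 is arbitrary, and all function names are hypothetical.

```python
# Illustrative sketch: ungapped extension of a word hit in both directions,
# stopping when the running score drops more than x_drop below the best so far.
# Identity scoring (+1/-1), x_drop=2, and all names are assumptions of this sketch.

def identity_score(a, b):
    return 1 if a == b else -1


def extend_one_way(query, subject, q_pos, s_pos, step, score_pair, x_drop):
    """Extend from (q_pos, s_pos) in direction step; return (best gain, best length)."""
    running = best_gain = best_length = length = 0
    while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
        running += score_pair(query[q_pos], subject[s_pos])
        length += 1
        if running > best_gain:
            best_gain, best_length = running, length
        elif best_gain - running > x_drop:
            break  # score fell too far below the best; stop extending
        q_pos += step
        s_pos += step
    return best_gain, best_length


def extend_hit(query, subject, q_start, s_start, word_len, score_pair, x_drop):
    """Return (score, query_begin, query_end) of the extended segment (end exclusive)."""
    seed = sum(score_pair(query[q_start + i], subject[s_start + i]) for i in range(word_len))
    right_gain, right_len = extend_one_way(
        query, subject, q_start + word_len, s_start + word_len, +1, score_pair, x_drop)
    left_gain, left_len = extend_one_way(
        query, subject, q_start - 1, s_start - 1, -1, score_pair, x_drop)
    return seed + left_gain + right_gain, q_start - left_len, q_start + word_len + right_len


QUERY = "KENFDKARFSGTWYAMAKKDPEG"  # RBP fragment from the slide
HIT = "MKGLDIQKVAGTWYSLAMAASD"     # lactoglobulin fragment from the slide

score, begin, end = extend_hit(QUERY, HIT, 10, 10, 3, identity_score, x_drop=2)
print(score, QUERY[begin:end])
```

With a real substitution matrix, conservative replacements would earn positive scores and the extension would typically run further than under identity scoring.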

BLAST (1997)

In a 1997 refinement of BLAST, two independent hits are required.

The hits must occur in close proximity to each other.

With this modification, only 1/7 as many extensions occur, greatly reducing the time required for a search.

Changing BLAST Input Parameters

Increasing W or T will increase speed, but will result in loss of sensitivity (i.e., you will miss some matches).

The expect value (E-value) can be changed in order to limit the number of hits to the most significant ones.
Lower E-value = better hit.
The E-value depends on the length of the query sequence and the size of the database.
Example: an alignment with an E-value of 0.05 means that about 0.05 alignments scoring at least this well are expected by chance alone (roughly a 5 in 100 chance).

BLAST Output from DB Search

Graphic Summary includes conserved domains, when applicable.

[Figure: CaM domain structure with Ca2+, the F helix, and numbered residue positions labeled]

BLAST Output from DB Search

Graphic Summary includes the distribution of BLAST hits, color coded by bit score. A higher score is related to higher sequence identity.

High scores, low E values

BLAST search output: tabular output

BLAST search output: alignment output

BLAST output includes an evolutionary tree view

Run 3cln to observe tree view options

Pairwise Alignment with Dot Plots

>lcl|24241 3CLN:A|PDBID|CHAIN|SEQUENCE
Length=148

Score = 268 bits (684), Expect = 3e-97, Method: Compositional matrix adjust.
Identities = 130/148 (88%), Positives = 143/148 (97%), Gaps = 0/148 (0%)

Query  1    AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN  60
            A+QLTEEQIAEFKEAF+LFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN
Sbjct  1    ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN  60

Query  61   GTIDFPEFLSLMARKMKEQDSEEELIEAFKVFDRDGNGLISAAELRHVMTNLGEKLTDDE  120
            GTIDFPEFL++MARKMK+ DSEEE+ EAF+VFD+DGNG ISAAELRHVMTNLGEKLTD+E
Sbjct  61   GTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEE  120

Query  121  VDEMIREADIDGDGHINYEEFVRMMVSK  148
            VDEMIREA+IDGDG +NYEEFV+MM +K
Sbjct  121  VDEMIREANIDGDGQVNYEEFVQMMTAK  148

[Dot plot: 3CLN vs. 1EXR]

Pairwise Alignment with Dot Plots

Score = 30.0 bits (66), Expect = 1e-06, Method: Compositional matrix adjust.
Identities = 14/51 (27%), Positives = 26/51 (51%), Gaps = 3/51 (6%)

Query  62   TIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNL  112
            + D  +F  M+  K K  D   ++++ F + DKD +G+I   EL  ++
Sbjct  23   SFDHKKFFQMVGLKKKSAD---DVKKVFHILDKDKSGFIEEDELGSILKGF  70

Score = 25.8 bits (55), Expect = 3e-05, Method: Compositional matrix adjust.
Identities = 11/40 (28%), Positives = 21/40 (53%), Gaps = 0/40 (0%)

Query  4    LTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNP  43
            L ++   + K+ F + DKD  G I   ELG++++    +
Sbjct  35   LKKKSADDVKKVFHILDKDKSGFIEEDELGSILKGFSSDA  74

[Dot plots: 3CLN vs. 1RTP]

Statistics of Local Alignments

• For local pairwise alignments, the best approach to determining statistical significance is to estimate an expect value (E value).

• The expect value E is the number of alignments with scores greater than or equal to score S (your score) that are expected to occur by chance in a database search.

• A score with an associated E value of 10^-3 means that this particular score may occur 1 time out of 1000 alignments by chance.

• An E value is related to a probability value p.

• The key equation describing an E value is:

• $E = Kmn\,e^{-\lambda S}$

Pevsner, Bioinformatics and Functional Genomics, 2009

$E = Kmn\,e^{-\lambda S}$

• This equation is derived from a description of the extreme value distribution

• S = the score

• E = the expect value = the number of high-scoring segment pairs (HSPs) expected to occur with a score of at least S

• m, n = the lengths of the two sequences

• λ, K = Karlin-Altschul statistics
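The equation can be evaluated directly once the parameters are known. The sketch below is a minimal illustration; the function name and every numeric value in the usage lines are placeholders chosen for this example, not parameters reported by any particular search.

```python
# Illustrative sketch: E = K * m * n * exp(-lambda * S).
# Function name and all numeric inputs below are hypothetical placeholders.
import math


def expect_value(S, m, n, K, lam):
    """Expected number of chance HSPs scoring at least S."""
    return K * m * n * math.exp(-lam * S)


# Placeholder inputs: query length 150, database length 1e6, K = 0.1, lambda = 0.3.
for raw_score in (40, 60, 80):
    print(raw_score, expect_value(raw_score, m=150, n=1_000_000, K=0.1, lam=0.3))
```

Raising S lowers E exponentially, while doubling the database length n doubles E.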

Some properties of the equation $E = Kmn\,e^{-\lambda S}$

• The value of E decreases exponentially with increasing S (higher S values correspond to better alignments). Very high scores correspond to very low E values.

• The expected score for aligning a random pair of residues must be negative! Otherwise, long random alignments would acquire high scores.

• Parameter K is a scaling factor for the search space (database); λ is a scaling factor for the scoring system.

• For E=1, one match with a similar score is expected to occur by chance. For a very much larger or smaller database, you would expect E to vary accordingly.

From raw scores to bit scores

• There are two kinds of scores: raw scores (calculated from a substitution matrix) and bit scores (normalized scores)

• Bit scores are comparable between different searches because they are normalized to account for the use of different scoring matrices and different database sizes

S’ = bit score: $S' = \frac{\lambda S - \ln K}{\ln 2}$

The E value corresponding to a given bit score is: $E = mn\,2^{-S'}$

Bit scores allow you to compare results between different database searches, even using different scoring matrices.
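The conversion can be checked against the raw and bit scores printed in the alignment output in this chapter: 684 raw and 268 bits, 66 raw and 30.0 bits, 55 raw and 25.8 bits. The sketch assumes λ = 0.267 and K = 0.041, values commonly reported for gapped BLOSUM-62 searches; these are not stated in the slides, so treat them as an assumption. Function names are hypothetical.

```python
# Illustrative sketch: raw score -> bit score, and E from either form.
# LAMBDA and K are assumed values (commonly reported for gapped BLOSUM-62);
# they are not given in the slides. Function names are hypothetical.
import math

LAMBDA = 0.267
K = 0.041


def bit_score(raw, lam=LAMBDA, k=K):
    """S' = (lambda * S - ln K) / ln 2"""
    return (lam * raw - math.log(k)) / math.log(2)


def evalue_from_raw(raw, m, n, lam=LAMBDA, k=K):
    """E = K * m * n * exp(-lambda * S)"""
    return k * m * n * math.exp(-lam * raw)


def evalue_from_bits(bits, m, n):
    """E = m * n * 2 ** (-S')"""
    return m * n * 2.0 ** (-bits)


for raw in (684, 66, 55):
    print(raw, round(bit_score(raw), 1))
```

Because K and λ are folded into S’, the bit-score form of the E value needs only the search-space size mn, and both forms give the same E for the same alignment.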

How to interpret BLAST: E values and p values

The expect value E is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search. A p value is a different way of representing the significance of an alignment.

$p = 1 - e^{-E}$

Very small E values are very similar to p values. E values of about 1 to 10 are far easier to interpret than corresponding p values.

E         p
10        0.99995460
5         0.99326205
2         0.86466472
1         0.63212056
0.1       0.09516258 (about 0.1)
0.05      0.04877058 (about 0.05)
0.001     0.00099950 (about 0.001)
0.0001    0.0001000
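The E-to-p conversion tabulated in this section follows from p = 1 - e^(-E) and can be recomputed with a short sketch. The function name is hypothetical; `math.expm1` is used only to keep precision for very small E.

```python
# Illustrative sketch: convert an E value to a p value with p = 1 - exp(-E).
# The function name is hypothetical.
import math


def p_from_e(e_value):
    """Probability of at least one chance alignment, given E expected."""
    return -math.expm1(-e_value)  # equals 1 - exp(-E), accurate for tiny E


for e in (10, 5, 2, 1, 0.1, 0.05, 0.001, 0.0001):
    print(f"{e:<8} {p_from_e(e):.8f}")
```

For E below about 0.1 the two values are nearly identical, which is why small E values can be read as if they were p values.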
