
BLAST and Multiple Sequence Alignment Learning objectives-Learn the basics of BLAST, Psi-BLAST, and multiple sequence alignment.

Dec 21, 2015

Page 1: BLAST and Multiple Sequence Alignment Learning objectives-Learn the basics of BLAST, Psi-BLAST, and multiple sequence alignment.

BLAST and Multiple Sequence Alignment

Learning objectives-Learn the basics of BLAST, Psi-BLAST, and multiple sequence alignment

Page 2:

Which program should one use?

Most researchers use methods for determining local similarities:

Smith-Waterman (gold standard)

FASTA

BLAST

FASTA and BLAST do not find every possible alignment of the query with a database sequence. They are used because they run faster than Smith-Waterman (S-W).

Page 3:

BLAST

Basic Local Alignment Search Tool

Three phases:

1) List of high scoring words

2) Scan the sequence database

3) Extend hits

Page 4:

The threshold and word size

The program declares a hit if a word taken from the query sequence, when compared with a word in the database using a scoring matrix, has a score >= T.

This allows the word size (W) to be kept high (for speed) without sacrificing sensitivity.

If T is increased, the number of background hits is reduced and the program will run faster.

Page 5:

Phase 1: Compile a list of high-scoring words above threshold T.

Query sequence: human p53: . . . RCPHHERCSD . . .

Words derived from query sequence: RCP, CPH, PHH, HHE, …

List of words above threshold T:

Word    Scores from BLOSUM scoring matrix    Total score
RCP     5 + 9 + 7                            21
KCP     2 + 9 + 7                            18
QCP     1 + 9 + 7                            17
ECP     0 + 9 + 7                            16
. . .   . . .

Note: On the original slide, a line drawn in this table marks the threshold. Word size is 3.
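A minimal sketch of Phase 1, reproducing the scores in the table above. The score dictionary holds only the handful of BLOSUM entries that appear on the slide, and the threshold T = 17 is a hypothetical value chosen for illustration (the slide does not state where the threshold falls). The function names are likewise illustrative.

```python
# Only the substitution scores shown on the slide (query residue, candidate residue).
SLIDE_SCORES = {
    ("R", "R"): 5, ("R", "K"): 2, ("R", "Q"): 1, ("R", "E"): 0,
    ("C", "C"): 9, ("P", "P"): 7,
}

def query_words(query, word_size=3):
    """Return every overlapping word of length word_size in the query."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]

def word_score(query_word, candidate):
    """Sum the per-position substitution scores of candidate against query_word."""
    return sum(SLIDE_SCORES[(q, c)] for q, c in zip(query_word, candidate))

def words_above_threshold(query_word, candidates, threshold):
    """Keep the candidate words whose score is >= threshold (the T of the slide)."""
    scored = [(w, word_score(query_word, w)) for w in candidates]
    return [(w, s) for w, s in scored if s >= threshold]

words = query_words("RCPHHERCSD")
# T = 17 is an assumed threshold, not a value given in the source.
hits = words_above_threshold("RCP", ["RCP", "KCP", "QCP", "ECP"], threshold=17)
```

With the assumed T = 17, ECP (score 16) is dropped and the other three words are kept. A real implementation would score every possible word against a full scoring matrix rather than a short candidate list.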

Page 6:

Phase 2: Scan the database for short segments that match the list of acceptable words/scores above or equal to threshold T.

Phase 3: Extend the hits and terminate when the tabulated score drops below a cutoff score.

Query  EVVRRCPHHERCSD
       EVVRRCPHHER S+
Sbjct  EVVRRCPHHERSSE (Ch. hamster p53 O09185)

If the hit is extended far enough, the query/subject segment is called a High Scoring Segment Pair (HSP).
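The extension step can be sketched as follows. This is a simplified illustration, not BLAST's actual code: it extends only to the right, allows no gaps, uses an assumed match/mismatch scheme (+2/-1) instead of a BLOSUM matrix, and interprets the cutoff as "stop once the running score falls more than x_drop below the best score seen so far". All parameter names and values are assumptions.

```python
def extend_right(query, subject, q_pos, s_pos, word_size=3,
                 x_drop=3, match=2, mismatch=-1):
    """Extend a seed hit rightward without gaps.

    Returns (best_score, best_length), where best_length counts residues
    from the start of the seed word.
    """
    def pair_score(offset):
        same = query[q_pos + offset] == subject[s_pos + offset]
        return match if same else mismatch

    # Score the seed word itself.
    score = sum(pair_score(k) for k in range(word_size))
    best_score, best_length = score, word_size

    offset = word_size
    while q_pos + offset < len(query) and s_pos + offset < len(subject):
        score += pair_score(offset)
        if score > best_score:
            best_score, best_length = score, offset + 1
        elif best_score - score > x_drop:
            break  # score has dropped too far below the best; stop extending
        offset += 1
    return best_score, best_length

query = "EVVRRCPHHERCSD"
subject = "EVVRRCPHHERSSE"
# The seed word RCP starts at index 4 in both sequences.
best_score, best_length = extend_right(query, subject, 4, 4)
```

Extension to the left would work the same way with decreasing offsets; the two halves together give the full segment pair.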

Page 7:

What are the different BLAST programs?

blastp compares an amino acid query sequence against a protein sequence database

blastn compares a nucleotide query sequence against a nucleotide sequence database

blastx compares a nucleotide query sequence translated in all reading frames against a protein sequence database

tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames

tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that the tblastx program cannot be used with the nr database on the BLAST Web page.
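To make "all reading frames" concrete, the sketch below lists the six nucleotide frames of a DNA sequence: three offsets on the given strand and three on its reverse complement, each trimmed to whole codons. It stops short of translating codons to amino acids, and the function names are illustrative rather than part of any BLAST tool.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return dna.translate(COMPLEMENT)[::-1]

def six_frames(dna):
    """Return the six reading frames, each trimmed to a multiple of three."""
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            sub = strand[offset:]
            frames.append(sub[:len(sub) - len(sub) % 3])
    return frames

frames = six_frames("ATGGCC")
```

blastx and tblastn translate one side of the comparison this way; tblastx translates both sides, which is why it is the most expensive of the five programs.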

Page 8:

What are the different BLAST programs? (continued)

psi-blast compares a protein sequence to a protein database. Performs the comparison in an iterative fashion in order to detect homologs that are evolutionarily distant.

blast2 compares two protein or two nucleotide sequences.

Page 9:

The E value (false positive expectation value)

The Expect value (E) is a parameter that describes the number of "hits" one can "expect" to see just by chance when searching a database of a particular size. It decreases exponentially as the Similarity Score (S) increases (inverse relationship): the higher the Similarity Score, the lower the E value. Essentially, the E value describes the random background noise that exists for matches between two sequences.

The E value is used as a convenient way to create a significance threshold for reporting results. When the E value threshold is increased from the default value of 10 prior to a sequence search, a longer list with more low-scoring hits can be reported. An E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size you might expect to see 1 match with a similar score simply by chance.

Page 10:

E value (Karlin-Altschul statistics)

$E = K \cdot m \cdot n \cdot e^{-\lambda S}$

Where K is a scaling factor (constant), m is the length of the query sequence, n is the length of the database sequence, λ is the decay constant, S is the similarity score.

If S increases, E decreases exponentially.

If the decay constant increases, E decreases exponentially.

If m•n increases, the "search space" increases and there is a greater chance for a random "hit", so E increases. A larger database will increase E. However, a longer query sequence often decreases E. Why???
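The formula above can be evaluated directly. In this sketch the default values of K and λ are placeholders chosen only so the function runs; real values depend on the scoring matrix and gap penalties and are not given in the source.

```python
import math

def expect_value(score, query_len, db_len, K=0.13, lam=0.32):
    """E = K * m * n * exp(-lambda * S); K and lam defaults are illustrative."""
    return K * query_len * db_len * math.exp(-lam * score)

e_small_db = expect_value(score=50, query_len=100, db_len=1_000_000)
e_large_db = expect_value(score=50, query_len=100, db_len=2_000_000)
e_better_hit = expect_value(score=60, query_len=100, db_len=1_000_000)
```

Doubling the database length doubles E for the same score, while raising the score by 10 shrinks E by a factor of exp(10λ).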

Page 11:

Thought problem

A homolog to a query sequence resides in two databases. One is the UniProtKB/SwissProt database and the other is the PDB database. After performing a BLAST search against the UniProtKB database you obtain an E value of 1. After performing the BLAST search against the PDB database you obtain an E value of 0.0625. What are the relative sizes of the two databases?
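One way to reason about this, assuming the same query, the same alignment (so the same score S), and the same K and λ in both searches, is to take the ratio of the two E values using the Karlin-Altschul formula. Everything except the database length n then cancels:

```latex
\[
\frac{E_{\text{UniProtKB}}}{E_{\text{PDB}}}
  = \frac{K \cdot m \cdot n_{\text{UniProtKB}} \cdot e^{-\lambda S}}
         {K \cdot m \cdot n_{\text{PDB}} \cdot e^{-\lambda S}}
  = \frac{n_{\text{UniProtKB}}}{n_{\text{PDB}}}
  = \frac{1}{0.0625}
  = 16
\]
```

Under those assumptions the UniProtKB database would be about 16 times larger than the PDB database.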

Page 12:

Using BLAST to get quick answers to bioinformatics problems

Task | BLAST method | Trad. method

Predict protein function (1) | Perform blastp on PIR or Swiss-Prot database | Perform wet-lab experiment

Predict protein function (2) | Perform tblastn on NR database | Perform wet-lab experiment

Predict protein structure | Perform blastp against PDB | Structure prediction software, x-ray crystall., NMR

Page 13:

Using BLAST to get quick answers to bioinformatics problems (cont.)

Task | BLAST method | Trad. method

Locate genes in a genome | Divide genome into 2-5 kb sequences. Perform blastx against NR protein database | Run gene prediction software. Perform microarray analysis of RNAs

Find distantly related proteins | Perform psi-blast | No traditional method

Identify DNA sequence | Perform blastn | Screen genomic DNA library
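The "divide genome into 2-5 kb sequences" step in the table above can be sketched as a simple chunking function. The function name and the 5000-base default are illustrative; the source gives only the 2-5 kb range. The sketch does not handle genes that straddle a chunk boundary, which a real pipeline would likely address with overlapping chunks.

```python
def split_genome(sequence, chunk_size=5000):
    """Split a sequence into consecutive chunks, returned as (start, chunk) pairs."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, sequence[start:start + chunk_size])
        for start in range(0, len(sequence), chunk_size)
    ]

# A made-up 12 kb sequence used purely as input for the example.
chunks = split_genome("ACGT" * 3000, chunk_size=5000)
```

Each chunk would then be submitted as a separate blastx query, with the start offset kept so hits can be mapped back to genome coordinates.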

Page 14:

Filtering Repetitive Sequences

Over 50% of genomic DNA is repetitive. This is due to:

retrotransposons

ALU region

microsatellites

centromeric sequences, telomeric sequences

5' Untranslated Region of ESTs

Example of EST with simple low complexity region:

T27311

GGGTGCAGGAATTCGGCACGAGTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTC
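To illustrate what masking a low complexity region means, the sketch below replaces any window with low Shannon entropy by N. This is a toy stand-in, not the filter BLAST itself applies, and the window size and entropy cutoff are assumed values picked for this example.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Entropy in bits of the character distribution within window."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def mask_low_complexity(seq, window=12, min_entropy=1.5):
    """Replace every residue covered by a low-entropy window with 'N'."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = "N"
    return "".join(masked)

# Prefix of the EST above followed by 20 TC repeats (shortened for the example).
est = "GGGTGCAGGAATTCGGCACGAG" + "TC" * 20
masked = mask_low_complexity(est)
```

A window made only of alternating T and C has an entropy of 1 bit, so the repeat is masked, while a window with all four bases in equal proportion has 2 bits and is left alone.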

Page 15:

Filtering Repetitive Sequences and Masking

Options available for user.

Page 16:

PSI-BLAST

PSI: position-specific iterative.

A position-specific scoring matrix (PSSM) is constructed automatically from multiple HSPs of the initial BLAST search. The normal E value threshold is used.

The PSSM is used as the new scoring matrix for a second BLAST search. A low E value threshold is used (E = 0.001).

Result:

1) obtains distantly related sequences

2) finds the important residues that provide function or structure.
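A rough sketch of how a PSSM can be built from a set of aligned segments: count residues per column, add pseudocounts, and take the log-odds against a background frequency. The uniform background, the pseudocount of 1, and the absence of sequence weighting and gap handling are simplifying assumptions and do not reflect how PSI-BLAST computes its matrix.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    aligned: equal-length, ungapped segments (an assumption of this sketch).
    """
    background = 1.0 / len(AMINO_ACIDS)  # uniform background, assumed
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(len(aligned[0])):
        counts = Counter(seq[col] for seq in aligned)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def pssm_score(pssm, segment):
    """Score a segment by summing the column-specific score of each residue."""
    return sum(column[aa] for column, aa in zip(pssm, segment))

# Three toy segments; two have R at the first position and one has K.
pssm = build_pssm(["RCP", "KCP", "RCP"])
```

Because each column carries its own scores, the fully conserved C and P positions reward a match more strongly than the variable first position, which is how a PSSM highlights the residues that matter.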

Page 17:

A PSSM

Page 18:

Steps to multiple alignment

Create Alignment

Edit the alignment to ensure that regions of functional or structural similarity are preserved

Phylogenetic Analysis

Structure Analysis

Find conserved motifs to deduce function

Design of PCR primers

Page 19:

Multiple Sequence Alignment

Collection of three or more protein (or nucleic acid) sequences partially or completely aligned.

Aligned residues tend to occupy corresponding positions in the 3-D structure of each aligned protein.

Page 20:

Practical use of MSA

Helps to place protein into a group of related proteins. It will provide insight into function, structure and evolution.

Helps to detect homologs

Identifies sequencing errors

Identifies important regulatory regions in the promoters of genes.

Page 21:

Clustal W (Thompson et al., 1994)

CLUSTAL=Cluster alignment

The underlying concept is that groups of sequences are phylogenetically related. If they can be aligned then one can construct a phylogenetic tree.

Page 22:

Flowchart of computation steps in Clustal W (Thompson et al., 1994)

Pairwise alignment: calculation of distance matrix

Creation of unrooted neighbor-joining tree

Rooted NJ tree (guide tree) and calculation of sequence weights

Progressive alignment following the guide tree

Page 23:

Step 1-Pairwise alignments

Compare each sequence with each other and calculate a distance matrix.

     A     B     C
A    -
B    .87   -
C    .59   .60   -

Each number represents the number of exact matches divided by the sequence length (ignoring gaps). Thus, the higher the number, the more closely related the two sequences are.

In this matrix, sequence A is 87% identical to sequence B.

Different sequences

Page 24:

Step 1-Pairwise alignments

Compare each sequence with each other and calculate pairwise alignment scores.

human  EYSGSSEKIDLLASDPHEALICKSERVHSKSVESNIEDKIFGKTYRKKASLPNLSHVTEN  480
dog    EYSGSSEKIDLMASDPQDAFICESERVHTKPVGGNIEDKIFGKTYRRKASLPKVSHTTEV  477
mouse  GGFSSSRKTDLVTPDPHHTLMCKSGRDFSKPVEDNISDKIFGKSYQRKGSRPHLNHVTE   476

Page 25:

Step 1-Calculation of Distance Matrix

Use the Distance Matrix to create a Guide Tree to determine the "order" of the sequences.

$I = \dfrac{\text{number of identical aa's in pairwise global alignment}}{\text{total number of aa's in shortest sequence}}$

$D = 1 - I$, where D = Difference score

              1     2     3     4     5     6     7
Hbb-Hu   1    -
Hbb-Ho   2    .17   -
Hba-Hu   3    .59   .60   -
Hba-Ho   4    .59   .59   .13   -
Myg-Ph   5    .77   .77   .75   .75   -
Gib-Pe   6    .81   .82   .73   .74   .80   -
Lgb-Lu   7    .87   .86   .86   .88   .93   .90   -
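The difference score defined above (I is identical residues divided by the length of the shorter sequence, D is 1 minus I) can be computed from an existing pairwise alignment as sketched below. The sketch assumes the two sequences are already globally aligned with "-" as the gap character; producing that alignment is a separate step, and the function names are illustrative.

```python
def identity(aligned_a, aligned_b):
    """Fraction of identical residues relative to the shorter ungapped sequence."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    shortest = min(
        len(aligned_a.replace("-", "")), len(aligned_b.replace("-", ""))
    )
    return matches / shortest

def difference_score(aligned_a, aligned_b):
    """D = 1 - I, as defined on the slide."""
    return 1.0 - identity(aligned_a, aligned_b)

# Toy aligned pair: 3 identical residues, shorter sequence has 4 residues.
d = difference_score("AC-DE", "ACGDF")
```

Filling a matrix with this score for every pair of sequences gives a table of the same form as the globin example above.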

Page 26:

Step 2-Create unrooted NJ tree

[Figure: unrooted NJ tree with leaves Hba-Ho, Hba-Hu, Hbb-Ho, Hbb-Hu, Myg-Ph, Gib-Pe, Lgb-Lu]

Page 27:

Step 3-Create Rooted NJ Tree

Weight

Alignment

Order of alignment:

1 Hba-Hu vs Hba-Ho

2 Hbb-Hu vs Hbb-Ho

3 A vs B

4 Myg-Ph vs C

5 Gib-Pe vs D

6 Lgb-Lu vs E
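To show how an alignment order can fall out of a distance matrix, the sketch below repeatedly merges the two closest clusters, using the globin distances from the Step 1 slide. Note the substitution: Clustal W builds a neighbor-joining tree, whereas this sketch uses plain average-linkage clustering (UPGMA-style) because it is much shorter. It is an illustration of the idea only.

```python
from itertools import combinations

NAMES = ["Hbb-Hu", "Hbb-Ho", "Hba-Hu", "Hba-Ho", "Myg-Ph", "Gib-Pe", "Lgb-Lu"]

# Lower triangle of the difference-score matrix from the Step 1 slide.
LOWER = [
    [],
    [0.17],
    [0.59, 0.60],
    [0.59, 0.59, 0.13],
    [0.77, 0.77, 0.75, 0.75],
    [0.81, 0.82, 0.73, 0.74, 0.80],
    [0.87, 0.86, 0.86, 0.88, 0.93, 0.90],
]

def pair_distance(i, j):
    """Look up the distance between two different sequences by index."""
    if i < j:
        i, j = j, i
    return LOWER[i][j]

def cluster_distance(a, b):
    """Average of all pairwise distances between members of clusters a and b."""
    return sum(pair_distance(i, j) for i in a for j in b) / (len(a) * len(b))

def merge_order():
    """Return the successive merges as pairs of sorted name lists."""
    clusters = [(i,) for i in range(len(NAMES))]
    order = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(*pair))
        order.append((sorted(NAMES[i] for i in a), sorted(NAMES[i] for i in b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return order

order = merge_order()
```

For this particular matrix the merges come out in the same sequence as the slide's order of alignment: the two alpha globins, the two beta globins, those two groups together, then Myg-Ph, Gib-Pe, and Lgb-Lu. That agreement is not guaranteed in general, since neighbor joining and average linkage can produce different trees.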

Page 28:

Step 4-Progressive alignment

Page 29:

Step 4-Progressive alignment

Scoring during progressive alignment

Page 30:

Rules for alignment

Short stretches of 5 hydrophilic residues often indicate loop or random coil regions (not essential for structure), and therefore gap penalties are reduced for such stretches.

Gap penalties for closely related sequences are lowered compared to more distantly related sequences ("once a gap always a gap" rule). It is thought that those gaps occur in regions that do not disrupt the structure or function.

Alignments of proteins of known structure show that gaps in proteins do not occur more frequently than every eight residues. Therefore penalties for gaps increase when a gap is required within 8 residues or less of another gap. This gives a lower alignment score in that region.

A gap weight is assigned after each aa according to the frequency with which such a gap occurs after that aa in nature.
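Two of the rules above (lower penalties inside hydrophilic stretches, higher penalties close to an existing gap) can be sketched as a per-position gap-opening penalty. This is a toy illustration: the base penalty and both scaling factors are invented numbers, the hydrophilic set is the residue list given on the "Multiple Alignment Considerations" slide, and Clustal W's actual formulas are not reproduced here.

```python
HYDROPHILIC = set("GPSNQEKR")  # residue list taken from the considerations slide

def gap_open_penalties(seq, gap_positions=(), base=10.0,
                       hydrophilic_factor=0.5, near_gap_factor=2.0,
                       run=5, min_spacing=8):
    """Return a gap-opening penalty for each position of seq.

    All numeric defaults are hypothetical values chosen for illustration.
    """
    penalties = [base] * len(seq)

    # Rule 1: reduce the penalty inside runs of `run` hydrophilic residues.
    for start in range(len(seq) - run + 1):
        if all(res in HYDROPHILIC for res in seq[start:start + run]):
            for i in range(start, start + run):
                penalties[i] = base * hydrophilic_factor

    # Rule 2: raise the penalty within min_spacing residues of an existing gap.
    for i in range(len(seq)):
        if any(0 < abs(i - g) <= min_spacing for g in gap_positions):
            penalties[i] *= near_gap_factor
    return penalties

seq = "AAAGSNQKAAA"
no_gaps = gap_open_penalties(seq)
with_gap = gap_open_penalties(seq, gap_positions=[0])
```

In the example, positions 3 to 7 (GSNQK) get half the base penalty, and with a gap already at position 0 every position up to 8 residues away has its penalty doubled.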

Page 31:

Amino acid weight matrices

As we know, there are many scoring matrices that one can use depending on the relatedness of the aligned proteins.

As the alignment proceeds to longer branches, the aa scoring matrices are changed to more divergent scoring matrices. The length of the branch is used to determine which matrix to use and contributes to the alignment score.

Page 32:

Example of Sequence Alignment using Clustal W

Asterisk (*) represents identity. Colon (:) represents high similarity. Period (.) represents low similarity.

Page 33:

Multiple Alignment Considerations

Quality of guide tree. It would be good to have a set of closely related sequences in the alignment to set the pattern for more divergent sequences.

If the initial alignments have a problem, the problem is magnified in subsequent steps.

CLUSTAL W is best when aligning sequences that are related to each other over their entire lengths.

Do not use when there are variable N- and C-terminal regions.

If the protein is enriched for G, P, S, N, Q, E, K, R then these residues should be removed from the gap penalty list. (What types of residues are these?)

Reference: http://www-igbmc.u-strasbg.fr/BioInfo/ClustalW/