Basic bioinformatics concepts, databases and tools
Module 2
Searching for similar sequences
Joachim Jacob
http://www.bits.vib.be
Updated February 2012
http://dl.dropbox.com/u/18352887/BITS_training_material/Link%20to%20mod2-intro_H1_2012_SimSearch.pdf
In this module, we will look into sequence similarity to get and analyze sequences
Detecting sequence similarity is a cornerstone analysis in bioinformatics
Why would we like to detect similar sequences?
1. Searching in sequence databases for similar sequences
2. From a high-throughput experiment, every read needs to be 'aligned' to a genomic reference sequence, or the reads need to be aligned to each other (assembly)
3. Elucidation of functionality by detecting sites of conservation (sequence parts that resemble each other more than would be expected)
4. Phylogeny is built upon comparison of multiple sequences
How to search for sequence similarity
- in sequence databases, via BLAST or FASTA
- to compare two sequences in detail: pairwise sequence comparison
- to compare multiple sequences at once: multiple sequence alignment, de novo assembly
Methods can be categorized into:
Optimal/exhaustive – heuristic – Graphical
Comparing sequences can be classified into 'one to one', 'one to many', or 'many to many'
One to many
One to one
many to many
Conceptualizing the source of sequence similarity
Sequences can be similar because ...
they are derived from evolutionarily related organisms
they evolve in similar conditions: convergent evolution
So the first question we have to answer is: what is similar? How do we measure similarity?
Similar?
The source of sequence similarity
Summary of the changes that really occurred
Let's assume this toy example: a short sequence mutating over time (without insertions or deletions occurring).
So taking the most divergent sequences (the first and the last), the only correct alignment for those two, regarding their history, is:
KLRMWILVATAEIDD
KPRMCILVAIADIRD
But we usually don't have all intermediate sequences: only the first and the last. How to determine what is the correct alignment?
In addition, multiple changes can have happened at one location over time
[Figure: the two sequences slid over each other at different offsets]
Many possibilities exist to align them: slide the sequences over each other. One of those positions will have the highest number of identical residues, called matches (green).
In this example, we claim 'we have a match' if we see an identical residue at the same position in both sequences.
The identity matrix summarizes this scoring system, listing all residue combinations in a table
Residue  A  C  Y  W
A        1  0  0  0
C        0  1  0  0
Y        0  0  1  0
W        0  0  0  1
1 = match, 0 = mismatch
Substitution or scoring matrices provide a means to determine similarity in an objective way
Such matrices are called substitution or scoring matrices. They are used to calculate a score for every pair of aligned amino acids (AA) in aligned sequences, in order to have a measure for sequence similarity.
Overlap of 10 residues:
     KLRMWILVATAEIDD
KPRMCILVAIADIRD
Score: 0 1 0 0 0 0 0 0 0 0
Sum of the scores: 1
Overlap of 12 residues:
KLRMWILVATAEIDD
   KPRMCILVAIADIRD
Score: 0 0 0 0 0 0 0 0 0 1 0 1
Sum of the scores: 2
Full overlap of 15 residues:
KLRMWILVATAEIDD
KPRMCILVAIADIRD
Score: 1 0 1 1 0 1 1 1 1 0 1 0 1 0 1
Sum of the scores: 10
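The sliding and identity scoring described above can be sketched in a few lines of Python. This is only an illustration: the function names and the offset convention (how far the second sequence is shifted relative to the first) are assumptions made for this example, not part of the original slides.

```python
def identity_scores(top, bottom, offset):
    """Identity-matrix scores (1 = match, 0 = mismatch) for the overlap
    when `bottom` starts `offset` positions after `top`.
    A negative offset means `bottom` starts before `top`."""
    scores = []
    for i, residue in enumerate(top):
        j = i - offset  # position in `bottom` that faces position i of `top`
        if 0 <= j < len(bottom):
            scores.append(1 if residue == bottom[j] else 0)
    return scores


def best_offset(top, bottom):
    """Try every offset with at least one overlapping residue and
    return the one with the highest sum of identity scores."""
    offsets = range(-(len(bottom) - 1), len(top))
    return max(offsets, key=lambda off: sum(identity_scores(top, bottom, off)))


first = "KLRMWILVATAEIDD"
last = "KPRMCILVAIADIRD"

# Three of the possible positions, then the best one overall.
for offset in (-5, 3, 0):
    scores = identity_scores(first, last, offset)
    print(offset, scores, "sum:", sum(scores))
print("best offset:", best_offset(first, last))
```

Trying every offset is cheap for two short sequences, but it never introduces gaps, which is why the later slides move on to gap penalties and dynamic programming.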
More complex substitution matrices are more meaningful and more sensitive for detecting similarity
The two most popular are PAM and BLOSUM. Every pair of aligned residues gets a score, based on the matrix. E.g. an A-A pair gets score 2 (PAM120) or 4 (BLOSUM62). An F-G pair gets -5 or -3.
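As a rough illustration of scoring with such a matrix, the sketch below sums BLOSUM62 scores over the toy alignment. The dictionary holds only the handful of BLOSUM62 entries needed here, typed in by hand for this example (an assumption to check against a full copy of the matrix), and the function names are hypothetical.

```python
# A few BLOSUM62 entries, typed in by hand for this illustration only.
# Keys are alphabetically sorted residue pairs, so (A, B) also covers (B, A).
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("D", "D"): 6, ("I", "I"): 4, ("K", "K"): 5,
    ("L", "L"): 4, ("M", "M"): 5, ("R", "R"): 5, ("V", "V"): 4,
    ("D", "E"): 2, ("I", "T"): -1, ("C", "W"): -2, ("D", "R"): -2,
    ("L", "P"): -3, ("F", "G"): -3,
}


def pair_score(a, b, matrix=BLOSUM62_SUBSET):
    """Look up the score of one aligned residue pair (order does not matter)."""
    return matrix[tuple(sorted((a, b)))]


def alignment_score(seq1, seq2, matrix=BLOSUM62_SUBSET):
    """Sum the matrix scores over an ungapped alignment of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped alignment needs sequences of equal length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))


print(pair_score("A", "A"), pair_score("F", "G"))
print(alignment_score("KLRMWILVATAEIDD", "KPRMCILVAIADIRD"))
```

Unlike the identity matrix, the E-D pair now contributes a positive score, because the two residues have similar properties.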
The substitution matrices capture the similarity in properties between residues
From Livingstone, C. D. and Barton, G. J. (1993),"Protein Sequence Alignments: A Strategy for the Hierarchical Analysis of Residue Conservation", Comp. Appl. Bio. Sci., 9, 745-756.
A matrix does not capture insertions and deletions: penalties are given to deal with them
When two sequences are aligned, the relation between two aligned residues can only be one of the following:
- identity
- mismatch (substitution (DNA) or similarity level (protein))
- gap (insertion/deletion)
The two parts of the gap penalty: a higher penalty for opening a gap, a lower one for extending it
Score from substitution matrix
Gap penalty
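A minimal sketch of how the two parts combine when scoring an alignment that already contains gaps ('-'). The penalty values (-10 to open, -1 to extend) and the function name are assumptions for illustration. The first gap position is charged the opening penalty and every following one the extension penalty; other tools may use a slightly different convention.

```python
def score_gapped_alignment(row1, row2, substitution, gap_open=-10, gap_extend=-1):
    """Score two aligned rows of equal length that may contain '-' gaps.

    `substitution(a, b)` returns the matrix score of a residue pair.
    Simplification: a gap in one row directly followed by a gap in the
    other row is treated as one continuing gap.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    in_gap = False
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            total += substitution(a, b)
            in_gap = False
    return total


def identity(a, b):
    """Identity matrix: 1 for a match, 0 for a mismatch."""
    return 1 if a == b else 0


# Eight matches, one gap of length two: 8 + (-10) + (-1)
print(score_gapped_alignment("KLRMW--ILV", "KLRMWAAILV", identity))
```

Making the extension cheaper than the opening favours a few long gaps over many short ones.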
How to search for sequence similarity
- In sequence databases: BLAST or FASTA
- to compare two sequences in detail: pairwise sequence comparison
- to compare multiple sequences: multiple sequence alignment
3 methods exist:
Graphical – optimal/exhaustive – heuristic
Substitution matrices are used in many algorithms to detect sequence similarity
One to many
One to one
many to many
Pairwise sequence comparison – one to one
To create an alignment between two sequences
- Manually (?)
- Two sequences (= pairwise alignment): optimal alignment through 'dynamic programming'
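To make 'dynamic programming' concrete, here is a small sketch of a global alignment in the style of Needleman and Wunsch. The scoring values (match 1, mismatch -1, and a single linear gap penalty of -2 rather than separate open/extend penalties) are assumptions chosen to keep the example short; real tools use a substitution matrix and affine gap penalties.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment of two sequences with a linear gap penalty.
    Returns (score, aligned_seq1, aligned_seq2)."""
    n, m = len(seq1), len(seq2)

    def sub(i, j):
        return match if seq1[i - 1] == seq2[j - 1] else mismatch

    # score[i][j] = best score for aligning seq1[:i] with seq2[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in seq2
                score[i][j - 1] + gap,            # gap in seq1
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    row1, row2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row1.append(seq1[i - 1])
            row2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row1.append(seq1[i - 1])
            row2.append("-")
            i -= 1
        else:
            row1.append("-")
            row2.append(seq2[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row1)), "".join(reversed(row2))


print(needleman_wunsch("ACGT", "AGT"))
print(needleman_wunsch("KLRMWILVATAEIDD", "KPRMCILVAIADIRD"))
```

The table has (n+1) x (m+1) cells, so time and memory grow with the product of the two lengths. That is fine for one pair, but it is the reason database searches need a faster approach.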
A graphical method: dot plots can be made to rapidly identify regions with similar sequence
The parameters of a dot plot (which uses the identity matrix) are the word size (e.g. 3 residues per word) and the threshold (the percentage of positions within a word that must be identical). This is very convenient for large molecules, e.g. chromosomes
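A dot plot with these two parameters can be sketched as follows. The function names and the text rendering are assumptions for illustration; programs such as the ones listed next draw a proper graph instead.

```python
def dotplot(seq1, seq2, word_size=3, threshold=1.0):
    """Return the (i, j) positions where the word of seq1 starting at i and
    the word of seq2 starting at j have a fraction of identical positions
    of at least `threshold` (1.0 means the words must be identical)."""
    dots = []
    for i in range(len(seq1) - word_size + 1):
        for j in range(len(seq2) - word_size + 1):
            word1 = seq1[i:i + word_size]
            word2 = seq2[j:j + word_size]
            identities = sum(a == b for a, b in zip(word1, word2))
            if identities / word_size >= threshold:
                dots.append((i, j))
    return dots


def render(seq1, seq2, dots):
    """Draw the plot as text: seq1 runs down the rows, seq2 across the columns."""
    marked = set(dots)
    lines = ["  " + seq2]
    for i, residue in enumerate(seq1):
        row = "".join("*" if (i, j) in marked else "." for j in range(len(seq2)))
        lines.append(residue + " " + row)
    return "\n".join(lines)


first = "KLRMWILVATAEIDD"
last = "KPRMCILVAIADIRD"
print(render(first, last, dotplot(first, last, word_size=3, threshold=1.0)))
print(render(first, last, dotplot(first, last, word_size=3, threshold=2 / 3)))
```

Lowering the threshold adds dots (more sensitivity, more noise). Similar regions show up as diagonal runs.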
Software for making dotplots
EMBOSS contains the following programs:
– dottup : word comparison
– dotmatcher : window/threshold comparison
– polydot : word comparison, makes n*n dotplots in one graph
Dotter – developed by Erik Sonnhammer and Richard Durbin (U. Stockholm, Sweden)
Dotlet – (Java applet) at the Swiss Institute of Bioinformatics
Gepard – (Munich Information Center for Protein Sequences, Germany) : with a heuristic for speeding up computation, for comparing very long sequences
How to search for sequence similarity
- In sequence databases: BLAST or FASTA
- to compare two sequences in detail: pairwise sequence comparison
- to compare multiple sequences: multiple sequence alignment
3 methods exist:
Graphical – optimal/exhaustive – heuristic
Divided in 'one to one', 'one to many', or 'many to many' sequence comparisons
One to many
One to one
many to many
Searching sequence databases is done through a little trick
Problem
find me all similar sequences to a query sequence in a database.
(Find me the position of many short reads in a genome)
Bottleneck:
we cannot compute an optimal alignment against every sequence in the database and then determine which is best (~MSA). This is too time-consuming and only practicable on special hardware (a parallel computer or a computer cluster)
"Heuristic" algorithm : gain of speed at the expense of some loss in sensitivity
• BLAST (developed by S. Altschul et al. at NCBI)
• FASTA (developed by W. R. Pearson at U. of Virginia)
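The 'little trick' behind such heuristics can be sketched as word seeding: index all short words of the database once, then only look at database sequences that share a word with the query. The sketch below is a simplification under stated assumptions: it uses exact words only and stops at the seeds, whereas the real programs also extend the seeds into alignments (and protein BLAST also accepts similar, not just identical, words). The function names and the two-sequence 'database' are hypothetical.

```python
from collections import defaultdict


def build_word_index(database, word_size):
    """Map every word of length `word_size` to the places it occurs:
    word -> list of (sequence name, position)."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((name, pos))
    return index


def seed_hits(query, index, word_size):
    """For each database sequence sharing at least one word with the query,
    list the (query position, database position) pairs of the shared words."""
    hits = defaultdict(list)
    for qpos in range(len(query) - word_size + 1):
        for name, dpos in index.get(query[qpos:qpos + word_size], ()):
            hits[name].append((qpos, dpos))
    return dict(hits)


# Hypothetical toy database
database = {
    "seqA": "KPRMCILVAIADIRD",
    "seqB": "GGGGSGGGGSGGGGS",
}
query = "KLRMWILVATAEIDD"

for word_size in (3, 4, 5):
    index = build_word_index(database, word_size)
    print(word_size, seed_hits(query, index, word_size))
```

Only sequences with a seed would go on to the expensive alignment step. That is where the speed comes from, and also where sensitivity is lost: a related sequence without any shared word is never looked at.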
BLAST quickly finds similar sequences by giving up some sensitivity
bit score S': raw score S corrected for the scale of the scoring scheme
$$S' = \frac{\lambda S - \ln K}{\ln 2}$$
Probability P that the databank yields by pure chance at least one alignment with the same or a higher score
$$P = 1 - e^{-E}$$
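The two formulas can be evaluated directly. In the sketch below, the values of lambda and K are assumptions used purely for illustration (they are the kind of values reported for gapped BLOSUM62 searches); in practice they depend on the scoring scheme and are reported by the BLAST program itself.

```python
import math


def bit_score(raw_score, lam, k):
    """Bit score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def p_value(e_value):
    """P = 1 - exp(-E): probability of at least one chance alignment
    with the same or a higher score. expm1 keeps precision for tiny E."""
    return -math.expm1(-e_value)


# Illustrative parameters only (assumed, not taken from the slides)
LAMBDA, K = 0.267, 0.041

print(round(bit_score(100, LAMBDA, K), 2))
for e in (1e-10, 0.1, 1.0, 10.0):
    print(e, p_value(e))
```

For small E-values, P and E are practically the same number; they only diverge once E approaches 1.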
Interpreting the BLAST results by E-value and bit score
E-value: the lower the better (= the number of alignments with at least this score that you expect to find by chance with a random sequence and a database of the same size; e.g. 0.1 means that in about 1 in 10 searches a similarity this strong could have arisen by chance alone)
Max/Total score: bit score – the higher the better (= score constructed from length of total alignment of the high scoring pair)
Depending on DNA and/or protein sequences as query or in the db, you choose a BLAST version
Different flavours of BLAST
Depending on query sequence: DNA or protein
and database: DNA or protein
Flavour: query - database
blastn: DNA - DNA
blastp: protein - protein
blastx: translated DNA - protein
tblastn: protein - translated DNA
tblastx: translated DNA - translated DNA
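The choice of flavour is a simple lookup on the type of the query and the type of the database, as this small sketch shows. The function name and the `translate_both` switch are assumptions made for this example.

```python
# (query type, database type) -> BLAST flavour
BLAST_FLAVOURS = {
    ("DNA", "DNA"): "blastn",
    ("protein", "protein"): "blastp",
    ("DNA", "protein"): "blastx",   # query is translated
    ("protein", "DNA"): "tblastn",  # database is translated
}


def choose_blast(query_type, database_type, translate_both=False):
    """Pick the BLAST flavour for a query/database combination.
    With translate_both=True, a DNA query is compared to a DNA database
    at the protein level (tblastx)."""
    if translate_both:
        if (query_type, database_type) != ("DNA", "DNA"):
            raise ValueError("tblastx needs a DNA query and a DNA database")
        return "tblastx"
    return BLAST_FLAVOURS[(query_type, database_type)]


print(choose_blast("DNA", "protein"))
print(choose_blast("DNA", "DNA", translate_both=True))
```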
You can adjust a few parameters of the BLAST algorithm
SEG filter for proteins
DUST filter for nucleic acids
E-value threshold for searching, rule of thumb: below 1e-05: good; between 1e-05 and 1e-01: weak similarity; between 1e-01 and 10: take a good look
Higher word size = faster search, but sensitivity down
A lot of power lies within choosing the right database for the BLAST search.
The choice of the database
The "nr/nt" database is the largest nucleotide database available through NCBI BLAST; select the "nr/nt" database for this exercise. It includes all GenBank, RefSeq Nucleotides, EMBL (European nucleotide database), DDBJ (Japanese nucleotide database) and PDB (Protein Data Bank) sequences, but no EST, STS, GSS, or phase 0, 1 or 2 htgs (unfinished high throughput genomic) sequences. The NCBI nr database originally got its name from the phrase "nonredundant" nucleotide database, but there is no longer any claim to nonredundancy in the sequence set.
Nearly every sequence database comes with BLAST services nowadays
Numerous websites offer BLAST online, mostly running NCBI BLAST or WU-BLAST
http://blast.ncbi.nlm.nih.gov/Blast.cgi
http://www.ebi.ac.uk/Tools/sss/
But BLAST is also very easy to install on your own computer ('run locally')
1. Download blast programs ( here )
2. Format your 'database' (multifasta file)
3. Run BLAST
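Steps 2 and 3 could look roughly like the sketch below, driven from Python. It assumes the NCBI BLAST+ programs (`makeblastdb`, `blastp`) are installed and on the PATH; older 'legacy' BLAST installations use different program names. All file and database names are hypothetical, and nothing is executed until `run_local_blast` is called.

```python
import subprocess


def makeblastdb_command(fasta_path, db_name, dbtype="prot"):
    """Step 2: format a multifasta file as a BLAST database.
    dbtype is 'prot' for proteins or 'nucl' for nucleotides."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", db_name]


def blast_command(program, query_path, db_name, evalue=1e-5, out_path="hits.tsv"):
    """Step 3: run a BLAST flavour (e.g. 'blastp') and write tabular output."""
    return [
        program,
        "-query", query_path,
        "-db", db_name,
        "-evalue", str(evalue),
        "-outfmt", "6",  # tab-separated table, one hit per line
        "-out", out_path,
    ]


def run_local_blast(fasta_path, query_path, db_name="mydb"):
    """Format the database, then search it; raises if a program fails."""
    subprocess.run(makeblastdb_command(fasta_path, db_name), check=True)
    subprocess.run(blast_command("blastp", query_path, db_name), check=True)


# Hypothetical file names; show the commands without running them.
print(" ".join(makeblastdb_command("proteins.fasta", "mydb")))
print(" ".join(blast_command("blastp", "query.fasta", "mydb")))
```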
You can also choose to use NCBI Blast online outside of the browser by using netblast (instructions here)