Basic bioinformatics concepts, databases and tools Module 2 Searching for similar sequences Joachim Jacob http://www.bits.vib.be Updated February 2012 http://dl.dropbox.com/u/18352887/BITS_training_material/Link%20to%20mod2-intro_H1_2012_SimSearch.pdf

BITS: Basics of Sequence similarity

May 10, 2015


Module 2 Sequence similarity.

Part of bioinformatics training session "Basic Bioinformatics concepts, databases and tools" - http://www.bits.vib.be/training
Transcript
Page 1: BITS: Basics of Sequence similarity

Basic bioinformatics concepts, databases and tools

Module 2

Searching for similar sequences

Joachim Jacob http://www.bits.vib.be

Updated February 2012 http://dl.dropbox.com/u/18352887/BITS_training_material/Link%20to%20mod2-intro_H1_2012_SimSearch.pdf

Page 2: BITS: Basics of Sequence similarity

Based on annotations, we can use text searching to get sequences of interest (module 1)

WHERE

– Primary dbs

– Derived dbs

HOW to find sequences

by keywords

by literature

by annotation

See BITS website - module 1

Page 3: BITS: Basics of Sequence similarity

In this module, we will look into sequence similarity to get and analyze sequences

Page 4: BITS: Basics of Sequence similarity

Detecting sequence similarity is a cornerstone-type of analysis in bioinformatics

Why would we like to detect similar sequences? 1. Searching in sequence databases for similar sequences

2. From a high-throughput experiment, every read needs to be 'aligned' to a genomic reference sequence or to each other (assembly)

3. Elucidation of functionality by detecting sites of conservation (sequence parts that resemble each other more than would be expected)

4. Phylogeny is built upon comparison of multiple sequences

Page 5: BITS: Basics of Sequence similarity

How to search for sequence similarity

- in sequence databases, via BLAST or FASTA

- to compare two sequences in detail: pairwise sequence comparison

- to compare multiple sequences at once: multiple sequence alignment, de novo assembly

Methods can be categorized into:

Optimal/exhaustive – heuristic – Graphical

Comparing sequences can be classified into 'one to one', 'one to many', or 'many to many'

One to many

One to one

many to many

Page 6: BITS: Basics of Sequence similarity

Conceptualizing the source of sequence similarity

Sequences can be similar because ...

they are derived from evolutionary related organisms

they evolve in similar conditions: convergent evolution

Page 7: BITS: Basics of Sequence similarity

So the first question we have to solve, what is similar? How do we measure similarity?

Similar?

The source of sequence similarity

Page 8: BITS: Basics of Sequence similarity

Similar?

Summary of the really occurred changes

Let's assume this toy example: a short sequence, mutating over time (without insertions or deletions occurring).

Page 9: BITS: Basics of Sequence similarity

Similar?

So taking the most divergent sequences (the first and the last), the only correct alignment for those two, regarding their history, is:

Page 10: BITS: Basics of Sequence similarity

Similar?

KLRMWILVATAEIDD

KPRMCILVAIADIRD

But we usually don't have all intermediate sequences: only the first and the last. How to determine what is the correct alignment?

In addition, multiple changes can have happened at one location over time

Page 11: BITS: Basics of Sequence similarity

Similar?

Many possibilities exist to align them: slide the sequences over each other. One of those positions will have the highest number of identical residues, called matches (green).

[Figure: KLRMWILVATAEIDD and KPRMCILVAIADIRD placed at different offsets relative to each other]

Page 12: BITS: Basics of Sequence similarity

Similar?

In this example, we claim 'we have a match' only if we see an identical residue at that position in both sequences.

[Figure: KLRMWILVATAEIDD and KPRMCILVAIADIRD placed at different offsets relative to each other]

Page 13: BITS: Basics of Sequence similarity

The identity matrix summarizes this scoring system, listing all residue combinations in a table

Residue  A  C  Y  W
A        1  0  0  0
C        0  1  0  0
Y        0  0  1  0
W        0  0  0  1

1 = match, 0 = mismatch

Page 14: BITS: Basics of Sequence similarity

Substitution or scoring matrices provide a means to determine similarity in an objective way

Such matrices are called substitution or scoring matrices. They are used to calculate a score for every pair of aligned amino acids (AA) in aligned sequences, in order to have a measure of sequence similarity.

KPRMCILVAIADIRD
     KLRMWILVATAEIDD
Score: 0 1 0 0 0 0 0 0 0 0
Sum of the scores: 1

KLRMWILVATAEIDD
   KPRMCILVAIADIRD
Score: 0 0 0 0 0 0 0 0 0 1 0 1
Sum of the scores: 2

KLRMWILVATAEIDD
KPRMCILVAIADIRD
Score: 1 0 1 1 0 1 1 1 1 0 1 0 1 0 1
Sum of the scores: 10
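The sliding procedure can be sketched in a few lines of Python. This is a minimal illustration, assuming gap-free sliding and the identity matrix (1 for a match, 0 for a mismatch); the function names `identity_score` and `slide` are made up for this example.

```python
def identity_score(a, b):
    """Score two equally long segments with the identity matrix:
    1 for a match, 0 for a mismatch."""
    return sum(1 for x, y in zip(a, b) if x == y)


def slide(seq1, seq2):
    """Slide seq2 over seq1 without gaps and score every offset.

    offset > 0 shifts seq2 to the right of seq1, offset < 0 to the left.
    Returns a dict {offset: sum of identity scores over the overlap}.
    """
    scores = {}
    for offset in range(-(len(seq2) - 1), len(seq1)):
        start1 = max(0, offset)    # first overlapping position in seq1
        start2 = max(0, -offset)   # first overlapping position in seq2
        overlap = min(len(seq1) - start1, len(seq2) - start2)
        scores[offset] = identity_score(seq1[start1:start1 + overlap],
                                        seq2[start2:start2 + overlap])
    return scores


scores = slide("KLRMWILVATAEIDD", "KPRMCILVAIADIRD")
best_offset = max(scores, key=scores.get)
print("best offset:", best_offset, "score:", scores[best_offset])
```

The offset with the highest sum is the gap-free alignment with the most identities; the other offsets correspond to the lower-scoring placements.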

Page 15: BITS: Basics of Sequence similarity

Complex substitution matrices are more meaningful and sensitive to detect similarity

The two most popular are PAM and BLOSUM. Every pair of aligned residues gets a score, based on the matrix. E.g. an A-A alignment gets score 2 (PAM120) or 4 (BLOSUM62). An F-G gets -5 or -3.

http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Scoring2.html
http://biology.unm.edu/biology/maggieww/Public_Html/444544seqsim.html

Likely changes: positive score - unlikely changes: negative score

Page 16: BITS: Basics of Sequence similarity

Substitution matrices are derived from analysis of multiple alignments of related sequences

PAM (Point Accepted Mutations) by Margaret Dayhoff: global alignments of proteins with >85% identity --> phylogenetic trees --> count substitutions --> estimate the probability of conservation/substitution at a distance of 1 mutation per 100 aa ==> PAM1 table

PAMn tables by matrix multiplication

BLOSUM (BLOCKS Substitution Matrices) by Henikoff and Henikoff: BLOCKS (= local multiple sequence alignments without gaps) databank made from protein families from the PROSITE databank --> BLOSUMn table derived from BLOCKS with >n% conserved aa
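The "PAMn by matrix multiplication" step can be illustrated with a toy example. The sketch below assumes a hypothetical alphabet of only three residues and made-up PAM1-like substitution probabilities; it is not Dayhoff's data. It only shows that multiplying the one-step matrix by itself n times gives the probabilities at evolutionary distance n.

```python
def matmul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matrix_power(m, n):
    """Return m multiplied by itself n times (n >= 0)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = matmul(result, m)
    return result


# Hypothetical PAM1-like matrix for a toy alphabet of three residues.
# Row i holds the probabilities that residue i is conserved (diagonal)
# or replaced by another residue after one unit of evolutionary distance.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

toy_pam250 = matrix_power(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

The published PAM tables are log-odds scores: the probabilities are additionally divided by the background frequency of the residue and converted to a logarithm, a step this sketch leaves out.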

http://en.wikipedia.org/wiki/Substitution_matrix

Page 17: BITS: Basics of Sequence similarity

   A  B  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  X  Y  Z
A  4 -2  0 -2 -1 -2  0 -2 -1 -1 -1 -1 -2 -1 -1 -1  1  0  0 -3 -1 -2 -1
B -2  6 -3  6  2 -3 -1 -1 -3 -1 -4 -3  1 -1  0 -2  0 -1 -3 -4 -1 -3  2
C  0 -3  9 -3 -4 -2 -3 -3 -1 -3 -1 -1 -3 -3 -3 -3 -1 -1 -1 -2 -1 -2 -4
D -2  6 -3  6  2 -3 -1 -1 -3 -1 -4 -3  1 -1  0 -2  0 -1 -3 -4 -1 -3  2
E -1  2 -4  2  5 -3 -2  0 -3  1 -3 -2  0 -1  2  0  0 -1 -2 -3 -1 -2  5
F -2 -3 -2 -3 -3  6 -3 -1  0 -3  0  0 -3 -4 -3 -3 -2 -2 -1  1 -1  3 -3
G  0 -1 -3 -1 -2 -3  6 -2 -4 -2 -4 -3  0 -2 -2 -2  0 -2 -3 -2 -1 -3 -2
H -2 -1 -3 -1  0 -1 -2  8 -3 -1 -3 -2  1 -2  0  0 -1 -2 -3 -2 -1  2  0
I -1 -3 -1 -3 -3  0 -4 -3  4 -3  2  1 -3 -3 -3 -3 -2 -1  3 -3 -1 -1 -3
K -1 -1 -3 -1  1 -3 -2 -1 -3  5 -2 -1  0 -1  1  2  0 -1 -2 -3 -1 -2  1
L -1 -4 -1 -4 -3  0 -4 -3  2 -2  4  2 -3 -3 -2 -2 -2 -1  1 -2 -1 -1 -3
M -1 -3 -1 -3 -2  0 -3 -2  1 -1  2  5 -2 -2  0 -1 -1 -1  1 -1 -1 -1 -2
N -2  1 -3  1  0 -3  0  1 -3  0 -3 -2  6 -2  0  0  1  0 -3 -4 -1 -2  0
P -1 -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2  7 -1 -2 -1 -1 -2 -4 -1 -3 -1
Q -1  0 -3  0  2 -3 -2  0 -3  1 -2  0  0 -1  5  1  0 -1 -2 -2 -1 -1  2
R -1 -2 -3 -2  0 -3 -2  0 -3  2 -2 -1  0 -2  1  5 -1 -1 -3 -3 -1 -2  0
S  1  0 -1  0  0 -2  0 -1 -2  0 -2 -1  1 -1  0 -1  4  1 -2 -3 -1 -2  0
T  0 -1 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1  0 -1 -1 -1  1  5  0 -2 -1 -2 -1
V  0 -3 -1 -3 -2 -1 -3 -3  3 -2  1  1 -3 -2 -2 -3 -2  0  4 -3 -1 -1 -2
W -3 -4 -2 -4 -3  1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -2 -3 11 -1  2 -3
X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
Y -2 -3 -2 -3 -2  3 -3  2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1  2 -1  7 -2
Z -1  2 -4  2  5 -3 -2  0 -3  1 -3 -2  0 -1  2  0  0 -1 -2 -3 -1 -2  5

The BLOSUM62 similarity matrix
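As a small worked example, the gap-free alignment of the two toy sequences used earlier can be scored with values read from the BLOSUM62 table above. The sketch copies only the residue pairs needed for this alignment; the names `BLOSUM62_EXCERPT`, `pair_score` and `alignment_score` are made up for this illustration.

```python
# Excerpt of the BLOSUM62 table above: only the residue pairs that occur
# in this example. The matrix is symmetric, so each pair is stored once,
# with the two residues in alphabetical order.
BLOSUM62_EXCERPT = {
    ("A", "A"): 4, ("D", "D"): 6, ("I", "I"): 4, ("K", "K"): 5,
    ("L", "L"): 4, ("M", "M"): 5, ("R", "R"): 5, ("V", "V"): 4,
    ("L", "P"): -3, ("C", "W"): -2, ("I", "T"): -1,
    ("D", "E"): 2, ("D", "R"): -2,
}


def pair_score(x, y, matrix=BLOSUM62_EXCERPT):
    """Look up the score of one aligned residue pair."""
    return matrix[tuple(sorted((x, y)))]


def alignment_score(a, b, matrix=BLOSUM62_EXCERPT):
    """Sum the pair scores of two equally long, gap-free aligned sequences."""
    return sum(pair_score(x, y, matrix) for x, y in zip(a, b))


seq1 = "KLRMWILVATAEIDD"
seq2 = "KPRMCILVAIADIRD"
print([pair_score(x, y) for x, y in zip(seq1, seq2)])
print("total:", alignment_score(seq1, seq2))
```

Unlike the identity matrix, the conservative E/D replacement now contributes a positive score, while unlikely changes such as L/P are penalised.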

ftp://ftp.ncbi.nih.gov/blast/matrices/

Page 18: BITS: Basics of Sequence similarity

The substitution matrices capture the similarity in properties between residues

From Livingstone, C. D. and Barton, G. J. (1993),"Protein Sequence Alignments: A Strategy for the Hierarchical Analysis of Residue Conservation", Comp. Appl. Bio. Sci., 9, 745-756.

Page 19: BITS: Basics of Sequence similarity
Page 20: BITS: Basics of Sequence similarity

A matrix does not capture insertions and deletions: penalties are given to deal with them

When two sequences align, the relation between aligned residues can only be seen as one of the following:

identity

mismatch (substitution (DNA) or similarity level (protein))

gap (insertion/deletion)

Page 21: BITS: Basics of Sequence similarity

The two parts of the gap penalty: a higher penalty for creation, one lower for its extension

Score from substitution matrix

Gap penalty

Page 22: BITS: Basics of Sequence similarity

How to search for sequence similarity

- In sequence databases: BLAST or FASTA

- to compare two sequences in detail: pairwise sequence comparison

- to compare multiple sequences: multiple sequence alignment

3 methods exist:

Graphical – optimal/exhaustive – heuristic

Substitution matrices are used in many algorithms to detect sequence similarity

One to many

One to one

many to many

Page 23: BITS: Basics of Sequence similarity

Pairwise sequence comparison – one to one

To create an alignment between two sequences

- Manually (?)

- Two sequences (= pairwise alignment): optimal alignment through 'dynamic programming'

– Needleman-Wunsch (global alignment)

– Smith-Waterman (local alignment)

Page 24: BITS: Basics of Sequence similarity

Dynamic programming uses a gap penalty and a scoring scheme to align two sequences

Dynamic programming: two things needed

Scoring scheme to measure identity and similarity

• choose a scoring matrix for similarity and identity (e.g. PAM250)

Gap penalty

• For each gap, a penalty is subtracted from the final score; it is also called weight or cost

most used: a + b * (n-1) for a gap of n positions

a : gap opening penalty, higher penalty (negative score)

b : gap extension penalty, smaller penalty to widen a gap
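The gap penalty formula a + b * (n-1) can be written out as a short function. This is a sketch; the default values 10 and 0.5 are example settings chosen for illustration, not values taken from the slides.

```python
def affine_gap_penalty(n, gap_open=10.0, gap_extend=0.5):
    """Penalty for one gap of n positions: a + b * (n - 1).

    gap_open (a) is paid once for creating the gap, gap_extend (b) for
    every additional position. Both defaults are illustrative values.
    """
    if n <= 0:
        return 0.0
    return gap_open + gap_extend * (n - 1)


# One gap of three positions is much cheaper than three separate gaps:
print(affine_gap_penalty(3))        # one gap of length 3
print(3 * affine_gap_penalty(1))    # three gaps of length 1
```

With these example values a single long gap costs 11.0 and three single-position gaps cost 30.0, which is why this scheme favours a few long gaps over many scattered ones.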

http://biology.unm.edu/biology/maggieww/Public_Html/444544seqsim.html

Page 25: BITS: Basics of Sequence similarity

Dynamic programming and backtracking

[Figure: dynamic programming matrix for the sequences ATT and TTC (source and target), with the backtracking path that yields the alignment below]

A T T -
- T T C

Each cell of the matrix takes the best of its three neighbouring cells:

$$S_{i,j} = \max \begin{cases} S_{i-1,j-1} + s(a_i, b_j) \\ S_{i,j-1} + s(-, b_j) \\ S_{i-1,j} + s(a_i, -) \end{cases}$$

scoring scheme:

• $s(a_i, b_j) = +1$ if $a_i = b_j$

• $s(a_i, b_j) = -1$ if $a_i \neq b_j$

• $s(a_i, -) = -1$

• $s(-, b_j) = -1$
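The matrix filling and backtracking on this slide can be sketched as follows. This is a minimal Needleman-Wunsch implementation, assuming the simple scoring scheme of the slide (+1 for a match, -1 for a mismatch, -1 for every gap position) rather than a substitution matrix with separate gap opening and extension penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b by dynamic programming.

    Returns (score, aligned_a, aligned_b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    S = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against gaps only.
    for i in range(1, rows):
        S[i][0] = i * gap
    for j in range(1, cols):
        S[0][j] = j * gap
    # Fill: each cell is the best of diagonal, up and left.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + s,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
    # Backtracking from the bottom-right cell to the top-left cell.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return S[len(a)][len(b)], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("ATT", "TTC")
print(score)
print(top)
print(bottom)
```

For the toy sequences ATT and TTC this returns the alignment ATT- over -TTC. When several paths score equally, the backtracking order (diagonal first, then up, then left) decides which one is reported.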

Page 26: BITS: Basics of Sequence similarity

Two approaches to align pairwise: align globally versus locally

Global alignment: the Needleman-Wunsch algorithm considers similarity across the full extent of the sequences A and B

Local alignment: the Smith-Waterman algorithm focuses on regions of similarity in parts of the sequences A and B
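The difference with the global approach is small in code. The sketch below is a minimal Smith-Waterman implementation under the same assumed toy scoring (+1 match, -1 mismatch, -1 per gap position): cell values are never allowed to drop below zero, and backtracking starts at the highest cell and stops at the first zero.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment of a and b by dynamic programming.

    Returns (best_score, aligned_part_of_a, aligned_part_of_b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets a local alignment start anywhere.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Backtrack from the best cell until a cell with score 0 is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("ATT", "TTC"))
```

For ATT and TTC only the shared TT region is reported, without the flanking residues that a global alignment would have to place against gaps.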

Page 27: BITS: Basics of Sequence similarity

Software for creating an optimal pairwise alignment

Best global alignment (Needleman-Wunsch)

– EMBOSS needle (web interface here, here on Mobyle)

– EMBOSS stretcher (with Myers-Miller optimization, for very long sequences – web interface on Mobyle)

Best local alignment (Smith-Waterman)

– EMBOSS water

– SIM (Huang and Miller, with optimization for very long sequences, can also find non-overlapping suboptimal alignments) (link)

– EMBOSS matcher (same as SIM)

– modified version of SIM (by Laurent Duret) with output for graphical viewer

http://mobyle.pasteur.fr/cgi-bin/portal.py#welcome

Page 28: BITS: Basics of Sequence similarity

Parameters that are set

Page 29: BITS: Basics of Sequence similarity

A graphical method: dot plots can be made to rapidly identify regions with similar sequence

The parameters of a dotplot (which uses the identity matrix) are the word size (e.g. 3 residues) and the threshold (the percentage of positions in a word that must be identical). Dot plots are very convenient for large molecules, e.g. chromosomes.
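A dot plot with these two parameters can be sketched as below. It is a naive illustration that compares every word of one sequence with every word of the other, which is fine for short sequences but not how tools for whole chromosomes work; the names `dotplot` and `render` are made up for this example.

```python
def dotplot(seq_a, seq_b, word_size=3, min_identity=1.0):
    """Return the (i, j) positions where the word starting at seq_a[i] and
    the word starting at seq_b[j] reach the identity threshold.

    min_identity is the fraction of positions in a word that must match.
    """
    dots = []
    for i in range(len(seq_a) - word_size + 1):
        for j in range(len(seq_b) - word_size + 1):
            identities = sum(1 for k in range(word_size)
                             if seq_a[i + k] == seq_b[j + k])
            if identities / word_size >= min_identity:
                dots.append((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the dots as text: rows follow seq_a, columns follow seq_b."""
    marked = set(dots)
    lines = ["  " + seq_b]
    for i, residue in enumerate(seq_a):
        row = "".join("*" if (i, j) in marked else "." for j in range(len(seq_b)))
        lines.append(residue + " " + row)
    return "\n".join(lines)


dots = dotplot("KLRMWILVATAEIDD", "KPRMCILVAIADIRD", word_size=3)
print(render("KLRMWILVATAEIDD", "KPRMCILVAIADIRD", dots))
```

Lowering `min_identity` (for example to 0.66 with a word size of 3) adds more dots and more noise; raising the word size does the opposite.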

Page 30: BITS: Basics of Sequence similarity

Software for making dotplots

EMBOSS contains the following programs

– dottup : word comparison

– dotmatcher : window/threshold comparison

– polydot : word comparison, makes n*n dotplots in one graph

Dotter – developed by Erik Sonnhammer and Richard Durbin (U. Stockholm, Sweden)

Dotlet – (Java applet) at the Swiss Institute of Bioinformatics

Gepard – (Munich Information center for Protein Sequences, Germany): with heuristic for speeding up computation, for comparing very long sequences

...

http://www.bits.vib.be/wiki/index.php/Dotplot

Page 31: BITS: Basics of Sequence similarity

Dot plots generate typical patterns which can be interpreted

http://www.code10.info/index.php?option=com_content&view=article&id=64:inroduction-to-dot-plots&catid=52:cat_coding_algorithms_dot-plots&Itemid=76

Sequence A

Sequence B

Simple repeat

Insertion in sequence B

Insertion in sequence A

Complex repeat

Palindrome

Page 32: BITS: Basics of Sequence similarity

How to search for sequence similarity

- In sequence databases: BLAST or FASTA

- to compare two sequences in detail: pairwise sequence comparison

- to compare multiple sequences: multiple sequence alignment

3 methods exist:

Graphical – optimal/exhaustive – heuristic

Multiple sequence alignment

One to many

One to one

many to many

Page 33: BITS: Basics of Sequence similarity

Multiple sequence alignment is not simply expanding pairwise sequence alignments

Many to many

One could try dynamic programming, but this is too time-consuming: aligning 20 sequences would already take more time than the universe has existed...

Heuristic methods lead the way: "progressive alignment" most used

1. Use pairwise dynamic programming for all sequences

2. Guide tree is constructed based on scores

3. Two sequences are aligned, and sequentially every sequence is added following the guide tree (progressive clustering)

Page 34: BITS: Basics of Sequence similarity

Progressive clustering is a two-step process: measuring distance and constructing alignment

STEP 1: Measure similarity

– N(N-1)/2 pairwise sequence alignments

– similarity matrix:

    A    B    C
B  142
C   95  101
D   60   62   55

STEP 2: Construct MSA

– guide tree for A, B, C and D by progressive clustering (take mean of AB, take mean of ABC)

– progressive alignment: "once a gap, always a gap"

– multiple sequence alignment
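The clustering step can be sketched with the similarity matrix of this slide. The code assumes average-linkage clustering (the most similar pair is joined first, and the similarity of a cluster to the rest is the mean over its members), which is one reading of "take mean of AB"; actual MSA programs may build their guide tree differently. The function name `guide_tree` is made up for this example.

```python
from itertools import combinations

# Pairwise similarity scores from the slide (N(N-1)/2 = 6 values for N = 4).
SIMILARITY = {
    frozenset("AB"): 142, frozenset("AC"): 95, frozenset("BC"): 101,
    frozenset("AD"): 60, frozenset("BD"): 62, frozenset("CD"): 55,
}


def guide_tree(similarity):
    """Join the most similar clusters first; return (tree, list of joins)."""
    names = sorted({name for pair in similarity for name in pair})
    # Each cluster is (tree, members); a leaf's tree is just its name.
    clusters = [(name, [name]) for name in names]

    def mean_similarity(c1, c2):
        scores = [similarity[frozenset((x, y))] for x in c1[1] for y in c2[1]]
        return sum(scores) / len(scores)

    joins = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2),
                     key=lambda pair: mean_similarity(*pair))
        joins.append((c1[0], c2[0], mean_similarity(c1, c2)))
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0], joins


tree, joins = guide_tree(SIMILARITY)
print(tree)
for left, right, score in joins:
    print(left, "+", right, "at mean similarity", score)
```

The order of the joins is the order in which the progressive alignment adds sequences: A and B first, then C, then D.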

Page 35: BITS: Basics of Sequence similarity

Progressive clustering: once a gap, always a gap

Page 36: BITS: Basics of Sequence similarity

The guide tree is NOT a phylogenetic tree !

Page 37: BITS: Basics of Sequence similarity

The progressive alignment framework can be extended to make it faster and more sensitive

More sensitive:

- consistency: per position scoring scheme (T-Coffee)

- structural guidance: based on structural info alignment is guided (Expresso)

Faster:

- distance measured by analysing k-tuples (see later) instead of pairwise aligning (Clustal Omega)
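The idea of measuring similarity from k-tuples instead of from alignments can be sketched as follows. This is a deliberately simple measure (shared k-tuples divided by the size of the smaller set), chosen for illustration; it is not the exact distance used by Clustal Omega.

```python
def ktuples(seq, k):
    """Return the set of all words of length k (k-tuples) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def ktuple_similarity(a, b, k=2):
    """Fraction of k-tuples shared by a and b, relative to the smaller set.

    No alignment is computed, so this is much faster than dynamic
    programming, at the price of being a cruder measure.
    """
    words_a, words_b = ktuples(a, k), ktuples(b, k)
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / min(len(words_a), len(words_b))


print(ktuple_similarity("KLRMWILVATAEIDD", "KPRMCILVAIADIRD", k=2))
```

Such scores can fill the similarity matrix from which the guide tree is built, replacing the N(N-1)/2 pairwise alignments.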

http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.0030123

Page 38: BITS: Basics of Sequence similarity

Different formats of aligned sequences exist

http://www.bits.vib.be/wiki/index.php/Multiple_sequence_alignment#Formats

Clustal format

h1_pea    -MATEEPIVAVETVPEPIVTEPTTITEPEVPEKEEPKAEVEKTKKAKGSKPKKASKPRNP
h1_sollc  -MATEEPVIVNEVVEEQAA--PETVKDEANPPAKSGKAKKETKAKKPAAPRKRSATP---
h11_volca MSETEAAPVVAPAAEAAPAAEAPKAKAPKAKAPKQPKAPKAPKEPKAPKEKKPKAAP---

h1_pea    ASHPTYEEMIKDAIVSLKEKNGSSQYAIAKFIEEKQ-KQLP-ANFKKLLLQNLKKNVASG
h1_sollc  -THPPYFEMIKDAIVTLKERTGSSQHAITKFIEEKQ-KSLP-SNFKKLLLTQLKKFVASE
h11_volca -THPPYIEMVKDAITTLKERNGSSLPALKKFIENKYGKDIHDKNFAKTLSQVVKTFVKGG

Phylip format

3 298
h1_pea    -MATEEPIVA VETVPEPIVT EPTTITEPEV PEKEEPKAEV EKTKKAKGSK
h1_sollc  -MATEEPVIV NEVVEEQAA- -PETVKDEAN PPAKSGKAKK ETKAKKPAAP
h11_volca MSETEAAPVV APAAEAAPAA EAPKAKAPKA KAPKQPKAPK APKEPKAPKE

          PKKASKPRNP ASHPTYEEMI KDAIVSLKEK NGSSQYAIAK FIEEKQ-KQL
          RKRSATP--- -THPPYFEMI KDAIVTLKER TGSSQHAITK FIEEKQ-KSL
          KKPKAAP--- -THPPYIEMV KDAITTLKER NGSSLPALKK FIENKYGKDI

Page 39: BITS: Basics of Sequence similarity

Software that implements these algorithms and lets you manually adjust the alignments

Alignment editors

- SeaView

- SeqPup

- GeneDoc

- Jalview

- BioEdit

- CLC Sequence Viewer

- UGene

http://www.bits.vib.be/wiki/index.php/Multiple_sequence_alignment

Page 40: BITS: Basics of Sequence similarity

Additional references

Notredame et al. (2000), “T-Coffee: A novel method for fast and accurate multiple sequence alignment,” J Mol Biol 302:205. [Introduced notion of consistency]

Blackshields et al. (2010), “Sequence embedding for fast construction of guide trees for multiple sequence alignment,” Algorithms for Mol Biol 5:21. [mBed algorithm]

Söding (2005), “Protein homology detection by HMM-HMM comparison,” Bioinformatics 21:951. [HHalign algorithm]

Thompson et al. (2005), “BAliBASE 3.0: Latest developments of the multiple sequence alignment benchmark,” Proteins 61:127.

Page 41: BITS: Basics of Sequence similarity

How to search for sequence similarity

- In sequence databases: BLAST or FASTA

- to compare two sequences in detail: pairwise sequence comparison

- to compare multiple sequences: multiple sequence alignment

3 methods exist:

Graphical – optimal/exhaustive – heuristic

Divided in 'one to one', 'one to many', or 'many to many' sequence comparisons

One to many

One to one

many to many

Page 42: BITS: Basics of Sequence similarity

Searching sequence databases is done through a little trick

Problem

find me all similar sequences to a query sequence in a database.

(Find me the position of many short reads in a genome)

Bottleneck:

we cannot compute an optimal alignment between the query and every database sequence and then determine which is best (~MSA). This is too time-consuming and only practicable on special hardware (a parallel computer or a computer cluster)

"Heuristic" algorithm : gain of speed at the expense of some loss in sensitivity

• BLAST (developed by S. Altschul et al. at NCBI)

• FastA (developed by W. Pearson at U. of Virginia)

Page 43: BITS: Basics of Sequence similarity

BLAST quickly finds similar sequences by giving up some sensitivity

Algorithm (= steps to follow to reach your goal)

http://www.ncbi.nlm.nih.gov/books/NBK21097/

Page 44: BITS: Basics of Sequence similarity

BLAST step 1 : neighbouring

BLAST step 2 : searching the little words in the db

Page 45: BITS: Basics of Sequence similarity

BLAST step 3 : extend where the words match

Only if words match >Sg score

Proteins: only extension if another hit <40

Proteins: optimal composition adapted
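The word search and extension steps can be sketched as below. This is a strongly simplified illustration, not the BLAST implementation: it uses exact word matches only (no neighbourhood words scored with a substitution matrix), extends without gaps, and stops extending when the running score drops more than a chosen amount below the best score seen. All function names and default values are assumptions made for this example.

```python
from collections import defaultdict


def index_words(db_seq, k):
    """Index every word of length k in the database sequence by position."""
    index = defaultdict(list)
    for pos in range(len(db_seq) - k + 1):
        index[db_seq[pos:pos + k]].append(pos)
    return index


def extend(query, db_seq, q, d, step, match, mismatch, x_drop):
    """Ungapped extension from (q, d) in direction step (+1 or -1).

    Returns (best score gain, number of positions giving that gain).
    """
    score = best = 0
    length = best_len = 0
    while 0 <= q < len(query) and 0 <= d < len(db_seq):
        score += match if query[q] == db_seq[d] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # the score dropped too far: give up this direction
        q += step
        d += step
    return best, best_len


def seed_and_extend(query, db_seq, k=3, match=1, mismatch=-1, x_drop=2):
    """Find word hits, extend them both ways, return the scored segments
    as (query start, database start, length, score), best first."""
    index = index_words(db_seq, k)
    segments = set()
    for q in range(len(query) - k + 1):
        for d in index.get(query[q:q + k], []):
            right, r_len = extend(query, db_seq, q + k, d + k, +1,
                                  match, mismatch, x_drop)
            left, l_len = extend(query, db_seq, q - 1, d - 1, -1,
                                 match, mismatch, x_drop)
            segments.add((q - l_len, d - l_len, k + l_len + r_len,
                          k * match + left + right))
    return sorted(segments, key=lambda seg: -seg[3])


print(seed_and_extend("GATTACA", "CCCGATTACACCC"))
```

The speed comes from the index: only database positions that share a word with the query are examined at all, and a similar region without any shared word is missed, which is the sensitivity that is given up.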

Page 46: BITS: Basics of Sequence similarity

Each BLAST search hit has an E-value, which is how many hits we expect by chance

$$E = m \, n \, K \, e^{-\lambda S}$$

m : query sequence length

n : total databank length

K and λ : parameters obtained by simulation (search random sequence against random databank)

Expect value : number of unrelated databank sequences expected to yield same or higher score S by pure chance (extreme value distribution)

http://bioinfo.uncc.edu/zhx/binf8312/lecture-7-SequenceAnalyses.pdf

Page 47: BITS: Basics of Sequence similarity

BLAST statistics

bit score : score corrected for scale of scoring scheme

$$S' = \frac{\lambda S - \ln K}{\ln 2}$$

Probability P that databank yields by pure chance at least one alignment with same or higher score:

$$P = 1 - e^{-E}$$
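The three statistics can be computed directly from their formulas. In the sketch below the values of λ and K, the raw score and the sequence lengths are illustrative assumptions; in practice BLAST reports the parameters it used at the end of its output.

```python
import math


def e_value(raw_score, m, n, lam, K):
    """Expected number of chance hits: E = m * n * K * exp(-lambda * S)."""
    return m * n * K * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, K):
    """Bit score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def p_value(e):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e)  # same as 1 - exp(-e), but accurate for small e


# Illustrative values only (assumptions, not taken from the slides).
lam, K = 0.267, 0.041
m, n = 250, 50_000_000   # query length and total databank length
S = 120                  # raw alignment score

E = e_value(S, m, n, lam, K)
print("E-value  :", E)
print("bit score:", bit_score(S, lam, K))
print("P-value  :", p_value(E))
```

Substituting the bit score into the E-value formula gives E = m n 2^(-S'), which is why bit scores from searches with different scoring schemes can be compared directly. For small E, P and E are practically equal.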

Page 48: BITS: Basics of Sequence similarity

Interpreting the BLAST results by E-value and bit score

E-value: the lower the better (= number of hits with such a similarity expected by chance with a random sequence and a database of the same size) (e.g. 0.1 means that in 1 out of 10 searches, this similarity could have arisen by chance alone)

Max/Total score: bit score – the higher the better (= score constructed from length of total alignment of the high scoring pair)

Page 49: BITS: Basics of Sequence similarity

Depending on DNA and/or protein sequences as query or in the db, you choose a BLAST version

Different flavours of BLAST

Depending on query sequence: DNA or protein

and database: DNA or protein

Flavour: query - database

blastn: DNA - DNA

blastp: protein - protein

blastx: translated DNA - protein

tblastn: protein - translated DNA

tblastx: translated DNA - translated DNA

Page 50: BITS: Basics of Sequence similarity

You can adjust a few parameters of the BLAST algorithm

SEG filter for proteins

DUST filter for nucleic acids

E-value threshold for searching, rule of thumb: below 1e-05 is good; between 1e-05 and 1e-01 is weak similarity; between 1e-01 and 10: take a good look

Smaller word size = sensitivity up (but a slower search)

Page 51: BITS: Basics of Sequence similarity

A lot of power lies within choosing the right database for the BLAST search.

The choice of the database

The "nr/nt" database is the largest nucleotide database available through NCBI BLAST; select the "nr/nt" database for this exercise. It includes all GenBank, RefSeq Nucleotides, EMBL (European nucleotide database), DDBJ (Japanese nucleotide database) and PDB (Protein Data Bank) sequences, but no EST, STS, GSS, or phase 0, 1 or 2 htgs (unfinished high throughput genomic) sequences. The NCBI nr database originally got its name from the phrase "nonredundant" nucleotide database, but there is no longer any claim to nonredundancy in the sequence set.

Page 52: BITS: Basics of Sequence similarity

Nearly every sequence database comes with BLAST services nowadays

Numerous online websites, mostly running NCBI BLAST or WU-BLAST

http://blast.ncbi.nlm.nih.gov/Blast.cgi

http://www.ebi.ac.uk/Tools/sss/

But very easy to install on own computer ('run locally')

1. Download blast programs ( here )

2. Format your 'database' (multifasta file)

3. Run BLAST
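The three steps can be scripted once the programs are installed. The sketch below assumes the NCBI BLAST+ command line programs `makeblastdb` and `blastp` are on the PATH and that the database holds protein sequences; the file names are hypothetical, and older BLAST releases use different program names.

```python
import shutil
import subprocess


def blast_commands(db_fasta, query_fasta, evalue=1e-5):
    """Build the command lines for steps 2 and 3 (protein database)."""
    make_db = ["makeblastdb", "-in", db_fasta, "-dbtype", "prot"]
    search = ["blastp", "-query", query_fasta, "-db", db_fasta,
              "-evalue", str(evalue), "-outfmt", "6"]  # 6 = tabular output
    return make_db, search


def run_local_blast(db_fasta, query_fasta, evalue=1e-5):
    """Format the database, run the search and return the tabular hits."""
    for program in ("makeblastdb", "blastp"):
        if shutil.which(program) is None:
            raise RuntimeError(program + " not found: install BLAST+ first (step 1)")
    make_db, search = blast_commands(db_fasta, query_fasta, evalue)
    subprocess.run(make_db, check=True)
    result = subprocess.run(search, check=True, capture_output=True, text=True)
    return result.stdout


# Hypothetical file names, shown without running anything:
for command in blast_commands("proteins.fasta", "query.fasta"):
    print(" ".join(command))
```

Calling `run_local_blast("proteins.fasta", "query.fasta")` would execute both commands; for a nucleotide database the database type would be `nucl` and the search program `blastn`.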

You can also choose to use NCBI Blast online outside of the browser by using netblast (instructions here)

http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/

Page 53: BITS: Basics of Sequence similarity

Some adjustments to the BLAST protocol exist for particular purposes

Identifying very distantly related proteins

PSI-BLAST (position specific iterated) (see module 3)

BLAST protein with matching of a pattern

PHI-BLAST (pattern hit initiated) (see module 3)

BLAST highly similar nucleotide sequences

Mega-BLAST

LastZ explanation – have a look at the dotplots here

Page 54: BITS: Basics of Sequence similarity

BLAST2SEQ aligns 2 sequences and visualises the output in a dotplot-like graph

The tool to do this is called BLAST2SEQ: e.g. comparing chrI with ChrVIII of S. cerevisiae

insertions!

http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch&PROG_DEF=blastn&BLAST_PROG_DEF=megaBlast&SHOW_DEFAULTS=on&BLAST_SPEC=blast2seq&LINK_LOC=align2seq

Page 55: BITS: Basics of Sequence similarity

BLAT is derived from BLAST, and is used for searching very similar sequences in a genome

BLAT = BLAST like alignment tool

- database is a genome sequence

- the database is not read from files, but is kept in memory as an index of words of size 11

- it can only be used for very similar sequences, e.g. if you have a fragment whose position in the genome you want to know.

Page 56: BITS: Basics of Sequence similarity

The technique of indexing 'words' is also used in some short read aligners

In BLAST for nucleotides: k = 11 (11-mers)

11111111111 → 11 consecutive matches

However, non-consecutive matches improve sensitivity: a spaced seed.

111010010100110111 → 55% more sensitive

• 1 means a match, a 0 means a 'don't care' position

– Key size: number of 1's

– Key width: total number of 0's and 1's

• The 'keys' are used to index the genome or the reads, depending on the aligner
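How a spaced seed turns a sequence window into a key can be sketched as follows. The seed pattern is the one shown on this slide; the function names and the toy sequences are assumptions made for this illustration.

```python
from collections import defaultdict

SEED = "111010010100110111"   # key size 11 (number of 1's), key width 18


def seed_key(seq, pos, seed=SEED):
    """Key of the window starting at pos: keep the bases under a 1,
    skip the bases under a 0 ('don't care' positions)."""
    window = seq[pos:pos + len(seed)]
    if len(window) < len(seed):
        return None  # window runs off the end of the sequence
    return "".join(base for base, care in zip(window, seed) if care == "1")


def index_sequence(seq, seed=SEED):
    """Index all windows of a sequence (genome or read) by their key."""
    index = defaultdict(list)
    for pos in range(len(seq) - len(seed) + 1):
        index[seed_key(seq, pos, seed)].append(pos)
    return index


reference = "ACGTACGTACGTACGTAC"
read = "ACGAACGTACGTACGTAC"   # differs at position 3, a 'don't care' position
print(seed_key(reference, 0) == seed_key(read, 0))
```

The read still produces the same key as the reference although it contains a mismatch, which a seed of 11 consecutive matches covering that position would not tolerate.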

doi: 10.1093/bib/bbq015 on http://dx.doi.org

Page 57: BITS: Basics of Sequence similarity

FastA is another popular sequence database search algorithm

Find runs of identities

Rescore using PAM matrix and keep top scoring segments

Page 58: BITS: Basics of Sequence similarity

FastA is another popular sequence database search algorithm

Apply 'joining threshold' to eliminate segments that are unlikely to be part of the alignment that includes the highest scoring alignment

Use dynamic programming to optimize the alignment in a narrow band that encompasses the top scoring segments

Page 59: BITS: Basics of Sequence similarity

FastA is accessible on the website of EBI

Further explanation of algorithm: here

Accessibility• EBI (help link)

FastA developers: link

Page 60: BITS: Basics of Sequence similarity

The interpretation of FastA output is similar to that of BLAST.

http://www.ebi.ac.uk/Tools/sss/fasta/

Page 61: BITS: Basics of Sequence similarity

Similarity you observe, homology you infer

Interpreting results

Sequences are similar if their similarity score is significantly higher than that of random sequences of same length and composition.

Sequences are homologous if they are similar because they diverged from a common ancestor.

Sequences are analogous if they are similar because of convergent evolution (e.g. binding sites for same ligand)

Similarity you observe, homology you infer !

You can speak of %similarity or %identity, not of %homology !

Page 62: BITS: Basics of Sequence similarity

Homology: orthologous and paralogous (in- and out-)

(out)

(in)

Page 63: BITS: Basics of Sequence similarity

Summary sequence similarity

Pairwise (one to one)

– Dotplot (graphical)

– Smith-Waterman / Needleman-Wunsch (optimal)

Multiple sequence alignment (many to many) (heuristic)

– ClustalW

– Muscle, ...

Database search (one to many) (heuristic)

– BLAST

– FastA

– BLAT

Page 64: BITS: Basics of Sequence similarity

What can you check to stay up to date?

Biocatalogue http://www.biocatalogue.org/

EMBRACE http://www.embraceregistry.net/

Bioinformatics Links Directory http://www.bioinformatics.ca/links_directory/

Page 65: BITS: Basics of Sequence similarity

Summary

Detecting sequence similarity is a cornerstone-type of analysis in bioinformatics

The identity matrix summarizes this scoring system, listing all residue combinations in a table

Substitution or scoring matrices provide a means to determine similarity in an objective way

Complex substitution matrices are more meaningful and sensitive to detect similarity

Substitution matrices are derived from analysis of multiple alignments of related sequences

The substitution matrices capture the similarity in properties between residues

A matrix does not capture insertions and deletions: penalties are given to deal with them

The two parts of the gap penalty: a higher penalty for creation, one lower for its extension

Dynamic programming uses a gap penalty and a scoring scheme to align two sequences

Needleman-Wunsch to align two sequences over the whole length (global alignment)

Smith-Waterman to align the most similar parts of two sequences (local alignment)

A graphical method: dot plots can be made to rapidly identify regions with similar sequence

Dot plots generate typical patterns which can be interpreted

Multiple sequence alignment is not simply expanding pairwise sequence alignments

Searching sequence databases is done through a little trick

BLAST quickly finds similar sequences by giving up some sensitivity

Depending on DNA and/or protein sequences as query or in the db, you choose a BLAST version

You can adjust a few parameters of the BLAST algorithm

A lot of power lies within choosing the right database for the BLAST search.

Some adjustments to the BLAST protocol exist for particular purposes

BLAST2SEQ aligns 2 sequences and visualises the output in a dotplot-like graph

BLAT is derived from BLAST, and is used for searching very similar sequences in a genome

The technique of indexing 'words' is also used in some short read aligners

FastA is another popular sequence database search algorithm

FastA is accessible on the website of EBI

The interpretation of FastA output is similar to that of BLAST.

Similarity you observe, homology you infer

Homology: orthologous and paralogous (in- and out-)