PATTERNS, MOTIFS BLOCKS AND PSI-BLAST Shifra Ben-Dor Irit Orr APPLICATIONS OF MULTIPLE ALIGNMENT

APPLICATIONS OF MULTIPLE ALIGNMENT · A fingerprint is a group of conserved motifs used to characterise a protein family. Usually the motifs do not overlap, but are separated along

Oct 07, 2020


PATTERNS, MOTIFS, BLOCKS AND PSI-BLAST

Shifra Ben-Dor Irit Orr

APPLICATIONS OF MULTIPLE ALIGNMENT

PAIRWISE ALIGNMENT

DATABASE SEARCHING

MULTIPLE ALIGNMENT

MULTIPLE ALIGNMENT

Phylogenetic Analysis

Homology Modeling

Advanced Database Searches, Patterns, Motifs, Promoters

Why run similarity searches?

■ Similarity searches of databases are used in order to:

■ Gain knowledge and understanding of a gene or protein, in terms of evolution, structure or function.

■ Find homologous sequences, that is, sequences derived from a common ancestor.

Database searching doesn’t always find what we want…..

■ The tools used for similarity searches (e.g. BLAST, FASTA) are known to miss 10% - 20% of “true hits”. This “area” of similarity is known as the “twilight zone”.

■ The proportion of missed similarities is even greater when searching modular proteins (those composed of several small domains).

■ So, other tools are needed for these specific searches.

Database searching doesn’t always find what we want…..

Even using advanced methods such as Smith-Waterman and Framesearch, more distantly related, though biologically relevant, sequences are often missed, due to the requirement for high sequence similarity.

Biologically relevant sequences that are hard to find

■ Proteins with several short regions of similarity

aaa…..............bbb…………...…..ccc

aaa……….bbb…………ccc

■ Proteins with extended motifs

GV (X20) C (X30) C

■ Proteins with ‘inexact’ motifs (structural, electrostatic, hydrophobic/philic motifs)

How do we usually find them?

Historically, these protein families were found by looking for functionally related sequences, either in the same species or in others.

The similarities can also be seen by performing multiple alignments on the more distantly related sequences that we can find.

The bottom line …...

What we would like to do is harness the power of multiple alignments to help us in our database searches.

New tools are needed for these specific similarity searches, based on the knowledge gained from multiple alignments (e.g. protein families).

Tools like motif, pattern, block or profile searches can help. These tools use family information to improve the sensitivity to distant family members (homologs).

The Old Method: Scoring Matrices

■ Most database search methods, pairwise alignments, and multiple alignments use previously derived matrices (such as PAM or BLOSUM) for scoring the change of an amino acid or nucleotide.

■ These matrices are based on known protein families, and the probabilities drawn from them are generalized to all sequences.

Scoring Matrices

■ The PAM family (Dayhoff) is based on evolutionary distance. The matrices were derived from closely related sequences and the mutations seen in them.

■ The BLOSUM family (Henikoff and Henikoff) was derived from more distantly related sequences. The number of the matrix is the percent identity at which sequences were clustered when the matrix was built (e.g. 62% for BLOSUM62).

New Method:

• What we’d like to do is derive a scoring matrix from our specific family of sequences.

• This takes into account which positions are absolutely unchangeable and which are more flexible. It is not a generalized score based on all proteins available, but is based only on those that are relevant to a specific family of proteins.
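One way to see which positions are "unchangeable" and which are "flexible" is to measure the variability of each alignment column. The sketch below is only an illustration, not a method prescribed by these slides: it computes the Shannon entropy of each column of a toy alignment. The alignment is the first eight columns of four segments from the alcohol dehydrogenase block shown later in this deck; the truncation and the function names are assumptions made for the example.

```python
import math
from collections import Counter

# Toy alignment (assumption): first 8 columns of four segments taken from
# the ADH_IRON_1 block that appears later in this deck.
ALIGNMENT = [
    "CHSMAIKL",
    "VHGMAHPL",
    "CHSMAHKT",
    "VHAMAHQL",
]

def column_entropy(column):
    """Shannon entropy (bits) of one alignment column; 0 means fully conserved."""
    counts = Counter(column)
    total = len(column)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

def conservation_profile(alignment):
    """Entropy of every column, left to right."""
    return [column_entropy(col) for col in zip(*alignment)]

profile = conservation_profile(ALIGNMENT)
for position, entropy in enumerate(profile, start=1):
    label = "conserved" if entropy == 0 else "variable"
    print(position, round(entropy, 3), label)
```

Columns 2, 4 and 5 (H, M, A) come out with zero entropy, so a family-specific matrix would penalise any change there, while the other columns tolerate substitutions.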


Terminology

■ Motif

■ Pattern

■ Profile

Terminology: Motif

■ Motif - small conserved region within a large sequence.

■ Also called domains

■ Two types:

- Functional, no relation to context (SH2, glycosylation)

- Indications of family relationship (cytokine receptor superfamily)

Terminology: Pattern

■ Pattern (1) - small motifs

■ Pattern (2) - a region containing several motifs, which can also contain gaps.

Terminology: Profile

■ Profile - position specific matrix built from a multiple alignment of a group of sequences.

■ Different tools are used for each of the above.

We can search databases made of motifs and profiles, or use motifs and profiles to search sequence databases, and in some cases use profiles and motifs to search profile and motif databases.

Search with: fixed expressions

For example: SHIFRA or IRIT

Advantages: simple, fast searching. Can reduce noise of non-conserved residues.

Problems:

1) Demands exact match, no provision for similarity (conservative change)

2) Only some of the information contained in a given protein or domain is used

3) An exact match is not necessarily a true hit; there is no context.

Search with: Patterns

For example: C-X{1,13}-C-[IVML] [ST]-H-[IVML]-[FYW]-[RK]-A

Advantages:

- More information, more likely to find distant matches

Disadvantages:

- More “noise”, may add irrelevant sequences

- Some context, but still demands exact matches
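Patterns written in this dash-separated notation map almost directly onto regular expressions. The following is a minimal sketch, assuming the notation used on the slide (single letters, bracketed residue classes, and `X{m,n}` for a variable-length spacer); `pattern_to_regex` is a hypothetical helper, not part of any of the tools discussed, and the test sequence is made up.

```python
import re

def pattern_to_regex(pattern):
    """Translate a dash-separated pattern into a compiled regular expression."""
    parts = []
    for element in pattern.split("-"):
        element = element.strip()
        # X (or x) stands for any residue; a trailing {m,n} range is kept as is.
        if element[:1] in ("X", "x"):
            element = "." + element[1:]
        parts.append(element)
    return re.compile("".join(parts))

regex = pattern_to_regex("[ST]-H-[IVML]-[FYW]-[RK]-A")
print(regex.pattern)            # [ST]H[IVML][FYW][RK]A

# Hypothetical test sequence, for illustration only
match = regex.search("MKTSHIWKAGG")
print(match.start() + 1, match.group())   # 1-based start and matched text
```

Every position must still match one of the listed residues exactly, which is the "still demands exact matches" limitation noted above.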

Size is important…..

Query                         swissprot   uniprot
IRIT                          713         10,143
SARA                          1,797       36,088
AVIAD                         185         2,591
RACHEL                        0           26
SHIFRA                        2           19
[ST]-H-[IVML]-[FYW]-[RK]-A    29          395
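A rough back-of-envelope calculation shows why short words turn up so often. The sketch below assumes that residues are independent and equally frequent (real amino acid frequencies are not uniform, so the slide's counts will differ) and uses a hypothetical database size of 100 million residues; neither assumption comes from the slides.

```python
ALPHABET_SIZE = 20

def match_probability(pattern):
    """Chance that one random window matches, under a uniform residue model."""
    prob = 1.0
    for element in pattern.split("-"):
        if element.startswith("["):
            # A bracketed class matches any of the residues listed inside it.
            prob *= (len(element) - 2) / ALPHABET_SIZE
        else:
            prob *= 1 / ALPHABET_SIZE
    return prob

def expected_hits(pattern, n_residues):
    """Expected number of chance matches in a database of n_residues."""
    return match_probability(pattern) * n_residues

N_RESIDUES = 100_000_000  # hypothetical database size, for illustration only

for pattern in ("I-R-I-T", "S-H-I-F-R-A", "[ST]-H-[IVML]-[FYW]-[RK]-A"):
    print(pattern, round(expected_hits(pattern, N_RESIDUES), 2))
```

Under these assumptions a four-letter word is expected roughly 625 times by chance, a six-letter word fewer than twice, and the six-position pattern about 75 times, which is the same ordering seen in the table: shorter and looser queries collect more chance hits, and more so as the database grows.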

Search with: Profiles

Example: PSSM

Advantages:

1) Profile searches include maximal information.

2) Use most rigorous algorithms

Problems:

1) Slow searches due to rigor. Demands a powerful computer and lots of computer time.

2) If a mistake enters the profile, you may end up with irrelevant data.

Position Specific Scoring Matrix (PSSM)

■ Specific for each family of sequences

■ A matrix of size 20 x the sequence length: one vector of 20 amino acid scores for each position

■ Many methods exist for deriving them

D K L W S E W S
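As the slide says, many methods exist for deriving a PSSM. The sketch below shows one simple variant, chosen here only for illustration: log-odds scores from column counts with a fixed pseudocount and a uniform background. The pseudocount, the background, and the toy alignment (the first eight columns of four segments from the alcohol dehydrogenase block shown later in this deck) are all assumptions.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Toy ungapped alignment (assumption): truncated segments from the
# ADH_IRON_1 block shown later in this deck.
ALIGNMENT = [
    "CHSMAIKL",
    "VHGMAHPL",
    "CHSMAHKT",
    "VHAMAHQL",
]

def build_pssm(alignment, pseudocount=1.0):
    """Return one dict per column mapping each residue to a log2-odds score."""
    background = 1 / len(AMINO_ACIDS)          # uniform background (assumption)
    pssm = []
    for column in zip(*alignment):
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

pssm = build_pssm(ALIGNMENT)
print(len(pssm), "columns x", len(pssm[0]), "residues")
print(round(pssm[1]["H"], 2), round(pssm[1]["W"], 2))  # conserved H vs unseen W
```

The result has one 20-entry score vector per alignment position, matching the "20 x sequence length" shape described above. The pseudocount keeps unseen residues from scoring minus infinity.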

Advantages of PSSM

■ Weights sequence according to observed diversity specific to the family of interest

■ Minimal assumptions

■ Easy to compute

■ Can be used in comprehensive evaluations

Henikoff and Henikoff (1994) J. Mol. Biol. 243:574-578

Position Specific Scoring Matrix

• A PSSM can be used to search a sequence or a group of sequences (a database) for the location(s) of the motif(s) it represents.

• It is important that the PSSM represents the expected motifs (sites) as well as possible.

• When producing a PSSM, the larger the number of sequences in the alignment, the more likely it is that the PSSM will represent the motif well.

Position Specific Scoring Matrix

• If the dataset used in building the PSSM is small, then unless the motif has almost identical amino acids in each column, the column frequencies may not be representative of all other occurrences of the motif.

• This means we may miss true hits

What we expect from Motif or Pattern Analysis Tools

Identification of very distant homologs.

May point to important functional units in a sequence.

Can be used to “anchor” or break up a multiple alignment.

Databases of motifs can be used to develop other informatics applications.

Steps of search

■ Initial similarity search of a query against a sequence database

■ Multiple alignment of the hits

■ Derivation of a motif/pattern/profile

■ Pattern/profile database search

To be considered

■ Borders of the query segment

■ Scoring matrices

■ Gap penalties

■ Filters

■ Choice of the databases

■ Weighting sequences

■ Weighting positions

Motif Databases

■ There are motif databases such as Prosite, Prints, Sbase, etc.

■ Functional sites of protein families are stored in these databases. Usually the database provides an excellent description for the motifs.

DNA Motifs

■ TransFac is a database of eukaryotic cis-acting regulatory DNA elements and trans-acting factors. It covers the whole range from yeast to human.

■ TransFac contains the following data types: Sites, Consensus patterns, and Matrices

http://www.gene-regulation.com

ProSite Database of Protein families & domains

PROSITE is a method of determining the function of uncharacterized proteins translated from genomic or cDNA sequences.

The PROSITE database consists of biologically significant sites and patterns, formulated in such a way that with appropriate computational tools it can rapidly and reliably identify to which known family of proteins (if any) the new sequence belongs.

Prosite Database

• In some cases the sequence of an unknown protein is too distantly related to any protein of known structure, and its resemblance is undetectable by overall sequence alignment.

• However, it can be identified by the occurrence of a particular pattern in its sequence.

• This pattern can be one of the following types: pattern, motif, signature, or fingerprint.

• These motifs usually point to a function.

ProSite Database of Protein families & domains

Citation:

Hofmann K., Bucher P., Falquet L., Bairoch A.

The PROSITE database, its status in 1999

Nucleic Acids Res. 27:215-219(1999).

Prints Database

PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family.

Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D-space.

Fingerprints are observed in sequence alignments; taken together, the motifs characterise the aligned family and hence provide a specific diagnostic signature.

Prints Database

Fingerprints thus derive much of their potency from the biological context afforded by matching multiple motifs; this makes them at once more flexible and more powerful than single-motif approaches.

The technique further departs from other pattern-matching methods by readily allowing the creation of discriminators at super-family, family and sub-family-specific levels.

Prints Database

Paper:

PRINTS and PRINTS-S shed light on protein ancestry

T. K. Attwood, M. J. Blythe, D. R. Flower, A. Gaulton, J. E. Mabey, N. Maudling, L. McGregor, A. L. Mitchell, G. Moulton, K. Paine and P. Scordis

Nucleic Acids Research, 2002, Vol. 30, No. 1 239-241

Prints Database

Prints (version from 6/05) includes 1900 fingerprints, encoding ~11435 motifs, covering a range of globular and membrane proteins, modular polypeptides and so on.

The PRINTS-S database models relationships between families, including those beyond the reach of conventional sequence analysis approaches.

The database is accessible for BLAST, fingerprint and text searches at: http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/

PFAM database

• Pfam is a collection of multiple protein alignments and HMMs.

• Pfam is divided into 2 sections:

• PfamA - set of manually curated and annotated models

• PfamB - fully automated models created from alignments generated by ProDom automatic protein clustering of SwissProt.

PFAM database

• Pfam version 19, from December 2005, has 8183 families.

• The database is accessible from http://www.sanger.ac.uk/Software/Pfam/

Pfam Abstract:

The Pfam Protein Families Database

Alex Bateman, Lachlan Coin, Richard Durbin, Robert D. Finn, Volker Hollich, Sam Griffiths-Jones, Ajay Khanna, Mhairi Marshall, Simon Moxon, Erik L. L. Sonnhammer, David J. Studholme, Corin Yeats and Sean R. Eddy

Nucleic Acids Research (2004) Database Issue 32:D138-D141

SMART database (Simple Modular Architecture Research Tool)

SMART is based on curated HMM models of multiple protein alignments of representative members of protein families found with PSI-Blast.

Once a model is created, it is used to search the databases for additional family members. When found, these additions are entered into the multiple alignment and a new HMM is built.

SMART database

• Genomic SMART contains the proteomes of completely sequenced genomes.

• SMART is accessible from:

http://smart.embl-heidelberg.de/

SMART database

• Abstract:

• Ponting, C.P., Schultz, J., Milpetz, F. & Bork, P.

SMART: identification and annotation of domains from signalling and extracellular protein sequences.

Nucleic Acids Res 1999; 27: 229-232

• Schultz, J., Milpetz, F., Bork, P. & Ponting, C.P.

SMART, a simple modular architecture research tool: Identification of signaling domains.

PNAS 1998; 95: 5857-5864

TIGRFAMs database

• TIGRFAMs is a collection of protein families, featuring curated multiple sequence alignments, hidden Markov models (HMMs) and annotation, which provides a tool for identifying functionally related proteins based on sequence homology. Those entries which are "equivalogs" group homologous proteins which are conserved with respect to function.

• TIGRFAMs models are built similarly to those built by Pfam, but are used for proteins that have the same function.

InterPro Database

InterPro is a database of protein domains and functional sites that combines the search strategies of several signature-recognition methods for best results.

These various methods address different sequence analysis problems, resulting in rather different and, for the most part, independent databases.

Diagnostically, each method has different areas of optimum application owing to the different strengths and weaknesses of their underlying analysis methods.

InterPro Database

InterPro (The InterPro Consortium 2001) is a collaborative project aimed at providing an integrated layer on top of the most commonly used signature databases by creating a unique, non-redundant characterisation of a given protein family, domain or functional site.

InterPro Database

InterPro data is distributed in XML format and is freely available under the InterPro Consortium copyright.

The InterPro project home page is available at http://www.ebi.ac.uk/interpro

The current version (15) of InterPro contains 14764 entries.

Types of search

• Search with a sequence against a motif database

• Search with a pattern against a sequence database

• Build your own PSSM

• Search against a database of PSSMs

• Search with a PSSM against a database

Search with a sequence to find motifs

■ The simplest search is to use a single sequence, and search against a database of motifs.

■ This kind of search is very fast, but does not provide any significance estimates.

■ An example of Motifs:

ATP-binding [AG] xxxx G K [ST]

Phosphorylation site [ST] x [RK]
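Scanning one sequence against a small collection of such motifs can be sketched in a few lines. The example below hand-translates the two motifs above into regular expressions (treating each `x` as "any residue", which is an assumption about the notation) and scans the DYN1_HUMAN segment from the Block Maker output shown later in this deck. `MOTIFS` and `scan_motifs` are hypothetical names for this illustration.

```python
import re

# The two example motifs, hand-translated to regular expressions
MOTIFS = {
    "ATP-binding": re.compile(r"[AG].{4}GK[ST]"),
    "Phosphorylation site": re.compile(r"[ST].[RK]"),
}

def scan_motifs(sequence, motifs=MOTIFS):
    """Return (motif name, 1-based start, matched text) for every hit."""
    hits = []
    for name, regex in motifs.items():
        # A lookahead lets overlapping occurrences be reported too.
        for m in re.finditer(f"(?=({regex.pattern}))", sequence):
            hits.append((name, m.start() + 1, m.group(1)))
    return hits

# DYN1_HUMAN segment from the dynamin block (positions are relative to it)
segment = "GQNADLDLPQIAVVGGQSAGKSSVLENFVG"
print(scan_motifs(segment))     # [('ATP-binding', 15, 'GGQSAGKS')]
```

The scan reports where a motif occurs but, as noted above, says nothing about significance: a three-residue motif such as `[ST] x [RK]` will match many unrelated proteins by chance.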

Search with a Consensus Pattern

■ A consensus pattern - a string of characters, where characters at certain positions are “conserved” and are separated by “unimportant” positions.

■ This type of searching, where important positions are filtered, is successful in finding distantly related sequences.

Pattern length

■ The choice of pattern length is very important for database searches, and it should be chosen carefully to enable the program used to give the best results.

■ The use of logical operators is also important for pattern searches, because they can change the results.

For example: enabling the use of mismatches in the pattern searched.

FindPatterns in GCG

■ The program used for pattern searching in GCG is findpatterns.

■ The program can be used with a single sequence or a group of sequences (e.g. a database).

■ By default, the program will look for a perfect match, but the user can also allow mismatches.
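The effect of allowing mismatches can be illustrated with a plain sliding-window comparison. This is a generic sketch, not the algorithm or interface of findpatterns itself; the function name and the test sequence are made up for the example.

```python
def find_with_mismatches(sequence, word, max_mismatches=0):
    """Return (1-based start, matched segment, mismatch count) for each hit."""
    hits = []
    for start in range(len(sequence) - len(word) + 1):
        window = sequence[start:start + len(word)]
        # Count positions where the window and the word disagree
        mismatches = sum(a != b for a, b in zip(window, word))
        if mismatches <= max_mismatches:
            hits.append((start + 1, window, mismatches))
    return hits

sequence = "MSHIFRAKKSHIFKA"   # hypothetical sequence
print(find_with_mismatches(sequence, "SHIFRA"))
# [(2, 'SHIFRA', 0)]
print(find_with_mismatches(sequence, "SHIFRA", max_mismatches=1))
# [(2, 'SHIFRA', 0), (10, 'SHIFKA', 1)]
```

Allowing one mismatch recovers the second, slightly diverged copy, at the cost of more chance hits in a large database.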

FindPatterns in GCG

■ Findpatterns can read patterns from the keyboard, or from a file.

■ You can search for several patterns simultaneously.

■ You can use a consensus sequence. For example: FCT(V,I)x{2,10}CA

Blocks database and tools

■ Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.

■ The Blocks web server tools are: Block Searcher, Get Blocks and Block Maker. These are aids to detection and verification of protein sequence homology.

■ They compare a protein or DNA sequence to a database of protein blocks, retrieve blocks, and create new blocks, respectively.


The BLOCKS web server

http://blocks.fhcrc.org/

http://bioinfo.weizmann.ac.il/blocks

Refs: Henikoff S and Henikoff JG

The BLOCKS Database

The blocks for the BLOCKS database are made automatically by looking for the most highly conserved regions in groups of proteins represented in the PROSITE database. These blocks are then calibrated against the SWISS-PROT database to obtain a measure of the chance distribution of matches. It is these calibrated blocks that make up the BLOCKS database.


The Blocks Searcher tool

For searching a database of blocks, the first position of the sequence is aligned with the first position of the first block, and a score for that amino acid is obtained from the profile column corresponding to that position. Scores are summed over the width of the alignment, and then the block is aligned with the next position

CHSMAIKLSSEHNIPSGIANAL
VHGMAHPLGAFYNTPHGVANAI
HNGFTALEGEIHHLTHGEKVAF
VHNGLTAIPDAHHYYHGEKVAF
VHSISHQVGGVYKLQHGICNSV
CHSMAHKTGAVFHIPHGCANAI
CHSMAHKLGSQFHIPHGLANAL
VHAMAHQLGGYYNLPHGVCNAV
VHALAHQLGGFYHLPHGVCNAV
CHPMEHELSAYYDITHGVGLAI
VHLMEHELSAYYDITHGVGLAI

ASDFKDELRVC


The Blocks Searcher

This procedure is carried out exhaustively for all positions of the sequence for all blocks in the database, and the best alignments between a sequence and entries in the BLOCKS database are noted. If a particular block scores highly, it is possible that the sequence is related to the group of sequences the block represents.
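The sliding procedure described above can be sketched for a single block as follows. This is a simplified illustration, not the Block Searcher implementation: the profile is a plain log-odds matrix with a pseudocount and a uniform background, built from only three segments of the ADH_IRON_1 block in this deck, and the query is a hypothetical sequence made by flanking a fourth block member (ADH2_ZYMMO) with arbitrary residues.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Three segments from the ADH_IRON_1 block (ADHE_CLOAB, FUCO_ECOLI, ADH1_CLOAB)
BLOCK = [
    "CHSMAIKLSSEHNIPSGIANAL",
    "VHGMAHPLGAFYNTPHGVANAI",
    "CHSMAHKTGAVFHIPHGCANAI",
]

def block_to_profile(block, pseudocount=1.0):
    """One dict of log2-odds scores per block column (uniform background)."""
    background = 1 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*block):
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2((column.count(aa) + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return profile

def search_block(sequence, profile):
    """Slide the profile along the sequence; return (best score, 1-based start)."""
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        # Sum the column scores over the width of the alignment
        score = sum(profile[i][sequence[start + i]] for i in range(width))
        if best is None or score > best[0]:
            best = (score, start + 1)
    return best

# Hypothetical query: arbitrary flanks around the ADH2_ZYMMO block segment
query = "MKKLL" + "VHAMAHQLGGYYNLPHGVCNAV" + "DDEEK"
profile = block_to_profile(BLOCK)
score, start = search_block(query, profile)
print(round(score, 2), start)
```

The best-scoring window starts at position 6, exactly where the block member was embedded. A full search would repeat this for every block in the database and rank the scores.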


The Blocks Searcher tool

Typically, a group of proteins has more than one region in common and their relationship is represented as a series of blocks separated by unaligned regions. If a second block for a group also scores highly in the search, the evidence that the sequence is related to the group is strengthened, and is further strengthened if a third block also scores it highly, and so on.

The Block Maker Tool

Block Maker finds conserved blocks in a group of two or more unaligned protein sequences, which are assumed to be related.

The input file must contain at least 2 sequences.

Input sequences must be in FastA format.

The Block Maker tool

This program uses different algorithms to create regions of local alignment. There are two steps to the program:

The first step finds candidate alignments. This is done using two different algorithms.

The best alignments from both methods are passed on to the second step, an algorithm called MOTOMAT (Henikoff 1991)

The Block Maker tool

This algorithm extends the alignments, scores them, and then sorts them in such a way that a best set ("best path") is chosen.

Motomat will not attempt to realign sequences that don’t fit - it discards them.

At the end it produces two sets of blocks, one for each of the original alignment methods.

**BLOCKS from MOTIF**
>dynamin MX3_RAT P18590 rattus norvegicus (rat). interferon... family
7 sequences are included in 4 blocks

dynaminA, width = 30
DYN1_HUMAN   24 GQNADLDLPQIAVVGGQSAGKSSVLENFVG
MGM1_YEAST  168 SSSAHLTLPSIVVIGSQSSGKSSVLESIVG
MX1_ANAPL   137 GIEKDLSLPAIAVIGDQSSGKSSILEALSG
MX2_HUMAN   140 GVEQDLALPAIAVIGDQSSGKSSVLEALSG
MX3_RAT      61 GVEQDLALPAIAVIGDQSSGKSSVLEALSG
MX_SHEEP     58 GVEQDLALPAIAVIGDQSSGKSSVLEALSG
VPS1_YEAST   27 GSQSPIDLPQITVVGSQSSGKSSVLENIVG

**BLOCKS from GIBBS**
>dynamin MX3_RAT P18590 rattus norvegicus (rat). interferon... family
6 sequences are included in 6 blocks

dynaminA, width = 30
DYN1_HUMAN   24 GQNADLDLPQIAVVGGQSAGKSSVLENFVG
MX1_ANAPL   137 GIEKDLSLPAIAVIGDQSSGKSSILEALSG
MX2_HUMAN   140 GVEQDLALPAIAVIGDQSSGKSSVLEALSG
MX3_RAT      61 GVEQDLALPAIAVIGDQSSGKSSVLEALSG
MX_SHEEP     58 GVEQDLALPAIAVIGDQSSGKSSVLEALSG
VPS1_YEAST   27 GSQSPIDLPQITVVGSQSSGKSSVLENIVG

ID ADH_IRON_1; BLOCK
AC BL00913C; distance from previous block=(56,76)
DE Iron-containing alcohol dehydrogenases proteins.
BL HHG motif; width=22; seqs=11; 99.5%=492 strength =1428
ADHE_CLOAB ( 720) CHSMAIKLSSEHNIPSGIANAL  66
FUCO_ECOLI ( 262) VHGMAHPLGAFYNTPHGVANAI  44
GLDA_BACST ( 259) HNGFTALEGEIHHLTHGEKVAF 100
GLDA_ECOLI ( 269) VHNGLTAIPDAHHYYHGEKVAF 100
MEDH_BACMT ( 259) VHSISHQVGGVYKLQHGICNSV  78
ADH1_CLOAB ( 258) CHSMAHKTGAVFHIPHGCANAI  47
ADHE_ECOLI ( 721) CHSMAHKLGSQFHIPHGLANAL  47
ADH2_ZYMMO ( 261) VHAMAHQLGGYYNLPHGVCNAV  36
ADH4_YEAST ( 263) VHALAHQLGGFYHLPHGVCNAV  41
ADHA_CLOAB ( 266) CHPMEHELSAYYDITHGVGLAI  50
ADHB_CLOAB ( 266) VHLMEHELSAYYDITHGVGLAI  49

COBBLER

Consensus Biasing By Locally Embedding Residues

Computes a consensus sequence from a block and embeds it into the sequence from that block that is closest to the consensus

Ref: Henikoff and Henikoff (1997) Protein Science 6: 698-705
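The consensus-and-embed idea can be illustrated with a minimal sketch. It assumes an unweighted majority vote per column, which is a simplification and not the published COBBLER procedure; the function names and the toy full-length sequence are hypothetical. The block rows are the dynaminA segments from the MOTIF output shown earlier.

```python
from collections import Counter

# dynaminA block segments (ungapped, width 30) from the MOTIF output above
BLOCK = [
    "GQNADLDLPQIAVVGGQSAGKSSVLENFVG",  # DYN1_HUMAN
    "SSSAHLTLPSIVVIGSQSSGKSSVLESIVG",  # MGM1_YEAST
    "GIEKDLSLPAIAVIGDQSSGKSSILEALSG",  # MX1_ANAPL
    "GVEQDLALPAIAVIGDQSSGKSSVLEALSG",  # MX2_HUMAN
    "GVEQDLALPAIAVIGDQSSGKSSVLEALSG",  # MX3_RAT
    "GVEQDLALPAIAVIGDQSSGKSSVLEALSG",  # MX_SHEEP
    "GSQSPIDLPQITVVGSQSSGKSSVLENIVG",  # VPS1_YEAST
]

def block_consensus(segments):
    """Most common residue in each column (unweighted; an assumption)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*segments))

def closest_to_consensus(segments, consensus):
    """Index of the first segment with the most identities to the consensus."""
    def identities(seg):
        return sum(a == b for a, b in zip(seg, consensus))
    return max(range(len(segments)), key=lambda i: identities(segments[i]))

def embed(full_sequence, start, consensus):
    """Overwrite the block region of a full-length sequence with the consensus."""
    end = start + len(consensus)
    return full_sequence[:start] + consensus + full_sequence[end:]

consensus = block_consensus(BLOCK)
closest = closest_to_consensus(BLOCK, consensus)

# Hypothetical toy sequence: 3 flanking residues, a 4-residue block, 3 more
embedded = embed("MKTXXXXEND", 3, "GVEQ")
```

Because three of the seven rows are identical Mx segments, the unweighted vote is dominated by them; this is one reason real block tools apply sequence weights.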

COBBLER

■ Can be used as an input sequence for database searches (FastA, PSI-Blast)

■ Has the advantage of information from multiple alignment of a protein family

■ Helps in ‘between-motif’ regions, where there tend to be regions with large sequence diversity

PSI-BLAST

■ Position-Specific Iterated BLAST

■ Runs one round of gapped Blast, and then builds a PSSM (position-specific scoring matrix)

■ The PSSM is used as the input for the following rounds of Blast

■ Ref: Altschul et al (1997) Nucleic Acids Research 25 (17) 3389-3402
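To make the idea of a PSSM concrete, here is a minimal sketch that turns a small set of aligned segments into per-column log-odds scores and scores a candidate segment against them. The uniform background frequency and the flat pseudocount of 1 are assumptions chosen for brevity; PSI-BLAST itself uses weighted counts and substitution-matrix-based pseudocounts. The function names and the three-residue example rows are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, pseudocount=1.0):
    """One dict of log-odds scores per alignment column.

    Assumes a uniform background (1/20) and a flat pseudocount.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*aligned):
        residues = [r for r in column if r in AMINO_ACIDS]  # skip gap characters
        total = len(residues) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (residues.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

def score(pssm, segment):
    """Sum of the column scores for an ungapped segment of the same length."""
    return sum(col[aa] for col, aa in zip(pssm, segment))

pssm = build_pssm(["GKS", "GKS", "GKT"])
member_score = score(pssm, "GKS")
other_score = score(pssm, "AAA")
```

A residue that is conserved in a column gets a positive score there, so a family member scores higher than an unrelated segment.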

Producing the PSSM

■ The PSSM has the same length as the query sequence

■ All database segments with an E-value of less than 0.01 are taken for the multiple alignment

■ The query sequence is the template for the alignment

■ Sequences identical to the query are discarded

Producing the PSSM

■ One copy of sequences with more than 98% identity to each other is used

■ Gaps are ignored in the alignment, and treated as an independent character in the alignment weighting (no additional penalty)

■ Reduce the size of the matrix per position to only those columns that are contained in all rows
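The row-selection rules listed so far (E-value cutoff, dropping sequences identical to the query, keeping one copy of near-identical rows) can be sketched as below. The hit tuples, sequence strings and function names are hypothetical, and identity is computed over positions where neither row has a gap, which is an assumption made for this sketch.

```python
def percent_identity(a, b):
    """Percent identity over aligned positions where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def select_rows(query, hits, e_cutoff=0.01, redundancy=98.0):
    """Pick the rows that go into the multiple alignment.

    `hits` is a list of (name, aligned_segment, e_value); every segment is
    assumed to be aligned to the query and padded with '-' for gaps.
    """
    kept = []
    for name, aligned, e_value in hits:
        if e_value >= e_cutoff:
            continue                      # not significant enough
        if percent_identity(aligned, query) == 100.0:
            continue                      # identical to the query
        if any(percent_identity(aligned, other) > redundancy
               for _, other in kept):
            continue                      # near-duplicate of a kept row
        kept.append((name, aligned))
    return kept

query = "ACDEFGHIKL"
hits = [
    ("h1", "ACDEFGHIKL", 1e-30),  # identical to the query
    ("h2", "ACDEYGHIKL", 1e-10),
    ("h3", "ACDEYGHIKL", 1e-9),   # duplicate of h2
    ("h4", "ACD-FGHLKL", 0.5),    # E-value too high
    ("h5", "SCDEFGHLKV", 0.001),
]
selected = [name for name, _ in select_rows(query, hits)]
```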

Producing the PSSM

■ Can have different numbers of sequences in each row

■ Weights are calculated over the whole alignment; gaps are counted as an independent character; columns with identical residues are ignored in the weight calculation
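The weighting rule can be sketched with position-based sequence weights in the style of Henikoff and Henikoff (1994), with the gap character counted as one more symbol and fully conserved columns skipped. This is an illustrative sketch with hypothetical names and toy rows; it does not reproduce PSI-BLAST's per-column reduced alignments.

```python
from collections import Counter

def position_based_weights(rows):
    """Normalised position-based weights for equal-length aligned rows.

    '-' is counted as its own character; columns in which every row
    carries the same character are ignored.
    """
    weights = [0.0] * len(rows)
    for column in zip(*rows):
        counts = Counter(column)
        if len(counts) == 1:
            continue                      # fully conserved column: skip
        for i, ch in enumerate(column):
            # each distinct character shares 1/k, split among its carriers
            weights[i] += 1.0 / (len(counts) * counts[ch])
    total = sum(weights)
    if total == 0:
        return [1.0 / len(rows)] * len(rows)
    return [w / total for w in weights]

weights = position_based_weights(["GKSA", "GKSA", "GKT-"])
```

In this toy alignment the two identical rows share their weight, so the divergent third row ends up with half of the total.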

Iteration

■ PSI-Blast continues until no new proteins with E-value of less than 0.01 are found

■ Adds the new sequences in each round to the PSSM

■ User has the choice to manually edit (force sequences in or out) the input to the alignment
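The stopping rule can be sketched as a simple loop. Here `search` is a hypothetical callable standing in for one Blast round, and `toy_search` is an invented stand-in that only reports a distant homolog once a close one has been included; neither is part of any real BLAST API. The `max_rounds` cap is also an assumption added for safety.

```python
def iterate_search(query, search, e_cutoff=0.01, max_rounds=10):
    """Repeat the search until a round adds no new sequence below the cutoff.

    `search(query, included)` returns (name, e_value) pairs; `included` is
    the set of names that would be used to rebuild the PSSM.
    """
    included = set()
    for round_no in range(1, max_rounds + 1):
        hits = search(query, frozenset(included))
        new = {name for name, e_value in hits if e_value < e_cutoff} - included
        if not new:
            return included, round_no     # converged: nothing new this round
        included |= new                   # new sequences feed the next PSSM
    return included, max_rounds

def toy_search(query, included):
    """Hypothetical search: the profile 'sees' further once it has grown."""
    hits = [("close_homolog", 1e-20), ("unrelated", 3.0)]
    if "close_homolog" in included:
        hits.append(("distant_homolog", 1e-4))
    return hits

found, rounds = iterate_search("GSQSPIDLPQ", toy_search)
```

Manual editing would correspond to adding names to, or removing them from, `included` between rounds.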

PHI-Blast

■ Pattern Hit Initiated

■ Uses a pattern P, together with a query sequence S that contains it, as input

■ Output can be used as an input for PSI-Blast
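PHI-Blast patterns are written in PROSITE-style syntax. As a rough sketch of the pattern-matching half only (not the homology check around the match), the snippet below converts such a pattern to a Python regular expression and locates it in the DYN1_HUMAN segment from the dynaminA block. The converter is hypothetical, handles only the common syntax elements, and assumes lowercase `x` as the wildcard.

```python
import re

def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern to a Python regex (common cases only)."""
    regex = pattern.rstrip(".")
    regex = regex.replace("-", "")                       # element separators
    regex = regex.replace("{", "[^").replace("}", "]")   # {P} = anything but P
    regex = regex.replace("(", "{").replace(")", "}")    # x(4) = repeat count
    regex = regex.replace("x", ".")                      # x = any residue
    regex = regex.replace("<", "^").replace(">", "$")    # N-/C-terminal anchors
    return regex

# P-loop pattern, matched against the DYN1_HUMAN block segment
p_loop = prosite_to_regex("[AG]-x(4)-G-K-[ST]")
segment = "GQNADLDLPQIAVVGGQSAGKSSVLENFVG"
match = re.search(p_loop, segment)
```

A plain pattern scan like this reports every occurrence, including chance ones; the point of PHI-Blast is to keep only occurrences whose surroundings also align well with the query.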

“PHI-BLAST helps answer the question:

What other protein sequences both contain an occurrence of P and are homologous to S in the vicinity of the pattern occurrences?

PHI-BLAST may be preferable to just searching for pattern occurrences because it filters out those cases where the pattern occurrence is probably random and not indicative of homology.”
