Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Dec 31, 2015

Aldous Farmer
Page 1: Design and creation of multiple sequence alignments

Unit 15

BIOL221T: Advanced Bioinformatics for Biotechnology

Irene Gabashvili, PhD

Page 2: IPA 6.0 license

• Need a list of e-mails to create accounts

• Will have a 6-week license (instead of 2 weeks)

• Problem Set 3 is Pathway Analysis; the lab of March 19 will be on using IPA too

Page 3: Problem Set 2 Review

• Sensitivity and Specificity
• Parameters for Multiple Alignment (Databases, Search Terms, Scores)
• Transfac
• Dotplots

Page 4: Gene prediction flowchart

Page 5: Evaluation of Splice Site Prediction

Fig 5.11, Baxevanis & Ouellette 2005

What do measures really mean?

Note typo in B&O

Page 6: ROC curves (plots of Sn vs (1-Sp))

A receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot of the sensitivity vs. (1 - specificity) for a binary classifier system as its discrimination threshold is varied.

The sensitivity and specificity of a diagnostic test depend on more than just the "quality" of the test; they also depend on the definition of what constitutes an abnormal test.
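The dependence on the threshold can be made concrete with a minimal sketch. The scores and labels below are hypothetical, and the function names are illustrative only; the code simply recomputes Sn and (1 - Sp) at every possible cut-off.

```python
def roc_points(scores, labels):
    """Return (1 - Sp, Sn) points obtained by lowering the threshold step by step."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC points (1.0 = perfect, 0.5 = random)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical classifier scores and true labels (1 = real site, 0 = not a site).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
print(curve)
print(round(area_under_curve(curve), 3))
```

Each point on the curve is one choice of "what constitutes an abnormal test"; the test itself has not changed.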

Page 7: Evaluation of Splice Site Prediction

                   Actual True    Actual False
Predicted True         TP             FP          PP = TP+FP
Predicted False        FN             TN          PN = FN+TN
                   AP = TP+FN     AN = FP+TN

• Sensitivity: Sn = TP/AP (= Coverage)

• Specificity: Sp = TP/PP = (1 - α)/(1 - α + rβ), with r = AN/AP

• Misclassification rates: α = FN/AP, β = FP/AN

• Normalized specificity: σ = (1 - α)/(1 - α + β)

Page 8: Careful: different definitions for "Specificity"

                   Actual True    Actual False
Predicted True         TP             FP          PP = TP+FP
Predicted False        FN             TN          PN = FN+TN
                   AP = TP+FN     AN = FP+TN

Brendel definitions

• Specificity: Sp = TP/PP

cf. Guigó definitions

Sn: Sensitivity = TP/(TP+FN)

Sp: Specificity = TN/(TN+FP) = Sp-

AC: Approximate Correlation = 0.5 x ((TP/(TP+FN)) + (TP/(TP+FP)) + (TN/(TN+FP)) + (TN/(TN+FN))) - 1

Other measures? Predictive Values, Correlation Coefficient
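To see how far the two "specificity" definitions can diverge, here is a small sketch that evaluates them on the same confusion matrix. The counts are hypothetical and the function name is illustrative; the formulas are the ones listed on this slide, with TP/PP written out as TP/(TP+FP).

```python
def splice_site_measures(tp, fp, fn, tn):
    """Evaluate the slide's measures on one set of confusion-matrix counts."""
    sn = tp / (tp + fn)            # sensitivity (coverage)
    sp_tp_based = tp / (tp + fp)   # specificity as TP/PP
    sp_tn_based = tn / (tn + fp)   # specificity as TN/(TN+FP)
    ac = 0.5 * (tp / (tp + fn) + tp / (tp + fp)
                + tn / (tn + fp) + tn / (tn + fn)) - 1
    return {"Sn": sn, "Sp_TP/PP": sp_tp_based, "Sp_TN/AN": sp_tn_based, "AC": ac}


# Hypothetical counts: true sites are rare compared with non-sites.
measures = splice_site_measures(tp=80, fp=20, fn=20, tn=880)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With rare true sites, the TN-based value stays close to 1 even though one prediction in five is wrong, which is why the definition in use should always be stated.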

Page 9: Best measures for comparing different methods?

• ROC curves (Receiver Operating Characteristic?!!)

http://www.anaesthetist.com/mnm/stats/roc/

"The Magnificent ROC" has fun applets & quotes:

"There is no statistical test, however intuitive and simple, which will not be abused by medical researchers"

• Correlation coefficient: Matthews correlation coefficient (MCC)

MCC = 1 for a perfect prediction
MCC = 0 for a completely random assignment
MCC = -1 for a "perfectly incorrect" prediction

Just FYI
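The slide gives only the three landmark values of MCC. As a sketch, the standard formula MCC = (TP x TN - FP x FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)) reproduces them; returning 0 when the denominator is zero is a common convention and is an assumption here.

```python
import math


def matthews_cc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention (assumption): undefined cases are reported as 0
    return (tp * tn - fp * fn) / denominator


print(matthews_cc(tp=5, fp=0, fn=0, tn=5))      # perfect prediction
print(matthews_cc(tp=25, fp=25, fn=25, tn=25))  # no better than random
print(matthews_cc(tp=0, fp=5, fn=5, tn=0))      # "perfectly incorrect"
```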

Page 10: Promoters

What signals are there?

Simple ones in prokaryotes

Page 11: Prokaryotic promoters

• RNA polymerase complex recognizes promoter sequences located very close to & on the 5’ side (“upstream”) of the initiation site

• RNA polymerase complex binds directly to these, with no requirement for “transcription factors”

• Prokaryotic promoter sequences are highly conserved
  • -10 region
  • -35 region
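Because the two boxes are so conserved, a naive scan is enough to illustrate the idea. The sketch below assumes the textbook E. coli σ70 consensus hexamers (TTGACA at -35, TATAAT at -10) and a 16-18 bp spacer; none of these values come from the slides, and the test sequence is made up.

```python
MINUS_35 = "TTGACA"  # assumed -35 consensus (E. coli sigma-70)
MINUS_10 = "TATAAT"  # assumed -10 consensus (Pribnow box)


def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def find_promoter_candidates(seq, max_mismatch=1, spacers=range(16, 19)):
    """Return (start_of_-35, start_of_-10, spacer) for every candidate pair."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 6 + 1):
        if mismatches(seq[i:i + 6], MINUS_35) > max_mismatch:
            continue
        for gap in spacers:
            j = i + 6 + gap
            box = seq[j:j + 6]
            if len(box) == 6 and mismatches(box, MINUS_10) <= max_mismatch:
                hits.append((i, j, gap))
    return hits


# Made-up sequence: both consensus boxes separated by a 17 bp spacer.
demo = "GC" + "TTGACA" + "A" * 17 + "TATAAT" + "GCGC"
print(find_promoter_candidates(demo))
```

Real promoters deviate from the consensus, so practical tools score matches with weight matrices rather than counting mismatches.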

Page 12: Simpler view of complex promoters in eukaryotes

Fig 5.12, Baxevanis & Ouellette 2005

Page 13: Eukaryotic genes are transcribed by 3 different RNA polymerases

Recognize different types of promoters & enhancers:

Page 14: Eukaryotic promoters & enhancers

• Promoters are located “relatively” close to the initiation site (but can be located within the gene, rather than upstream!)

• Enhancers are also required for regulated transcription (these control expression in specific cell types, developmental stages, in response to environment)

• RNA polymerase complexes do not specifically recognize promoter sequences directly

• Transcription factors bind first and serve as “landmarks” for recognition by RNA polymerase complexes

Page 15: Eukaryotic transcription factors

• Transcription factors (TFs) are DNA binding proteins that also interact with the RNA polymerase complex to activate or repress transcription

• TFs contain characteristic “DNA binding motifs”

http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.table.7039

• TFs recognize specific short DNA sequence motifs: “transcription factor binding sites”

• Several databases for these, e.g. TRANSFAC http://www.generegulation.com/cgibin/pub/databases/transfac

Page 16: Zinc finger-containing transcription factors

• Common in eukaryotic proteins

• Estimated 1% of mammalian genes encode zinc-finger proteins

• In C. elegans, there are 500!

• Can be used as highly specific DNA binding modules

• Potentially valuable tools for directed genome modification (esp. in plants) & human gene therapy

Page 17: Promoter prediction: Eukaryotes vs prokaryotes

Promoter prediction is easier in microbial genomes

Why?
• Highly conserved
• Simpler gene structures
• More sequenced genomes! (for comparative approaches)

Methods?
• Previously: mostly HMM-based
• Now: similarity-based, comparative methods, because so many genomes are available

Page 18: Predicting promoters: Steps & Strategies

Closely related to gene prediction!
• Obtain genomic sequence
• Use sequence-similarity based comparison (BLAST, MSA) to find related genes
  But: "regulatory" regions are much less well-conserved than coding regions
• Locate ORFs
• Identify TSS (if possible!)
• Use promoter prediction programs
• Analyze motifs, etc. in sequence (TRANSFAC)

Page 19: Predicting promoters: Steps & Strategies

Identify TSS, if possible?
• One of the biggest problems is determining the exact TSS!
  Not very many full-length cDNAs!
• Good starting point? (human & vertebrate genes)
  Use FirstEF, found within the UCSC Genome Browser, or submit to the FirstEF web server

Fig 5.10, Baxevanis & Ouellette 2005

Page 20: Automated promoter prediction strategies

1) Pattern-driven algorithms

2) Sequence-driven algorithms

3) Combined "evidence-based"

BEST RESULTS? Combined, sequential

Page 21: Promoter Prediction: Pattern-driven algorithms

• Success depends on availability of collections of annotated binding sites (TRANSFAC & PROMO)

• Tend to produce huge numbers of FPs

• Why?
  • Binding sites (BS) for specific TFs are often variable
  • Binding sites are short (typically 5-15 bp)
  • Interactions between TFs (& other proteins) influence affinity & specificity of TF binding
  • One binding site is often recognized by multiple TFs
  • Biology is complex: promoters are often specific to organism/cell/stage/environmental condition
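A back-of-the-envelope calculation shows why short sites give so many FPs. The sketch below assumes a random sequence with equal, independent base frequencies (a simplification) and counts an exact motif on both strands, ignoring the special case of palindromic motifs; the genome size is a round illustrative figure.

```python
def expected_random_hits(motif_length, sequence_length, both_strands=True):
    """Expected number of exact matches of one fixed motif in random DNA."""
    positions = max(sequence_length - motif_length + 1, 0)
    probability = 0.25 ** motif_length  # assumes A, C, G, T equally likely
    strands = 2 if both_strands else 1
    return positions * probability * strands


genome = 3_000_000_000  # illustrative size, roughly a mammalian genome
for k in (5, 10, 15):
    print(k, round(expected_random_hits(k, genome)))
```

Allowing mismatches, as real binding sites require, raises these numbers further.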

Page 22: Promoter Prediction: Pattern-driven algorithms

Solutions to problem of too many FP predictions?

• Take sequence context/biology into account
  • Eukaryotes: clusters of TFBSs are common
  • Prokaryotes: knowledge of σ (sigma) factors helps
• Probability of "real" binding site increases if an annotated transcription start site (TSS) is nearby
  • But: What about enhancers? (no TSS nearby!)
  • & Only a small fraction of TSSs have been experimentally mapped
• Do the wet lab experiments!
  • But: Promoter-bashing is tedious

Page 23: Promoter Prediction: Sequence-driven algorithms

• Assumption: common functionality can be deduced from sequence conservation
  • Alignments of co-regulated genes should highlight elements involved in regulation

Careful: How to determine co-regulation?
• Orthologous genes from different species
• Genes experimentally determined to be co-regulated (using microarrays??)
• Comparative promoter prediction: "Phylogenetic footprinting" - more later….

Page 24: Promoter Prediction: Sequence-driven algorithms

Problems:
• Need sets of co-regulated genes
• For comparative (phylogenetic) methods
  • Must choose appropriate species
  • Different genomes evolve at different rates
  • Classical alignment methods have trouble with translocations, inversions in order of functional elements
  • If background conservation of the entire region is high, comparison is useless
  • Not enough data (Prokaryotes >>> Eukaryotes)
• Biology is complex: many (most?) regulatory elements are not conserved across species!

Page 25: Examples of promoter prediction/characterization software

Lab: used
• MATCH, MatInspector
• TRANSFAC
• MEME & MAST
• BLAST, etc.

Others?
• FirstEF
• Dragon Promoter Finder
  also see Dragon Genome Explorer (has specialized promoter software for GC-rich DNA, finding CpG islands, etc.)
• JASPAR

Page 26: TRANSFAC matrix entry: for TATA box

Fields:
• Accession & ID
• Brief description
• TFs associated with this entry
• Weight matrix
• Number of sites used to build (How many here?)
• Other info

Fig 5.13, Baxevanis & Ouellette 2005
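How a weight matrix is used can be sketched in a few lines. The count matrix below is a hypothetical 4-column toy, not the TRANSFAC TATA-box entry from the figure; the pseudocount and uniform background are also assumptions. Counts are converted to log-odds scores and the matrix is slid along a sequence.

```python
import math

# Hypothetical counts from 10 aligned sites (NOT the TRANSFAC entry).
COUNTS = {
    "A": [0, 10, 0, 10],
    "C": [0, 0, 0, 0],
    "G": [0, 0, 0, 0],
    "T": [10, 0, 10, 0],
}


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert per-column base counts into log2(observed / background) scores."""
    width = len(counts["A"])
    matrix = []
    for pos in range(width):
        total = sum(counts[b][pos] for b in "ACGT") + 4 * pseudocount
        matrix.append({b: math.log2(((counts[b][pos] + pseudocount) / total)
                                    / background) for b in "ACGT"})
    return matrix


def scan(seq, matrix):
    """Score every window of the sequence; returns (start, score) pairs."""
    width = len(matrix)
    return [(i, sum(matrix[k][seq[i + k]] for k in range(width)))
            for i in range(len(seq) - width + 1)]


pwm = log_odds_matrix(COUNTS)
best = max(scan("GGTATAGG", pwm), key=lambda hit: hit[1])
print(best)
```

In practice a score cut-off decides which windows are reported, and that cut-off trades sensitivity against the number of false positives.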

Page 27: Global alignment of human & mouse obese gene promoters (200 bp upstream from TSS)

Fig 5.14, Baxevanis & Ouellette 2005

Page 28: GenBank IDs and Accessions

http://www.ncbi.nlm.nih.gov/RefSeq/key.html#accessions (Accession Formats: RefSeq)

http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html (GenBank Sample Record)

Page 29: Why do we do multiple alignments?

• Help prediction of the secondary and tertiary structures of new sequences;

• Preliminary step in molecular evolution analysis using phylogenetic methods for constructing phylogenetic trees.

Page 30: An example of Multiple Alignment

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--
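One simple thing an alignment makes visible is which columns are identical in every sequence. A minimal sketch over the eight rows of this example (the function name is illustrative):

```python
ALIGNMENT = [
    "VTISCTGSSSNIGAG-NHVKWYQQLPG",
    "VTISCTGTSSNIGS--ITVNWYQQLPG",
    "LRLSCSSSGFIFSS--YAMYWVRQAPG",
    "LSLTCTVSGTSFDD--YYSTWVRQPPG",
    "PEVTCVVVDVSHEDPQVKFNWYVDG--",
    "ATLVCLISDFYPGA--VTVAWKADS--",
    "AALGCLVKDYFPEP--VTVSWNSG---",
    "VSLTCLVKGFYPSD--IAVEWWSNG--",
]


def conserved_columns(rows):
    """Indices (0-based) of columns with one identical residue and no gaps."""
    conserved = []
    for index, column in enumerate(zip(*rows)):
        residues = set(column)
        if len(residues) == 1 and "-" not in residues:
            conserved.append(index)
    return conserved


for index in conserved_columns(ALIGNMENT):
    print(index + 1, ALIGNMENT[0][index])  # 1-based column and its residue
```

Only a cysteine and a tryptophan column are invariant here, even though the rows are clearly related.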

Page 31: Visualization example

Page 32: Other multiple alignment programs

ClustalW / ClustalX

pileup

multalign

multal

saga

hmmt

DIALIGN

SBpima

MLpima

T-Coffee

...

Page 34: ClustalW for multiple alignment

ClustalW can create multiple alignments, manipulate existing alignments, do profile analysis and create phylogenetic trees.

Alignment can be done by 2 methods:
- slow/accurate
- fast/approximate

Page 35: Running ClustalW

[~]% clustalw

 **************************************************************
 ******** CLUSTAL W (1.7) Multiple Sequence Alignments ********
 **************************************************************

     1. Sequence Input From Disc
     2. Multiple Alignments
     3. Profile / Structure Alignments
     4. Phylogenetic trees

     S. Execute a system command
     H. HELP
     X. EXIT (leave program)

Your choice:

Page 36: Running ClustalW

The input file for ClustalW is a file containing all sequences in one of the following formats: NBRF/PIR, EMBL/SwissProt, Pearson (Fasta), GDE, Clustal, GCG/MSF, RSF.

Page 37: Using ClustalW

 ****** MULTIPLE ALIGNMENT MENU ******

    1. Do complete multiple alignment now (Slow/Accurate)
    2. Produce guide tree file only
    3. Do alignment using old guide tree file

    4. Toggle Slow/Fast pairwise alignments = SLOW

    5. Pairwise alignment parameters
    6. Multiple alignment parameters

    7. Reset gaps between alignments? = OFF
    8. Toggle screen display          = ON
    9. Output format options

    S. Execute a system command
    H. HELP
    or press [RETURN] to go back to main menu

Your choice:

Page 38: Output of ClustalW

CLUSTAL W (1.7) multiple sequence alignment

HSTNFR    GGGAAGAG---TTCCCCAGGGACCTCTCTCTAATCAGCCCTCTGGCCCAG------GCAG
SYNTNFTRP GGGAAGAG---TTCCCCAGGGACCTCTCTCTAATCAGCCCTCTGGCCCAG------GCAG
CFTNFA    -------------------------------------------TGTCCAG------ACAG
CATTNFAA  GGGAAGAG---CTCCCACATGGCCTGCAACTAATCAACCCTCTGCCCCAG------ACAC
RABTNFM   AGGAGGAAGAGTCCCCAAACAACCTCCATCTAGTCAACCCTGTGGCCCAGATGGTCACCC
RNTNFAA   AGGAGGAGAAGTTCCCAAATGGGCTCCCTCTCATCAGTTCCATGGCCCAGACCCTCACAC
OATNFA1   GGGAAGAGCAGTCCCCAGCTGGCCCCTCCTTCAACAGGCCTCTGGTTCAG------ACAC
OATNFAR   GGGAAGAGCAGTCCCCAGCTGGCCCCTCCTTCAACAGGCCTCTGGTTCAG------ACAC
BSPTNFA   GGGAAGAGCAGTCCCCAGGTGGCCCCTCCATCAACAGCCCTCTGGTTCAA------ACAC
CEU14683  GGGAAGAGCAATCCCCAACTGGCCTCTCCATCAACAGCCCTCTGGTTCAG------ACCC
                                                         **        *

Page 39: ClustalW options

Your choice: 5

 ********* PAIRWISE ALIGNMENT PARAMETERS *********

     Slow/Accurate alignments:

     1. Gap Open Penalty       :15.00
     2. Gap Extension Penalty  :6.66
     3. Protein weight matrix  :BLOSUM30
     4. DNA weight matrix      :IUB

     Fast/Approximate alignments:

     5. Gap penalty            :5
     6. K-tuple (word) size    :2
     7. No. of top diagonals   :4
     8. Window size            :4

     9. Toggle Slow/Fast pairwise alignments = SLOW

     H. HELP

Enter number (or [RETURN] to exit):

Page 40: ClustalW options

Your choice: 6

 ********* MULTIPLE ALIGNMENT PARAMETERS *********

     1. Gap Opening Penalty         :15.00
     2. Gap Extension Penalty       :6.66
     3. Delay divergent sequences   :40 %

     4. DNA Transitions Weight      :0.50

     5. Protein weight matrix       :BLOSUM series
     6. DNA weight matrix           :IUB
     7. Use negative matrix         :OFF

     8. Protein Gap Parameters

     H. HELP

Enter number (or [RETURN] to exit):
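The same parameters can usually be set without the menus. The sketch below only builds a command line and does not run it. The flag names (-INFILE, -ALIGN, -GAPOPEN, -GAPEXT, -QUICKTREE) are assumed from ClustalW's command-line help and should be checked against the installed version; the executable name and input file name are hypothetical.

```python
def clustalw_command(infile, gap_open=15.0, gap_ext=6.66,
                     quick_tree=False, executable="clustalw"):
    """Build (not run) a ClustalW command line using the menu defaults shown above."""
    command = [
        executable,              # hypothetical executable name on the PATH
        f"-INFILE={infile}",     # input sequences, e.g. in FASTA format
        "-ALIGN",                # do a complete multiple alignment
        f"-GAPOPEN={gap_open}",  # gap opening penalty
        f"-GAPEXT={gap_ext}",    # gap extension penalty
    ]
    if quick_tree:
        command.append("-QUICKTREE")  # fast/approximate pairwise alignments
    return command


print(" ".join(clustalw_command("tnf_promoters.fasta")))
```

The resulting list could be handed to `subprocess.run` once the flags have been confirmed for the local installation.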

Page 41: Blocks database and tools

Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.

The Blocks web server tools are: Block Searcher, Get Blocks and Block Maker. These are aids to detection and verification of protein sequence homology.

They compare a protein or DNA sequence to a database of protein blocks, retrieve blocks, and create new blocks, respectively.

Page 42: The BLOCKS web server

At URL: http://blocks.fhcrc.org/

The BLOCKS WWW server can be used to create blocks of a group of sequences, or to compare a protein sequence to a database of blocks.

The Blocks Searcher tool should be used for multiple alignment of distantly related protein sequences.

Page 43: The Blocks Searcher tool

For searching a database of blocks, the first position of the sequence is aligned with the first position of the first block, and a score for that amino acid is obtained from the profile column corresponding to that position. Scores are summed over the width of the alignment, and then the block is aligned with the next position.

This procedure is carried out exhaustively for all positions of the sequence for all blocks in the database, and the best alignments between a sequence and entries in the BLOCKS database are noted. If a particular block scores highly, it is possible that the sequence is related to the group of sequences the block represents.
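The search procedure described above can be sketched as a sliding window over the query. The two block profiles and the default score for unlisted residues are hypothetical toy values, not entries from the BLOCKS database, and the function names are illustrative.

```python
def best_block_alignment(sequence, profile, default=-1.0):
    """Slide one ungapped block profile along the sequence; return (start, score)."""
    width = len(profile)
    best_start, best_score = None, float("-inf")
    for start in range(len(sequence) - width + 1):
        # Sum the profile-column scores over the width of the block.
        score = sum(profile[k].get(sequence[start + k], default)
                    for k in range(width))
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


def search_blocks(sequence, blocks):
    """Score the sequence against every block and rank blocks by best score."""
    results = {name: best_block_alignment(sequence, profile)
               for name, profile in blocks.items()}
    return sorted(results.items(), key=lambda item: item[1][1], reverse=True)


# Hypothetical profiles: one dict of residue scores per block column.
TOY_BLOCKS = {
    "toy_block_1": [{"C": 3}, {"L": 2, "V": 2}, {"V": 2, "I": 1}, {"K": 3}],
    "toy_block_2": [{"W": 4}, {"Y": 2}, {"Q": 2}],
}
ranking = search_blocks("AALGCLVKDYFPEP", TOY_BLOCKS)
print(ranking)
```

A real search would compare each score with the calibrated chance distribution for that block rather than ranking raw sums.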

Page 44: The Blocks Searcher tool

Typically, a group of proteins has more than one region in common and their relationship is represented as a series of blocks separated by unaligned regions. If a second block for a group also scores highly in the search, the evidence that the sequence is related to the group is strengthened, and is further strengthened if a third block also scores highly, and so on.

Page 45: The BLOCKS Database

The blocks for the BLOCKS database are made automatically by looking for the most highly conserved regions in groups of proteins represented in the PROSITE database. These blocks are then calibrated against the SWISS-PROT database to obtain a measure of the chance distribution of matches. It is these calibrated blocks that make up the BLOCKS database.

Page 46: The Block Maker Tool

• Block Maker finds conserved blocks in a group of two or more unaligned protein sequences, which are assumed to be related, using two different algorithms.

• Input file must contain at least 2 sequences.

• Input sequences must be in FastA format.

• Results are returned by e-mail.

Page 47: Progressive Approaches

CLUSTALW
• Perform pairwise alignments
• Construct a tree, joining most similar sequences first (guide tree)
• Align sequences sequentially, using the phylogenetic tree

PILEUP
• Similar to CLUSTALW
• Uses UPGMA to produce tree (chapter 6)

Page 48: Clustal method

• Higgins and Sharp 1988
  ref: CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73, 237–244. [Medline]

• Progressive alignment method

• An approximation strategy (heuristic algorithm) yields a possible alignment, but not necessarily the best one

Page 49: First step

Sequences: A, B, C, D

Compute the pairwise alignments for all against all (6 pairwise alignments); the similarities are stored in a table:

      A    B    C    D
A
B    11
C     3    1
D     2    2   10
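The first step can be sketched as a loop over all unordered pairs; for 4 sequences that is 4 x 3 / 2 = 6 comparisons. The sequences and the identity-count score below are hypothetical stand-ins; a real implementation would use a pairwise alignment score here.

```python
from itertools import combinations


def similarity_table(sequences, score):
    """Score every unordered pair of sequences: n * (n - 1) / 2 comparisons."""
    table = {}
    for a, b in combinations(sorted(sequences), 2):
        table[(a, b)] = score(sequences[a], sequences[b])
    return table


def identity_count(x, y):
    """Stand-in score: identical positions in two equal-length sequences."""
    return sum(1 for p, q in zip(x, y) if p == q)


# Hypothetical sequences, not those behind the table on the slide.
toy = {"A": "GATTACA", "B": "GATTACC", "C": "CCTGGAA", "D": "CCTGGAT"}
table = similarity_table(toy, identity_count)
for pair, value in table.items():
    print(pair, value)
```

The number of pairs grows quadratically with the number of sequences, which is why ClustalW offers a fast/approximate mode for this step.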

Page 50: Second step

      A    B    C    D
A
B    11
C     3    1
D     2    2   10

Cluster the sequences to create a tree (guide tree):

• Represents the order in which pairs of sequences are to be aligned
• Highly similar sequences are neighbors in the tree
• Highly distant sequences are distant from each other in the tree

[Figure: guide tree over sequences A, B, C, D]
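As a sketch of the clustering step, the code below repeatedly joins the two most similar clusters, using the table values from the slide and treating them as similarities, as the slide states. Average linkage is used for simplicity; this is an assumption and not necessarily the tree-building method ClustalW itself applies.

```python
from itertools import combinations

# Pairwise similarities from the slide's table.
SIMILARITY = {("A", "B"): 11, ("A", "C"): 3, ("B", "C"): 1,
              ("A", "D"): 2, ("B", "D"): 2, ("C", "D"): 10}


def leaf_similarity(x, y):
    """Look up a pair regardless of the order in which it is given."""
    return SIMILARITY[(x, y)] if (x, y) in SIMILARITY else SIMILARITY[(y, x)]


def leaves(tree):
    """Flatten a nested-tuple tree into its list of leaf names."""
    return [tree] if isinstance(tree, str) else leaves(tree[0]) + leaves(tree[1])


def average_similarity(t1, t2):
    """Mean similarity over all leaf pairs between two clusters."""
    pairs = [(x, y) for x in leaves(t1) for y in leaves(t2)]
    return sum(leaf_similarity(x, y) for x, y in pairs) / len(pairs)


def build_guide_tree(names):
    """Greedily join the most similar clusters; return (tree, join order)."""
    clusters, order = list(names), []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: average_similarity(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append((a, b))
        order.append((a, b))
    return clusters[0], order


tree, order = build_guide_tree(["A", "B", "C", "D"])
print(tree)
print(order)
```

The join order is exactly the order in which the progressive alignment would later merge sequences and sub-alignments.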

Page 51: Third step

[Figure: guide tree over sequences A, B, C, D]

• Align most similar pairs

• Align the alignments as if each of them was a single sequence (with the use of a consensus sequence or a profile)

Page 52: Clustal programs

• ClustalV
• ClustalW
  • Thompson et al., 1994
  • Uses: sequence weighting, position-specific gap penalties and weight matrix choice
  • W stands for "weights" (sequence weighting)
• ClustalX: version with a windowed (graphical) interface

Page 53: ClustalW method rules (1)

Sequence weighting

• Each sequence is weighted according to how different it is from the other sequences.
• This compensates for the case where one specific subfamily is overrepresented in the data.

Page 54: ClustalW method rules (2)

Weight matrix choice

The substitution matrix used for each alignment step depends on the similarity of the sequences.
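The rule can be pictured as a lookup from sequence similarity to a matrix in a series. The cut-offs and the particular BLOSUM members below are illustrative assumptions, not values taken from the slides; the settings actually used should be checked in the ClustalW documentation.

```python
def choose_blosum(percent_identity):
    """Pick a BLOSUM matrix for one alignment step (illustrative cut-offs only)."""
    if percent_identity >= 80:
        return "BLOSUM80"  # closely related sequences: "hard" matrix
    if percent_identity >= 60:
        return "BLOSUM62"
    if percent_identity >= 30:
        return "BLOSUM45"
    return "BLOSUM30"      # very divergent sequences: "soft" matrix


for identity in (95, 70, 40, 20):
    print(identity, choose_blosum(identity))
```

The idea is that closely related sequences are scored with a stricter matrix, while divergent ones are scored with a more permissive one.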

Page 55: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

ClustalW method rules (3)

position-specific gap penalties

Gaps found in initial alignments remain fixed through the process (ends gap)

Hydrophobic residues have higher gap penalties than hydrophilic ones: they are more likely to be in the hydrophobic core, where gaps should not occur.
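The hydrophobic rule can be sketched as a per-column gap-opening penalty. Everything numeric here is an assumption for illustration: the base penalty, the size of the increase, and the residue set treated as hydrophobic are not ClustalW's actual parameters.

```python
# Assumed hydrophobic residue set; other classifications exist.
HYDROPHOBIC = set("AVLIMFW")

def gap_open_penalties(aligned, base=10.0, hydrophobic_boost=0.5, gap="-"):
    """Return one gap-opening penalty per column of an existing alignment.

    The penalty grows with the fraction of hydrophobic residues in the column,
    so gaps become less likely in probable core positions.
    """
    penalties = []
    for column in zip(*aligned):
        residues = [r for r in column if r != gap]
        if not residues:
            penalties.append(base)
            continue
        fraction = sum(r in HYDROPHOBIC for r in residues) / len(residues)
        penalties.append(base * (1.0 + hydrophobic_boost * fraction))
    return penalties

# Column 1 is fully hydrophobic (L, L, I); column 2 is charged (K, D, E).
print(gap_open_penalties(["LK", "LD", "IE"]))
```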

Page 56: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

ClustalW method shortcomings

(1) Sequences that are similar only in sub-regions: ClustalW forces a global alignment, not a local one.

(2) A sequence that contains a large insertion/deletion compared to the rest will strongly affect the alignment (again, global not local).

Page 57: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

ClustalW method shortcomings

(3) A sequence that contains a repetitive element (such as a domain), whereas all other sequences contain only one copy.

Page 58: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Comments

Pairwise alignment is an optimal algorithm

Multiple alignment is not an optimal algorithm, only a heuristic. Better alignments may exist!

The algorithm yields a possible alignment, but not necessarily the best one.
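To make "optimal" concrete for the pairwise case, here is a short Python sketch of the Needleman-Wunsch dynamic-programming recurrence, which returns the best possible global alignment score for two sequences. The scoring values (match, mismatch, linear gap penalty) are assumptions chosen for the example.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b under a linear gap penalty."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

print(needleman_wunsch_score("ACGT", "AGT"))  # 1: three matches and one gap
```

The same exhaustive approach extended to N sequences needs an N-dimensional table, which is why multiple alignment programs fall back on heuristics such as the progressive method.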

Page 59: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

ClustalW on the web server

Global multiple sequence alignment program for DNA or proteins

Available from a number of sites
EMBL-EBI

Page 60: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Results

Page 61: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Results

Page 62: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Alignment with colors

identity similarity

Page 63: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

CLUSTAL format

CLUSTAL W (1.82) multiple sequence alignment

YPK1            SQLSWKRLLMKGYIPPYKPAVSD-Q--NSMDTSNFDEEFTR--SEKPIDSVVDEYLSESV
YPK2            KDISWKKLLLKGYIPPYKPIVKDTQ--SEIDTANFDQEFTK---EKPIDSVVDEYLSASI
KPCA_HUMAN      RRIDWEKLENREIQPPFKPKVC------GKGAENFDKFFTR---GQPVLTPPDQLVIANI
KPCZ_HUMAN      RSIDWDLLEKKQALPPFQPQIT---M-DDYGLDNFDTQFTS---EPVQLTPDDEDAIKRI
KAPA            KEVVWEKLLSRNIETPYEPPIQ----QGQGDTSQFDKYPE----EDINYGVQGEDPYADL
KAPC            NEVIWEKLLARYIETPYEPPIQ----QGQGDTSQFDRYPE-EVDEEFNYGIQGEDPYMDL
KAPB            SEVVWERLLAKDIETPYEPPIT----SGIGDTSLFDQYPE-DV-EQLDYGIQGDDPYAEY
KS6_HUMAN       RHINWEELLARKVEPPFKPLLQ-----SEEDVSQFDSKFTR-V-QTPVDSP-DDSTLSES

[conservation line: * *.]

YPK1            -----MQKQF
YPK2            ----N-QKQF
KPCA_HUMAN      D--O--QSDF
KPCZ_HUMAN      D-----QSEF
KAPA            -D----FRDF
KAPC            -D----MKEF
KAPB            --P---FQDF
KS6_HUMAN       A-----NQVF
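A file in this layout can be read with a few lines of Python. The sketch below is a minimal reader written for illustration, not part of any Clustal distribution; it assumes a "CLUSTAL" header line, name and sequence separated by whitespace, and conservation lines that begin with whitespace. The function name and the toy input are hypothetical.

```python
def parse_clustal(text):
    """Return {name: aligned_sequence}, concatenating the blocks in order."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("missing CLUSTAL header line")
    sequences = {}
    for line in lines[1:]:
        # Skip blank lines and conservation lines (which start with whitespace).
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        name, chunk = fields[0], fields[1]  # an optional residue count is ignored
        sequences[name] = sequences.get(name, "") + chunk
    return sequences

toy = """CLUSTAL W (1.82) multiple sequence alignment

seq1      ACD-EF
seq2      ACDGEF
          *** **

seq1      GH
seq2      GH
          **
"""
print(parse_clustal(toy))
```

For real work, an established parser such as the alignment readers in Biopython is a safer choice than a hand-written one.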

Page 64: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

ClustalW at EMBL - Jalview

conservation

Jalview is a multiple alignment editor

Page 65: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Jalview

color menu:

Taylor colors (each amino acid is colored differently)
Zappo colors (amino acids are colored according to their physico-chemical properties)
Hydrophobicity colors (colors amino acids according to a certain score scale that represents hydrophobicity)
Coloring residues above a percentage identity threshold
User defined color schemes

Page 66: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Example - Zappo colors

physico-chemical properties color-code:

Page 67: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Guide Tree

Page 68: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

ClustalX

ClustalX provides a window-based user interface to the ClustalW program.

It uses the Vibrant multi-platform user interface development library, developed by the NCBI as part of their NCBI SOFTWARE DEVELOPMENT TOOLKIT.

Page 69: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

T-coffee

Another MSA program
Protein & nucleotide MSA program
Uses principles similar to ClustalW
More accurate but longer running times
Limits the number of sequences it can align (~100)
T-coffee at EMBnet

Page 70: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Page 71: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

T-coffee results

Page 72: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Phylip format

5 99
Cabd_199509  PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMNLPGKWKPKIIGGI
JCSB1_199401 PQITLWQRPIVTIKIGGQLKEALLDTGAD---LEEM-NPGRWKPKIIGGI
JCSB2_199401 PQITLWQRPVT-IK-GG-QLEALLDTGADDTL-EEI-LPGRW-PKMIGGI
JCSB4_199401 PQITLWQRPVT--K-GG-LKEALLDTGADDTE-----DPGRWKPKMIGGI
JCSB5_199401 PQITLWQRPIVTIKVGGQLKEALLDTGADDTVL-EMNLPGRWKPKMIGGI

GGFIKVRQYDQVPIEICGHKAIGTVLVGPTPSNIIGRNLLTQLGCTLNF
GGFVKVRQYDQIPIDICGHKVIGTVL-GPTPANVIGRNLLTQIGCTLNF
GGFVKVR-YDQVPIEICGH--IGTVLVGPTPANIIGRNLMTQLGCTLNF
GGFLKVRQYDQIPVEICGHKAIGTVL-GPTPANIIGRNLLTQIG-TLNF
GGFVKVRQYDQIPIEICGHKAIGTVLVGPTPANIVGRNLLTQIGCTLNF
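The layout above (a header with the number of sequences and the alignment length, then interleaved blocks with names only in the first block) can be produced with a short Python function. This is a sketch of the "relaxed" variant, which pads names to the longest one; strict PHYLIP expects names of exactly 10 characters. The function name and parameters are hypothetical.

```python
def to_phylip(records, block=50):
    """Format (name, aligned_sequence) pairs as interleaved, relaxed PHYLIP text."""
    if not records:
        raise ValueError("no sequences given")
    names = [name for name, _ in records]
    seqs = [seq for _, seq in records]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    width = max(len(n) for n in names) + 1  # relaxed: pad to the longest name
    out = [f"{len(seqs)} {length}"]
    for start in range(0, length, block):
        if start:
            out.append("")  # blank line between blocks
        for name, seq in zip(names, seqs):
            # Names appear only in the first block of an interleaved file.
            prefix = name.ljust(width) if start == 0 else ""
            out.append(prefix + seq[start:start + block])
    return "\n".join(out) + "\n"

print(to_phylip([("a", "ACGT"), ("bb", "A-GT")], block=2))
```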

Page 73: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

The Biology WorkBench

http://workbench.sdsc.edu/
http://www.ngbw.org/

Nucleic Acid Sequence Tools, including BLAST, CLUSTALW, MFOLD, PRIMER3

Page 74: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Muscle

Protein & nucleotide MSA program
Improvements in both accuracy and speed
exploiting a range of existing and new algorithmic techniques
combination of progressive and iterative alignment strategies
details of the method
web server
downloads: Windows, Linux, Mac

Page 75: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Muscle web server

Page 76: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Editing MSA

There are a variety of tools that can be used to modify a multiple alignment (SeaView, BioEdit, JalView)

These programs can be very useful in formatting and annotating an alignment for publication.

An editor can also be used to make modifications by hand to improve biologically significant regions in a multiple alignment created by one of the automated alignment programs.

Page 77: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

MSA approaches

Progressive approach
CLUSTALW (CLUSTALX), PileUp, T-COFFEE, MAFFT, MUSCLE

Iterative approach: Repeatedly realign subsets of sequences.
MultAlin, DIALIGN, MAFFT, MUSCLE, ProbCons

Genetic algorithm
SAGA

Graph algorithm
POA

Page 78: Design and creation of multiple sequence alignments Unit 15 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Conclusion

There is no single method that always generates the best alignment

It may thus be wise to use more than one method

Alignment editors can be used to correct the alignments