The “Nuts and Bolts” of ‘doing’ bioinformatics with the Wisconsin Package at FSU
Steve Thompson
Florida State University School of Computational Science (SCS)
BCH 5425 Molecular Biology
Dr. Hong Li
February 16, 2005
Given nucleotide or amino acid sequence data, what can we learn about biological molecules, using the popular Accelrys Wisconsin Package?
But first some of my definitions, lots of overlap:
Biocomputing and computational biology are synonyms and describe the use of computers and computational techniques to analyze any type of biological system, from individual molecules to organisms to overall ecology.
Bioinformatics describes using computational techniques to access, analyze, and interpret the biological information in any type of biological database.
Sequence analysis is the study of molecular sequence data for the purpose of inferring the function, interactions, evolution, and perhaps structure of biological molecules.
Genomics analyzes the context of genes or complete genomes (the total DNA content of an organism) within the same and/or across different genomes.
Proteomics is the subdivision of genomics concerned with analyzing the complete protein complement, i.e. the proteome, of organisms, both within and between different organisms.
And one way to think about the field: the reverse biochemistry analogy.
Biochemists no longer have to begin a research project by isolating and purifying massive amounts of a protein from its native organism in order to characterize a particular gene product. Rather, now scientists can amplify a section of some genome based on its similarity to other genomes, sequence that piece of DNA and, using sequence analysis tools, infer all sorts of functional, evolutionary, and, perhaps, structural insight into that stretch of DNA! They can then clone and express it.
The computer and molecular databases are a necessary, integral part of this entire process.
The exponential growth of molecular sequence databases & CPU power:

Year   Base pairs  Sequences
1982       680338        606
1983      2274029       2427
1984      3368765       4175
1985      5204420       5700
1986      9615371       9978
1987     15514776      14584
1988     23800000      20579
1989     34762585      28791
1990     49179285      39533
1991     71947426      55627
1992    101008486      78608
1993    157152442     143492
1994    217102462     215273
1995    384939485     555694
1996    651972984    1021211
1997   1160300687    1765847
1998   2008761784    2837897
1999   3841163011    4864570
2000  11101066288   10106023
2001  15849921438   14976310
2002  28507990166   22318883
2003  36553368485   30968418

Source: http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
Doubling time: ~one year
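As a rough check on the doubling-time estimate, the first and last rows of the GenBank table can be plugged into a simple exponential-growth formula. This is only a sketch: it assumes steady exponential growth across the whole period, and the function name `doubling_time` is made up for illustration.

```python
import math

# Base-pair totals copied from the first and last rows of the GenBank table.
first_year, first_bp = 1982, 680_338
last_year, last_bp = 2003, 36_553_368_485


def doubling_time(y0, n0, y1, n1):
    """Average doubling time in years, assuming steady exponential growth."""
    doublings = math.log2(n1 / n0)  # how many times the database doubled
    return (y1 - y0) / doublings


overall = doubling_time(first_year, first_bp, last_year, last_bp)
print(f"average doubling time {first_year}-{last_year}: {overall:.2f} years")
```

Averaged over all 21 years the figure comes out somewhat above one year; applying the same function to shorter stretches of the table shows how the pace varies from period to period.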
Another perspective on size and some organization stuff:

Nucleic Acid DB’s
  GenBank/EMBL/DDBJ
    all Taxonomic categories + HTC’s, HTG’s, & STS’s
    “Tags”
      EST’s
      GSS’s

Amino Acid DB’s
  SWISS-PROT
  TrEMBL
  PIR
    PIR1
    PIR2
    PIR3
    PIR4
  NRL_3D
  Genpept

As of February 2005 the sequences in GenBank also include over 240 complete genomes, not including viruses! Nucleic acid sequence databases (and TrEMBL) are split into subdivisions based on taxonomy (historical rankings; the Fungi and Archaea warning!). PIR is split into subdivisions based on level of annotation. TrEMBL sequences are merged into SWISS-PROT as they receive increased levels of annotation.
So how do you access and manipulate all this data?
Often on the Internet over the World Wide Web:

Site                         URL (Uniform Resource Locator)             Content
Nat’l Center Biotech' Info'  http://www.ncbi.nlm.nih.gov/               databases/analysis/software
PIR/NBRF                     http://www-nbrf.georgetown.edu/            protein sequence database
IUBIO Biology Archive        http://iubio.bio.indiana.edu/              database/software archive
Univ. of Montreal            http://megasun.bch.umontreal.ca/           database/software archive
Japan's GenomeNet            http://www.genome.ad.jp/                   databases/analysis/software
European Mol' Bio' Lab'      http://www.embl-heidelberg.de/             databases/analysis/software
European Bioinformatics      http://www.ebi.ac.uk/                      databases/analysis/software
The Sanger Institute         http://www.sanger.ac.uk/                   databases/analysis/software
Univ. of Geneva BioWeb       http://www.expasy.ch/                      databases/analysis/software
ProteinDataBank              http://www.rcsb.org/pdb/                   3D mol' structure database
Molecules to Go              http://molbio.info.nih.gov/cgi-bin/pdb/    3D protein/nuc' visualization
The Genome DataBase          http://www.gdb.org/                        The Human Genome Project
Stanford Genomics            http://genome-www.stanford.edu/            various genome projects
Inst. for Genomic Res’rch    http://www.tigr.org/                       esp. microbial genome projects
HIV Sequence Database        http://hiv-web.lanl.gov/                   HIV epidemiology seq' DB
The Tree of Life             http://tolweb.org/tree/phylogeny.html      overview of all phylogeny
Ribosomal Database Proj’     http://rdp.cme.msu.edu/index.jsp           databases/analysis/software
PUMA2 at Argonne             http://compbio.mcs.anl.gov/puma2/cgi-bin/  metabolic reconstruction
Harvard Bio' Laboratories    http://golgi.harvard.edu/                  nice bioinformatics links list

With a World Wide Web browser and tools like NCBI’s Entrez & EMBL’s SRS
But ‘doing’ bioinformatics on the Web has both its pros and its cons:
Advantages: Accesses the very latest database updates. It’s fun and very fast. It can be very powerful and efficient, if you know what you’re doing. In most cases relational links between different databases ease navigation, and in some cases neighboring concepts link similar entries.
Disadvantages: Can be very inefficient, if you don’t know what you’re doing. Reformatting downloaded sequence data is usually essential, if the sequence is to be used in any other software. And, it’s very easy to get lost and distracted in cyberspace!
Also, problems sometimes arise with the World Wide Web itself, like dropped or slow connections . . . .
So what are the alternatives?
Personal computer software solutions: public domain programs are available, but . . . a bit complicated to install, configure, and maintain. User must be pretty computer savvy. So,
good commercial software packages are also available, e.g. Sequencher, MacVector, DNAStar, DNAsis, etc., but . . . license hassles, especially big expense per machine, and Internet and/or CD database access all complicate matters!
Therefore, UNIX server-based, non-Web solutions are available as an alternative.
Public domain solutions also exist for UNIX servers, but now a very cooperative systems manager needs to maintain everything for users. So,
commercial products, e.g. the Accelrys GCG Wisconsin Package [a Pharmacopeia Co.] and the SeqLab Graphical User Interface, simplify matters for administrators and users. One commercial license fee for an entire institution and very fast, convenient database access on local server disks. Connections from any networked terminal or workstation anywhere, anytime!
Mendel (mendel.csit.fsu.edu): FSU’s UNIX (Linux) Biocomputing Server
Operating system: UNIX command line;
communications software: telnet vs. ssh; X graphics;
    ssh -X user@mendel.csit.fsu.edu
file transfer: ftp vs. scp/sftp;
and editors: vi, emacs, pico (or word processing followed by file transfer [save as "text only!"]).
How do I get an account? Just ask me! I am the contact person for Mendel. It usually takes a couple of days for the SCS system administrator to act on my request. Anybody associated with FSU is entitled to an account and there are NO fees associated with it.
The Genetics Computer Group: the Wisconsin Package for Sequence Analysis.
Begun in 1982 in Oliver Smithies’ lab at the Genetics Dept. at the University of Wisconsin, Madison, then a private company for over 10 years, then acquired by the Oxford Molecular Group U.K., and now owned by Pharmacopeia U.S.A. under the new name Accelrys, Inc.
The suite contains almost 150 programs designed to work in a "toolbox" fashion. Several simple programs used in succession can lead to sophisticated results.
Also 'internal compatibility,' i.e. once you learn to use one program, all programs can be run similarly, and the output from many programs can be used as input for other programs.
Used all over the world by more than 30,000 scientists at over 530 institutions in 35 countries, so learning it here will most likely be useful anywhere else you may end up.
To answer the always perplexing GCG question, “What sequence(s)? . . . .”
Specifying sequences, GCG style; in order of increasing power and complexity:
1) The sequence is in a local GCG format single sequence file in your UNIX account. (GCG Reformat and all From- & To- programs)
2) The sequence is in a local GCG database in which case you ‘point’ to it by using any of the GCG database logical names. A colon, “:,” always sets the logical name apart from either an accession number or a proper identifier name or a wildcard expression and they are case insensitive.
3) The sequence is in a GCG format multiple sequence file, either an MSF (multiple sequence format) file or an RSF (rich sequence format) file. To specify sequences contained in a GCG multiple sequence file, supply the file name followed by a pair of braces, “{},” containing the sequence specification, e.g. a wildcard: {*}.
4) Finally, the most powerful method of specifying sequences is in a GCG “list” file. This is merely a list of other sequence specifications and can even contain other list files within it. The convention to use a GCG list file in a program is to precede it with an at sign, “@.” Furthermore, attribute information within list files can specify particular sequence aspects.
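The four conventions can be told apart from punctuation alone. The sketch below is only an illustration of that idea, not part of GCG: the function `classify_spec` is hypothetical, and it assumes a specification is a single whitespace-free token.

```python
def classify_spec(spec: str) -> str:
    """Guess which of the four GCG sequence specification styles a string uses."""
    spec = spec.strip()
    if spec.startswith("@"):
        return "list file"               # 4) @listfile
    if "{" in spec and spec.endswith("}"):
        return "multiple sequence file"  # 3) file.msf{*} or file.rsf{name}
    if ":" in spec:
        return "database entry"          # 2) logicalname:identifier
    return "single sequence file"        # 1) a plain file in your account


for example in ("my-special.pep", "SwissProt:EfTu_Ecoli",
                "Ef1a-Tu.msf{*}", "@another.list"):
    print(f"{example:25s} {classify_spec(example)}")
```

The order of the checks matters: a list file or a multiple sequence file name could itself contain a colon, so those markers are tested first.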
The first way:
‘Clean’ GCG format single sequence file after ‘reformat’ (or any of the From… programs)

This is a small example of GCG single sequence format.
Always put some documentation on top, so in the future
you can figure out what it is you're dealing with! The
line with the two periods is converted to the checksum line.

example.seq  Length: 77  July 21, 1999 09:30  Type: N  Check: 4099  ..

       1  ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA
      51  GATTTAATAG CATGCGATCC CATGGGA

SeqLab’s Editor mode can also “Import” native GenBank format and ABI or LI-COR trace files!
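To make the layout concrete, here is a minimal Python sketch that reads the example file: everything up to the line ending in two periods is treated as documentation plus header, and everything after it as sequence. The function `parse_gcg` is hypothetical and not a GCG tool; it assumes position numbers and spaces are the only non-sequence characters after the header, and it does not recompute the checksum.

```python
import re

EXAMPLE = """\
This is a small example of GCG single sequence format.
Always put some documentation on top, so in the future
you can figure out what it is you're dealing with! The
line with the two periods is converted to the checksum line.

example.seq  Length: 77  July 21, 1999 09:30  Type: N  Check: 4099  ..

       1  ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA
      51  GATTTAATAG CATGCGATCC CATGGGA
"""


def parse_gcg(text):
    """Return (header line, declared length, sequence) from GCG single sequence text."""
    lines = text.splitlines()
    # The header (checksum) line is the first line that ends with two periods.
    idx = next(i for i, line in enumerate(lines) if line.rstrip().endswith(".."))
    header = lines[idx]
    declared = int(re.search(r"Length:\s*(\d+)", header).group(1))
    # Everything after the header is sequence; drop position numbers and blanks.
    sequence = re.sub(r"[\d\s]", "", "".join(lines[idx + 1:]))
    return header, declared, sequence


header, declared, sequence = parse_gcg(EXAMPLE)
print(declared, len(sequence))
```

Comparing the declared length with the number of residues actually read is a cheap sanity check before passing a file on to other programs.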
The logical terms for the second way of running the Wisconsin Package

Sequence databases, nucleic acids:
GENBANKPLUS    all of GenBank plus EST and GSS subdivisions
GBP            all of GenBank plus EST and GSS subdivisions
GENBANK        all of GenBank except EST and GSS subdivisions
GB             all of GenBank except EST and GSS subdivisions
BA             GenBank bacterial subdivision
BACTERIAL      GenBank bacterial subdivision
EST            GenBank EST (Expressed Sequence Tags) subdivision
GSS            GenBank GSS (Genome Survey Sequences) subdivision
HTC            GenBank High Throughput cDNA
HTG            GenBank High Throughput Genomic
IN             GenBank invertebrate subdivision
INVERTEBRATE   GenBank invertebrate subdivision
OM             GenBank other mammalian subdivision
OTHERMAMM      GenBank other mammalian subdivision
OV             GenBank other vertebrate subdivision
OTHERVERT      GenBank other vertebrate subdivision
PAT            GenBank patent subdivision
PATENT         GenBank patent subdivision
PH             GenBank phage subdivision
PHAGE          GenBank phage subdivision
PL             GenBank plant subdivision
PLANT          GenBank plant subdivision
PR             GenBank primate subdivision
PRIMATE        GenBank primate subdivision
RO             GenBank rodent subdivision
RODENT         GenBank rodent subdivision
STS            GenBank (sequence tagged sites) subdivision
SY             GenBank synthetic subdivision
SYNTHETIC      GenBank synthetic subdivision
TAGS           GenBank EST and GSS subdivisions
UN             GenBank unannotated subdivision
UNANNOTATED    GenBank unannotated subdivision
VI             GenBank viral subdivision
VIRAL          GenBank viral subdivision

Sequence databases, amino acids:
GENPEPT        GenBank CDS translations
GP             GenBank CDS translations
SWISSPROTPLUS  all of Swiss-Prot and all of SPTrEMBL
SWP            all of Swiss-Prot and all of SPTrEMBL
SWISSPROT      all of Swiss-Prot (fully annotated)
SW             all of Swiss-Prot (fully annotated)
SPTREMBL       Swiss-Prot preliminary EMBL translations
SPT            Swiss-Prot preliminary EMBL translations
P              all of PIR Protein
PIR            all of PIR Protein
PROTEIN        PIR fully annotated subdivision
PIR1           PIR fully annotated subdivision
PIR2           PIR preliminary subdivision
PIR3           PIR unverified subdivision
PIR4           PIR unencoded subdivision
NRL_3D         PDB 3D protein sequences
NRL            PDB 3D protein sequences

Genome databases:
HOMO           NCBI human refseq
DANIO          Sanger Zebrafish build

General data files:
GENMOREDATA    path to GCG optional data files
GENRUNDATA     path to GCG default data files
GENTRAINDATA   path to GCG training datasets

These are easy: they make sense and you’ll have a vested interest.
The third way: multiple sequence formats, GCG MSF & RSF format
The trick is to not forget the Braces and ‘wild card,’ e.g. filename{*}!

RSF (this is SeqLab’s native format):

!!RICH_SEQUENCE 1.0
..
{
name ef1a_giala
descrip PileUp of: @/users1/thompson/.seqlab-mendel/pileup_28.list
type PROTEIN
longname /users1/thompson/seqlab/EF1A_primitive.orig.msf{ef1a_giala}
sequence-ID Q08046
checksum 7342
offset 23
creation-date 07/11/2001 16:51:19
strand 1
comments ////////////////////////////////////////////////////////////

MSF:

!!AA_MULTIPLE_ALIGNMENT 1.0

small.pfs.msf  MSF: 735  Type: P  July 20, 2001 14:53  Check: 6619 ..

 Name: a49171  Len: 425  Check: 537   Weight: 1.00
 Name: e70827  Len: 577  Check: 21    Weight: 1.00
 Name: g83052  Len: 718  Check: 9535  Weight: 1.00
 Name: f70556  Len: 534  Check: 3494  Weight: 1.00
 Name: t17237  Len: 229  Check: 9552  Weight: 1.00
 Name: s65758  Len: 735  Check: 111   Weight: 1.00
 Name: a46241  Len: 274  Check: 3514  Weight: 1.00

// //////////////////////////////////////////////////
And the fourth way, the most powerful way by far: the List File format

An example GCG list file of many elongation
1a and Tu factors follows. As with all GCG
data files, two periods separate
documentation from data. ..

my-special.pep  begin:24  end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
/usr/accounts/test/another.rsf{ef1a_*}
@another.list

The ‘way’ SeqLab works!
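A list file like the example can be read with a few lines of Python. This is a hedged sketch only: `parse_list_file` is a hypothetical helper, and it assumes attributes are written as `key:value` tokens with no space after the colon, exactly as they appear in the example.

```python
LIST_FILE = """\
An example GCG list file of many elongation
1a and Tu factors follows. As with all GCG
data files, two periods separate
documentation from data. ..

my-special.pep  begin:24  end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
/usr/accounts/test/another.rsf{ef1a_*}
@another.list
"""


def parse_list_file(text):
    """Return [(sequence specification, attribute dict), ...] from list file text."""
    lines = text.splitlines()
    # Data starts on the line after the one that ends with two periods.
    start = next(i for i, line in enumerate(lines)
                 if line.rstrip().endswith("..")) + 1
    entries = []
    for line in lines[start:]:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        spec, attributes = fields[0], {}
        for field in fields[1:]:
            key, _, value = field.partition(":")
            attributes[key] = value
        entries.append((spec, attributes))
    return entries


for spec, attributes in parse_list_file(LIST_FILE):
    print(spec, attributes)
```

Entries that begin with an at sign point to further list files, so a fuller reader would have to follow them recursively; this sketch just reports them as they stand.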
Within the GCG suite:
LookUp, a Sequence Retrieval System (SRS) derivative, is used to find sequences of interest based on text words, and database similarity searches find sequences from local GCG server databases.
Advantages: Search output is a legitimate GCG list file, appropriate input to other GCG programs; no need to download and then reformat; it’s all GCG.
Disadvantage: DB’s only as new as GCG administrator (me) maintains them. I update every two months to coincide with NCBI’s full releases.
Let’s build two list files with LookUp:
One, elongation factor 1 alpha from humans, and
two, all proteins in the SwissProt database from the so-called non-crown ‘primitive’ eukaryotes.
I’ll use the following search strings:
“elongation & factor & alpha” in the “Definition” category and “Homo” in the “Organism” field for the first search, and
“eukaryota ! ( fungi | metazoa | viridiplantae )” in the “Organism” category for the second search.
These two searches illustrate LookUp’s syntax rules, in particular its Boolean qualifiers.
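The Boolean logic of the second search string maps directly onto set operations. The sketch below assumes “!” means “but not”, which is how the non-crown eukaryote query reads; the entry names and their taxonomy words are invented toy data, not real database records.

```python
# Invented toy data: entry name -> taxonomy words found in its Organism field.
entries = {
    "entry_a": {"eukaryota", "metazoa"},
    "entry_b": {"eukaryota", "fungi"},
    "entry_c": {"eukaryota", "diplomonadida"},
    "entry_d": {"bacteria"},
}


def having(word):
    """All entry names whose Organism field contains the given word."""
    return {name for name, words in entries.items() if word in words}


# eukaryota ! ( fungi | metazoa | viridiplantae )
crown = having("fungi") | having("metazoa") | having("viridiplantae")
hits = having("eukaryota") - crown
print(sorted(hits))
```

In the same picture the “&” of the first search is a set intersection: an entry must contain every one of the words to be reported.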
SeqLab: GCG’s X-based GUI!
The SeqLab graphical user interface is the merger of Steve Smith’s Genetic Data Environment and GCG’s Wisconsin Package Interface:
GDE + WPI = SeqLab
Requires an X-Windowing environment, either native on UNIX computers (including LINUX, but not included by Apple in Mac OS X [v.10+]; see Apple’s free X11 package), or emulated with X-Server Software on other personal computers.
SeqLab: Editor mode, residue display
Structural & functional correspondence
So let’s see what it looks like, SeqLab in action:
From an X ‘aware’ terminal window I launch the GUI with the command:
    seqlab &
The ampersand is not required, but it allows you to continue to use the terminal window for system level commands by running SeqLab as a background process.
OK then, how can we see if two sequences are similar enough to belong in alignments? So first homology and similarity:
Don’t confuse homology with similarity: there is a huge difference! Similarity is a statistic that describes how much two (sub)sequences are alike according to some set scoring criteria. It can be normalized to ascertain statistical significance, but it’s still just a number.
Homology, in contrast and by definition, implies an evolutionary relationship, more than just everything evolving from the same primordial ‘slime.’ To demonstrate homology reconstruct the phylogeny of the organisms or genes of interest. Better yet, show some experimental evidence (structural, morphological, genetic, and/or fossil) that corroborates your assertion.
Percent homology is an invalid concept; something is either homologous or it is not. Walter Fitch is credited with the joke “homology is like pregnancy: you can’t be 45% pregnant, just like something can’t be 45% homologous.” Highly significant similarity can argue for homology; however, the inverse does not hold: a lack of significant similarity does not rule out homology.
So, to introduce the concept of sequence comparison, a graphical method . . .
One way: Dot Matrices.
Provide a ‘Gestalt’ of all possible alignments between two sequences.
To begin, use a very simple 0, 1 (no match, match) identity scoring function.
Put a dot wherever symbols match.
Identities and insertion/deletion events (indels) identified (zero:one match score matrix, no window).
Noise due to random composition contributes to confusion. To ‘clean up’ the plot consider a filtered windowing approach. A dot is placed at the middle of a window if some ‘stringency’ is met within that defined window size. Then the window is shifted one position and the entire process is repeated (zero:one match score, window of size three and a stringency level of two out of three).
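The windowed dot matrix just described can be sketched in a few lines of Python. This is only an illustration, not the Compare/DotPlot implementation: the function name `dot_matrix` and its parameters are made up here, and the scoring is the simple zero:one identity scheme from the slide.

```python
def dot_matrix(a, b, window=1, stringency=1):
    """Return the set of (i, j) cells that receive a dot.

    A window of `window` symbols slides along both sequences; a dot is
    placed at the middle of the window when at least `stringency`
    positions inside it are identical (zero:one match scoring).
    """
    dots = set()
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            matches = sum(a[i + k] == b[j + k] for k in range(window))
            if matches >= stringency:
                dots.add((i + window // 2, j + window // 2))
    return dots


if __name__ == "__main__":
    # No window: every identical pair of symbols gets a dot.
    print(sorted(dot_matrix("ACGT", "ACTT")))
    # Window of three, stringency two out of three: fewer, cleaner dots.
    print(sorted(dot_matrix("ACGT", "ACTT", window=3, stringency=2)))
```

With `window=1` and `stringency=1` this reduces to the unfiltered plot; raising both trades sensitivity for less noise.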
Dot matrix analysis requires two programs in the Wisconsin Package:
Compare generates the data that serves as input to DotPlot, which actually draws the matrix.
Let’s see how a couple of the elongation factors that we found earlier look using this method.
SW:EF11_Human vs. SW:EF1a_Schco
Exact alignment: but how can we ‘see’ the correspondence of individual residues?
We can compare one molecule against another by aligning them. However, a ‘brute force’ approach just won’t work. Even without considering the introduction of gaps, the computation required to compare all possible alignments between two sequences requires time proportional to the product of the lengths of the two sequences. Therefore, if the two sequences are approximately the same length (N), this is an $N^2$ problem. To include gaps, we would have to repeat the calculation 2N times to examine the possibility of gaps at each possible position within the sequences, now an $N^{4N}$ problem. There’s no way! We need an algorithm.
But, just what the heck is an algorithm!?
Merriam-Webster’s says: “A rule of procedure for solving a problem [often mathematical] that frequently involves repetition of an operation.”
So, you could write an algorithm for tying your shoe! It’s just a set of explicit instructions for doing some routine task.
Enter the Dynamic Programming Algorithm!
Computer scientists figured it out long ago; Needleman and Wunsch applied it to the alignment of the full lengths of two sequences in 1970. An optimal alignment is defined as an arrangement of two sequences, 1 of length i and 2 of length j, such that:
1) you maximize the number of matching symbols between 1 and 2;
2) you minimize the number of indels within 1 and 2; and
3) you minimize the number of mismatched symbols between 1 and 2.
Therefore, the actual solution can be represented by:

$$
S_{ij} = s_{ij} + \max
\begin{cases}
S_{i-1,\,j-1} \\
\max\limits_{2 \le x \le i} \left( S_{i-x,\,j-1} + w_{x-1} \right) \\
\max\limits_{2 \le y \le j} \left( S_{i-1,\,j-y} + w_{y-1} \right)
\end{cases}
$$

Where $S_{ij}$ is the score for the alignment ending at i in sequence 1 and j in sequence 2,
$s_{ij}$ is the score for aligning i with j,
$w_x$ is the score for making an x long gap in sequence 1,
$w_y$ is the score for making a y long gap in sequence 2,
allowing gaps to be any length in either sequence.
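To make the recurrence concrete, here is a minimal Python sketch of global dynamic programming. It is a simplification, not the Wisconsin Package’s Gap program: a single per-position gap penalty stands in for the general gap score w, end gaps are charged like interior ones, and the name `nw_score` and the default scores are assumptions chosen for illustration.

```python
def nw_score(a, b, match=1, mismatch=0, gap=-1):
    """Best global alignment score of a and b (Needleman-Wunsch style).

    S[i][j] holds the best score for aligning a[:i] with b[:j]; each cell
    is filled from its diagonal, upper, and left neighbours.
    """
    rows, cols = len(a) + 1, len(b) + 1
    S = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # leading gap in b
        S[i][0] = i * gap
    for j in range(1, cols):          # leading gap in a
        S[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + pair,   # align a[i-1] with b[j-1]
                          S[i - 1][j] + gap,        # gap in b
                          S[i][j - 1] + gap)        # gap in a
    return S[-1][-1]


if __name__ == "__main__":
    print(nw_score("GATTACA", "GATTACA"))
    print(nw_score("AC", "AGC"))
```

Filling the table once costs time proportional to the product of the two lengths, which is what rescues us from the brute force approach.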
An oversimplified example:
total penalty = gap opening penalty {zero here} + ([length of gap][gap extension penalty {one here}])
Optimum Alignments:
There may be more than one best path through the matrix, and optimum doesn’t guarantee biologically correct. Starting at the top and working down, then tracing back, I found one best alignment:

    cTATAtAagg
    |  |||||
    cg.TAtAaT.

With our example’s scoring scheme this alignment’s final score is 5, the highest bottom-right score in the trace-back path graph, and the sum of six matches minus one interior gap. This is the number optimized by the algorithm, not any type of percentage! Only one optimal solution will be reported. Do you have any ideas about how others can be discovered, besides alternate trace back paths? Answer: often, if you reverse the solution of the entire process, other solutions will be found!
This was a global solution. Smith Waterman style local solutions (1981) use negative numbers in the match matrix and pick the best diagonal within the overall graph, which gives a local alignment.
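The local variant can be sketched the same way. The following is an illustrative Python version of Smith Waterman style scoring, not the BestFit implementation: the name `sw_score` and the score values are assumptions, and a single per-position gap penalty is used. The two changes from the global case are that mismatches score negatively and that no cell is allowed to drop below zero, so the best cell anywhere in the table marks the end of the best local alignment.

```python
def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (Smith-Waterman style)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0,                     # start a fresh local alignment
                         prev[j - 1] + pair,    # extend along the diagonal
                         prev[j] + gap,         # gap in b
                         cur[j - 1] + gap)      # gap in a
            best = max(best, cur[j])
        prev = cur
    return best


if __name__ == "__main__":
    # Only the shared ACGT core contributes to the score.
    print(sw_score("TTTACGTAAA", "CCCACGTGGG"))
```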
What about proteins: conservative replacements and similarity as opposed to identity? The nitrogenous bases, A, C, T, G, are either the same or they’re not, but amino acids can be similar, genetically, evolutionarily, and structurally! Enter log-odds scoring matrices.
Notice that positive values for identity range from 4 to 11 and negative values for those substitutions that rarely occur go as low as -4. The most conserved residue is tryptophan with a score of 11; cysteine is next with a score of 9; both proline and tyrosine get scores of 7 for identity.
BLOSUM62 amino acid substitution matrix (the default in many sequence analysis programs).
    A   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   X   Y
A   4   0  -2  -1  -2   0  -2  -1  -1  -1  -1  -2  -1  -1  -1   1   0   0  -3  -1  -2
C   0   9  -3  -4  -2  -3  -3  -1  -3  -1  -1  -3  -3  -3  -3  -1  -1  -1  -2  -1  -2
D  -2  -3   6   2  -3  -1  -1  -3  -1  -4  -3   1  -1   0  -2   0  -1  -3  -4  -1  -3
E  -1  -4   2   5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -2  -3  -1  -2
F  -2  -2  -3  -3   6  -3  -1   0  -3   0   0  -3  -4  -3  -3  -2  -2  -1   1  -1   3
G   0  -3  -1  -2  -3   6  -2  -4  -2  -4  -3   0  -2  -2  -2   0  -2  -3  -2  -1  -3
H  -2  -3  -1   0  -1  -2   8  -3  -1  -3  -2   1  -2   0   0  -1  -2  -3  -2  -1   2
I  -1  -1  -3  -3   0  -4  -3   4  -3   2   1  -3  -3  -3  -3  -2  -1   3  -3  -1  -1
K  -1  -3  -1   1  -3  -2  -1  -3   5  -2  -1   0  -1   1   2   0  -1  -2  -3  -1  -2
L  -1  -1  -4  -3   0  -4  -3   2  -2   4   2  -3  -3  -2  -2  -2  -1   1  -2  -1  -1
M  -1  -1  -3  -2   0  -3  -2   1  -1   2   5  -2  -2   0  -1  -1  -1   1  -1  -1  -1
N  -2  -3   1   0  -3   0   1  -3   0  -3  -2   6  -2   0   0   1   0  -3  -4  -1  -2
P  -1  -3  -1  -1  -4  -2  -2  -3  -1  -3  -2  -2   7  -1  -2  -1  -1  -2  -4  -1  -3
Q  -1  -3   0   2  -3  -2   0  -3   1  -2   0   0  -1   5   1   0  -1  -2  -2  -1  -1
R  -1  -3  -2   0  -3  -2   0  -3   2  -2  -1   0  -2   1   5  -1  -1  -3  -3  -1  -2
S   1  -1   0   0  -2   0  -1  -2   0  -2  -1   1  -1   0  -1   4   1  -2  -3  -1  -2
T   0  -1  -1  -1  -2  -2  -2  -1  -1  -1  -1   0  -1  -1  -1   1   5   0  -2  -1  -2
V   0  -1  -3  -2  -1  -3  -3   3  -2   1   1  -3  -2  -2  -3  -2   0   4  -3  -1  -1
W  -3  -2  -4  -3   1  -2  -2  -3  -3  -2  -1  -4  -4  -2  -3  -3  -2  -3  11  -1   2
X  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1
Y  -2  -2  -3  -2   3  -3   2  -1  -2  -1  -1  -2  -3  -1  -2  -2  -2  -1   2  -1   7
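As a small illustration of how such a matrix is used, the Python sketch below scores two already aligned, ungapped peptides by summing the matrix value for each residue pair. Only a handful of entries are copied from the BLOSUM62 table above; the dictionary `BLOSUM62_SUBSET` and both function names are invented for this example, and a real program would load the full matrix.

```python
# A few entries copied from the BLOSUM62 table; keys are alphabetically
# sorted residue pairs because the matrix is symmetric.
BLOSUM62_SUBSET = {
    ("W", "W"): 11, ("C", "C"): 9, ("P", "P"): 7, ("Y", "Y"): 7,
    ("P", "Y"): -3, ("W", "Y"): 2, ("C", "W"): -2,
}


def pair_score(x, y):
    """Look up the substitution score for residues x and y."""
    return BLOSUM62_SUBSET[tuple(sorted((x, y)))]


def ungapped_score(a, b):
    """Sum substitution scores over two equal-length aligned peptides."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(pair_score(x, y) for x, y in zip(a, b))


if __name__ == "__main__":
    # W/W = 11, C/C = 9, P/Y = -3
    print(ungapped_score("WCP", "WCY"))
```

Note how a conservative replacement such as tryptophan for tyrosine still earns a positive score, which identity scoring could never express.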
We can imagine screening databases for sequences similar to ours using the concepts of dynamic programming and log-odds scoring matrices and yet to be described algorithmic tricks.
But why even bother? Inference through homology is a fundamental principle of biology!
When a sequence is found to fall into a preexisting family we may be able to infer function, mechanism, evolution, perhaps even structure, based on homology with its neighbors.
Independent of all that, what is a ‘good’ alignment?
So, first, significance: when is any alignment worth anything biologically?
An old statistics trick, Monte Carlo simulations:

$$
Z\ \text{score} = \frac{(\text{actual score}) - (\text{mean of randomized scores})}{(\text{standard deviation of randomized score distribution})}
$$
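The Monte Carlo idea can be sketched in Python as follows. This is a toy under stated assumptions, not what BestFit or Gap do internally: the scoring function is a plain count of identical positions rather than a real alignment score, and the names `identity_score` and `z_score` plus the `seed` parameter are introduced only for this example.

```python
import random
import statistics


def identity_score(a, b):
    """Toy score: number of identical positions in an ungapped comparison."""
    return sum(x == y for x, y in zip(a, b))


def z_score(a, b, score=identity_score, shuffles=100, seed=0):
    """Z score of score(a, b) against scores for shuffled copies of b.

    Shuffling keeps the length and composition of b but destroys its
    order, giving a sample of 'random' scores to compare against.
    """
    rng = random.Random(seed)
    actual = score(a, b)
    letters = list(b)
    randomized = []
    for _ in range(shuffles):
        rng.shuffle(letters)
        randomized.append(score(a, "".join(letters)))
    mean = statistics.mean(randomized)
    sd = statistics.stdev(randomized)
    if sd == 0:
        raise ValueError("randomized scores have zero spread")
    return (actual - mean) / sd


if __name__ == "__main__":
    print(z_score("ACGT" * 10, "ACGT" * 10))
```

Any alignment scorer could be passed in through the `score` argument; the more shuffles, the steadier the estimate.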
The Wisconsin Package dynamic programming tools:
BestFit: Smith Waterman local alignments,
Gap: Needleman Wunsch global alignments,
FrameAlign: nucleotide to protein, either local or global.
I’ll illustrate in SeqLab with the same previous example, but at the command line:
bestfit sw:ef11_human sw:ef1a_schco -shuffle=100
The Normal distribution:
Many Z scores measure the distance from the mean using this simplistic Monte Carlo model assuming a Gaussian distribution, a.k.a. the Normal distribution (http://mathworld.wolfram.com/NormalDistribution.html), in spite of the fact that ‘sequence-space’ actually follows what is known as the ‘Extreme Value distribution.’
Regardless, Monte Carlo methods approximate significance estimates pretty well.
[Figure: histogram of alignment scores, drawn with ‘=’ bars and ‘*’ markers for each score bin.]
The ‘take-home’ message is . . .
‘Sequence-space’ (Huh, what’s that?) actually follows the ‘Extreme Value distribution’ (http://mathworld.wolfram.com/ExtremeValueDistribution.html).
Based on this known statistical distribution, and robust statistical methodology, a realistic Expectation function, the E Value, can be calculated from database searches.
The Expectation Value?
The higher the E value is, the more probable that the observed match is due to chance in a search of the same size database, and the lower its Z score will be, i.e. it is NOT significant. Therefore, the smaller the E value, i.e. the closer it is to zero, the more significant it is and the higher its Z score will be! The E value is the number that really matters.
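For a feel of how an E value behaves, here is a short Python sketch based on one widely used formulation, the Karlin-Altschul statistics behind BLAST, in which the expected number of chance hits is the search space size times 2 raised to minus the normalized (bit) score. The slides do not say which formula the Wisconsin Package programs use, so treat this purely as an assumption for illustration; the function names are invented.

```python
import math


def e_value(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least bit_score.

    Assumes the Karlin-Altschul form E = m * n * 2**(-S'), with m the
    query length, n the total database length, and S' the bit score.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


def p_value(e):
    """Probability of seeing at least one such chance hit, 1 - exp(-E)."""
    return -math.expm1(-e)


if __name__ == "__main__":
    for bits in (10, 30, 50):
        e = e_value(bits, query_len=100, db_len=1024)
        print(bits, e, p_value(e))
```

The same score becomes less significant as the database grows, which is why probabilities depend on the size of the database searched.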
Rules of thumb for a protein search:
The Z score represents the number of standard deviations some particular alignment is from a distribution of random alignments (often the Normal distribution).
They very roughly correspond to the listed E Values (based on the Extreme Value distribution) for a typical protein sequence similarity search. But remember, probabilities are dependent on the size and composition of the database and even on how often you search!
On to the searches:
How can you search the databases for similar sequences, if pair-wise alignments take $N^2$ time?!
Database searching programs use the two concepts of dynamic programming and log-odds scoring matrices; however, dynamic programming takes far too long when used against most sequence databases with a ‘normal’ computer. Remember how huge the databases are!
Therefore, the programs use tricks to make things happen faster. These tricks fall into two main categories, hashing and heuristics.
Corn beef hash? Huh . . .
Hashing is the process of breaking your sequence into small ‘words’ or ‘k-tuples’ (think all chopped up, just like corn beef hash) of a set size and creating a ‘look-up’ table with those words keyed to position numbers. Computers can deal with numbers way faster than they can deal with strings of letters, and this preprocessing step happens very quickly.
Then when any of the word positions match part of an entry in the database, that match, the ‘offset,’ is saved. In general, hashing reduces the complexity of the search problem from $N^2$ for dynamic programming to N, the length of all the sequences in the database.
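A minimal Python sketch of the hashing step might look like this. It is not the code used by FastA or BLAST; the function names `build_lookup` and `word_offsets` are invented, and the ‘offset’ is taken here to be the query position minus the database position, so that hits on the same diagonal share the same offset.

```python
from collections import defaultdict


def build_lookup(seq, k):
    """Map every k-tuple (word) in seq to the list of positions where it starts."""
    table = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        table[seq[pos:pos + k]].append(pos)
    return dict(table)


def word_offsets(query, target, k):
    """Count word hits per offset (diagonal) between query and target.

    The target is scanned once; each of its words is looked up in the
    query table, so the work grows with the target length rather than
    with the product of the two lengths.
    """
    table = build_lookup(query, k)
    counts = defaultdict(int)
    for t_pos in range(len(target) - k + 1):
        for q_pos in table.get(target[t_pos:t_pos + k], []):
            counts[q_pos - t_pos] += 1
    return dict(counts)


if __name__ == "__main__":
    print(build_lookup("ACGTAC", 2))
    print(word_offsets("ACGT", "TTACGT", 2))
```

An offset that collects many hits points to a diagonal worth a closer look, which is where the heuristics take over.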
OK. Heuristics . . . What’s that?
Approximation techniques are collectively known as ‘heuristics.’ Webster’s defines heuristic as “serving to guide, discover, or reveal; . . . but unproved or incapable of proof.”
In database similarity searching techniques the heuristic usually restricts the necessary search space by calculating some sort of a statistic that allows the program to decide whether further scrutiny of a particular match should be pursued. This statistic may miss things depending on the parameters set; that’s what makes it heuristic. ‘Worthwhile’ results at the end are compiled and the longest alignment within the program’s restrictions is created.
The exact implementation varies between the different programs, but the basic idea holds in almost all of them.
Two predominant versions exist: BLAST and Fast
Both return local alignments, and are not a single program, but rather a family of programs with implementations designed to compare a sequence to a database in about every which way imaginable.
These include:
1) a DNA sequence against a DNA database (not recommended unless forced to do so because you are dealing with a non-translated region of the genome; DNA is just too darn noisy, only identity & four bases!),
2) a translated (where the translation is done ‘on-the-fly’ in all six frames) version of a DNA sequence against a translated (‘on-the-fly’ six-frame) version of the DNA database (not available in the Fast package),
3) a translated (‘on-the-fly’ six-frame) version of a DNA sequence against a protein database,
4) a protein sequence against a translated (‘on-the-fly’ six-frame) version of a DNA database,
5) or a protein sequence against a protein database.
Many implementations allow for the possibility of frame shifts in translated comparisons and don’t penalize the score for doing so.
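The ‘on-the-fly’ six-frame translation mentioned in the list above can be sketched in Python. This sketch assumes the standard genetic code and plain A, C, G, T input; the names `translate` and `six_frames` are invented for illustration and are not taken from BLAST or the Fast programs.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the three forward and three reverse-complement translations."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[frame:]) for strand in (dna, reverse) for frame in range(3)]


if __name__ == "__main__":
    print(six_frames("ATGGCC"))
```

Each of the six protein strings can then be compared against a protein database exactly as an ordinary peptide would be.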
The BLAST and Fast programs: some generalities
BLAST (Basic Local Alignment Search Tool), developed at NCBI.
1) Normally NOT a good idea to use for DNA against DNA searches w/o translation (not optimized);
2) Pre-filters repeat and “low complexity” sequence regions;
4) Can find more than one region of gapped similarity;
5) Very fast heuristic and parallel implementation;
6) Restricted to precompiled, specially formatted databases.
FastA, and its family of relatives, developed by Bill Pearson at the University of Virginia.
1) Works well for DNA against DNA searches (within limits of possible sensitivity);
2) Can find only one gapped region of similarity;
3) Relatively slow, should often be run in the background;
4) Does not require specially prepared, preformatted databases.
The algorithms, in brief:
BLAST:
Two word hits on the same diagonal above some similarity threshold trigger ungapped extension until the score isn’t improved enough above another threshold: the HSP.
Initiate gapped extensions using dynamic programming for those HSPs above a third threshold up to the point where the score starts to drop below a fourth threshold: yields alignment.
Fast:
Find all ungapped exact word hits; maximize the ten best continuous regions’ scores: init1.
Combine non-overlapping init regions on different diagonals: initn.
Use dynamic programming ‘in a band’ for all regions with initn scores better than some threshold: opt score.
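The ungapped extension step can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not BLAST’s actual code: it extends in one direction only, uses simple match/mismatch scores instead of a substitution matrix, and stops once the running score has fallen more than a chosen amount below the best score seen. The name `extend_right` and the parameter `x_drop` are invented here.

```python
def extend_right(a, b, i, j, match=1, mismatch=-1, x_drop=2):
    """Extend an ungapped hit rightwards from a[i] and b[j].

    Returns (best_score, length) for the best-scoring extension found
    before the running score dropped more than x_drop below the best.
    """
    score = best = 0
    best_len = 0
    k = 0
    while i + k < len(a) and j + k < len(b):
        score += match if a[i + k] == b[j + k] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score > x_drop:
            break                      # not improving enough: give up
    return best, best_len


if __name__ == "__main__":
    print(extend_right("ACGTTTTT", "ACGTAAAA", 0, 0))
    print(extend_right("ACGTAACGT", "ACGTCACGT", 0, 0))
```

In the second call a single mismatch is tolerated because the score recovers before the drop-off limit is reached.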
I’ll illustrate with FastA:
FastA of human elongation factor 1 alpha searched against that list file of primitive organism proteins from SwissProt.
I’ll show SeqLab’s implementation, but at the command line it would be:
fasta sw:ef11_human @primitive.list
Multiple Sequence Analysis:
Multiple Sequence Alignment.
Dynamic programming’s complexity increases exponentially with the number of sequences being compared. N-dimensional matrix . . . .
Therefore, pairwise, progressive dynamic programming restricts the solution to the neighborhood of only two sequences at a time.
All sequences are compared, pairwise, and then each is aligned to its most similar partner or group of partners. Each group of partners is then aligned to finish the complete multiple sequence alignment.
PileUp is the Wisconsin Package’s implementation of pairwise progressive multiple sequence alignment.
Let’s run PileUp on our ‘primitive’ dataset in SeqLab. At the command line this would be:
pileup @primitive.list
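The first stage of a progressive method, comparing all sequences pairwise and ranking the pairs by similarity, can be sketched in Python. This is a crude stand-in and not how PileUp works internally: real programs score each pair with a full pairwise alignment and then cluster, whereas this sketch simply counts identical positions in an ungapped comparison. The names `identities` and `join_order` are invented.

```python
from itertools import combinations


def identities(a, b):
    """Crude similarity: identical positions in an ungapped comparison."""
    return sum(x == y for x, y in zip(a, b))


def join_order(seqs):
    """Return all index pairs, most similar pair first.

    A progressive aligner would align the top pair first and then add
    the remaining sequences or groups in order of decreasing similarity.
    """
    pairs = combinations(range(len(seqs)), 2)
    return sorted(pairs,
                  key=lambda p: identities(seqs[p[0]], seqs[p[1]]),
                  reverse=True)


if __name__ == "__main__":
    print(join_order(["ACGT", "ACGA", "TTTT"]))
```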
The consensus and motifs:
Conserved regions can be visualized with a sliding window approach and appear as peaks.
P-Loop
Let’s concentrate on the first peak seen here to simplify matters.
Motifs (a.k.a. signatures)
GHVDHGKS
A consensus isn’t necessarily the biologically “correct” combination. Therefore, build one-dimensional ‘pattern descriptors.’
PROSITE Database of protein families and domains: over 1,000 motifs.
This motif, the P-loop, is defined: (A,G)x4GK(S,T), i.e. either an Alanine or a Glycine, followed by four of anything, followed by an invariant Glycine-Lysine pair, followed by either a Serine or a Threonine.
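A pattern descriptor like this maps naturally onto a regular expression. The Python sketch below hand-translates the P-loop definition (A,G)x4GK(S,T) into `[AG].{4}GK[ST]`; it is an illustration only, not the Motifs program, and the function name `find_p_loops` and the test string are made up for the example.

```python
import re

# (A,G) x4 G K (S,T): A or G, any four residues, then G, K, and S or T.
P_LOOP = re.compile(r"[AG].{4}GK[ST]")


def find_p_loops(protein):
    """Return (start, matched_text) for each non-overlapping P-loop match."""
    return [(m.start(), m.group()) for m in P_LOOP.finditer(protein)]


if __name__ == "__main__":
    # The GHVDHGKS stretch from the slide, padded with placeholder residues.
    print(find_p_loops("XXGHVDHGKSXX"))
```

A pattern either matches or it doesn’t, which is exactly the limitation that motivates a richer description of each position.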
Discover motifs in ‘ungapped’ Discover motifs in ‘ungapped’ sequences with the program sequences with the program Motifs in the Wisconsin Motifs in the Wisconsin Package —Package —
Again I’ll show you in SeqLab, but at the command line:
motifs sw:ef11_human
Enter the Profile
But motifs cannot convey any degree of the ‘importance’ of the residues. Use a position-specific, two-dimensional matrix where conserved areas of the alignment receive the most importance and variable regions hardly matter!
The threonine at position 27 is absolutely conserved; it gets the highest score, 150! The aspartate at position 22 substituted with a tryptophan would never happen, -87. Tryptophan is the most conserved residue in all matrix series and aspartate 22 is conserved throughout the alignment: the negative matrix score of any substitution to tryptophan, times the high conservation at that position for aspartate, equals the most negative score in the profile. Position 16 has a valine assigned because it has the highest score, 37, but glycine also occurs several times, a score of 20. However, other residues are ranked in the substitution matrices as being quite similar to valine; therefore isoleucine and leucine also get similar scores, 24 and 14, and alanine occurs some of the time in the alignment so it gets a comparable score, 15.
Cons    A    B    C    D    E    F    G    H    I    K    L    M    N    P    Q    R    S    T    V    W    Y    Z  Gap  Len
E      11   20  -11   27   33  -21   16   10   -4   10   -9   -6   16    6   18    0    8   17   -3  -29  -15   26   12   12
K       0   27  -40   21   22  -47   -6    7  -13  100  -20   13   27    7   27   53   14   13  -13    5  -40   28   12   12
! 11
P      13    3    4    3    3  -13    9    2    3    3   -2   -1    1   28    4    3   11   20    9  -21  -16    4   12   12
H      -7   26   -6   26   26   -6  -14   99  -18    6  -12  -19   33   13   46   33  -13   -6  -19   -7   20   33   12   12
I       3   -7    2   -7   -6   19   -6   -9   43   -7   29   22  -10   -4   -6  -10   -4    6   38  -17    1   -5   12   12
N      14   73  -19   47   33  -34   27   33  -20   27  -27  -20  100    0   26    7   22   14  -20  -20   -7   27   12   12
I       1  -10   -1  -10   -8   26   -9  -10   46   -8   34   27  -12   -6   -8  -12   -6    5   40  -12    4   -7   12   12
V      15    2    7    3    1   -1   20   -9   24   -6   14   11   -3    6   -3  -11    4   10   37  -30   -9   -1   12   12
V       9   -4    7   -5   -4    5    7   -8   29   -4   20   15   -6    4   -7   -9    0   19   36  -21   -2   -5   12   12
I       0  -16   16  -16  -16   55  -24  -24  118  -16   63   47  -24  -16  -24  -24   -8   16   87  -39    8  -16   12   12
G      55   47   16   55   39  -47  118  -16  -24   -8  -39  -24   31   24   16  -24   47   31   16  -79  -55   24   12   12
H      -6   27   -7   27   27   -8  -13  100  -20    7  -13  -20   34   14   48   34  -13   -7  -20   -7   19   34   12   12
! 21
V      11  -12   12  -12  -12   13   11  -18   67  -12   48   36  -18    5  -12  -18   -6   12   89  -47   -6  -12   12   12
D      24   87  -39  118   79  -79   55   31  -16   24  -39  -31   55    8   55    0   16   16  -16  -87  -39   71   12   12
S       9   12   11   11   11   -8    8   22   -7    5  -10  -10   14   11   11    9   23    4   -6    1   -2    9   12   12
G      55   47   16   55   39  -47  118  -16  -24   -8  -39  -24   31   24   16  -24   47   31   16  -79  -55   24   12   12
K       0   27  -40   20   20  -47   -7    7  -14  100  -20   13   27    7   27   55   13   13  -14    8  -40   27   12   12
S      19   14   30   10   10  -14   27   -9   -2   10  -17  -12   14   19   -5    3   63   24   -2    7  -19    1  100  100
T      40   20   20   20   20  -30   40  -10   20   20  -10    0   20   30  -10  -10   30  150   20  -60  -30   10  100  100
T       8   -4   -9   -4    0   13    1   -6   18    0   23   22   -2    2   -4   -9    0   34   18   -6   -2   -1  100  100
T      19    8   10    8    8  -12   19   -6   16    8    1    4    7   14   -6   -6   13   69   18  -32  -14    3  100  100
G      40   24   10   28   21  -27   61   -8  -11   -4  -19  -11   16   16    9  -14   26   18    9  -44  -28   13  100  100
! 31
H      10   11   -1   11   11  -10    1   34   -8    7   -8   -5   13   11   19   18    0    1   -6   -1    0   14  100  100
L      -4  -20  -27  -20  -13   50  -21  -10   43  -13   62   53  -17  -13   -7  -17  -15   -2   40   13   12   -9  100  100
*      20    0    0   27   12    3   73   70   65   46   38    0   24   11    5    6   33   85   65    0    0    0
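Reading a profile is just a two-dimensional lookup: pick the row for the alignment position, then the column for the residue. The sketch below copies three rows from the profile above (positions 16, 22, and 27, with the Gap and Len columns left out) and reproduces the scores quoted in the text. The names `COLUMNS`, `ROWS`, and `score` are made up for the example; this is not how the Wisconsin Package stores or scores profiles internally.

```python
# Residue columns in the order used by the profile header.
COLUMNS = "A B C D E F G H I K L M N P Q R S T V W Y Z".split()

# position -> (consensus residue, scores for each column above)
ROWS = {
    16: ("V", [15, 2, 7, 3, 1, -1, 20, -9, 24, -6, 14, 11,
               -3, 6, -3, -11, 4, 10, 37, -30, -9, -1]),
    22: ("D", [24, 87, -39, 118, 79, -79, 55, 31, -16, 24, -39, -31,
               55, 8, 55, 0, 16, 16, -16, -87, -39, 71]),
    27: ("T", [40, 20, 20, 20, 20, -30, 40, -10, 20, 20, -10, 0,
               20, 30, -10, -10, 30, 150, 20, -60, -30, 10]),
}

def score(position, residue):
    """Profile score for placing `residue` at alignment `position`."""
    return ROWS[position][1][COLUMNS.index(residue)]

print(score(27, "T"))                      # conserved threonine
print(score(22, "W"))                      # tryptophan replacing aspartate
print([score(16, r) for r in "VGILA"])     # tolerated residues at 16
```

Scoring a whole candidate sequence against a profile amounts to summing such lookups along an alignment, with the Gap and Len columns setting position-specific gap penalties.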
Advanced methodologies: wondrous stuff based on combinations of the previous techniques, e.g. PSI-BLAST uses profile methods to iterate database searches.
Profiles can be optimized with hidden Markov models (HMMs) or even discovered in unaligned sequences using expectation maximization (MEME).
Exon and intron structure can be predicted. See e.g. the genefinder at http://genomic.sanger.ac.uk/gf/gf.html and GrailEXP at http://grail.lsd.ornl.gov/grailexp/.
Secondary structure can often be predicted. See http://www.embl-heidelberg.de/predictprotein/predictprotein.html, which uses multiple sequence alignment profile techniques along with neural net technology. Even three-dimensional “homology modeling” will often lead to remarkably accurate results if the similarity is great enough between your protein and one in which the structure has been solved through experimental means. See SwissModel at http://www.expasy.ch/swissmod/SWISS-MODEL.html.
Evolutionary relationships can be ascertained using a multiple sequence alignment and the methods of molecular phylogenetics. See the PAUP* and PHYLIP software packages. And if you’re really interested in this topic check out the Workshop on Molecular Evolution offered every August at the Woods Hole Marine Biological Laboratory and/or similar courses worldwide.
Finally, what’s the deal with DNA versus protein for searches and alignment?
All database similarity searching and sequence alignment, regardless of the algorithm, is far more sensitive at the amino acid level than with DNA. This is because proteins have twenty match criteria versus DNA’s four, and those four DNA bases can generally only be identical, not similar, to each other; and many DNA base changes (especially third position changes) do not change the encoded protein.
All of these factors drastically increase the ‘noise’ level of a DNA against DNA search, and give protein searches a much greater ‘look-back’ time, at least doubling it. Therefore, whenever dealing with coding sequence, it is always prudent to work at the protein level!
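The point about third-position changes is easy to see with a toy example. In the sketch below, two made-up DNA fragments differ at every third codon position, yet both encode the same peptide, so identity is much lower at the DNA level than at the protein level. The codon table is only the subset of the standard genetic code needed here, and the sequences `dna_1` and `dna_2` are hypothetical.

```python
# Subset of the standard genetic code: only the codons used below.
CODONS = {
    "GGT": "G", "GGC": "G",   # glycine
    "AAA": "K", "AAG": "K",   # lysine
    "TCT": "S", "TCA": "S",   # serine
}

def translate(dna):
    """Translate a DNA string codon by codon (length must be a multiple of 3)."""
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3))

def percent_identity(a, b):
    """Percent of identical positions between two equal-length strings."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

dna_1 = "GGTAAATCT"
dna_2 = "GGCAAGTCA"  # differs from dna_1 at each third codon position

print(percent_identity(dna_1, dna_2))
print(percent_identity(translate(dna_1), translate(dna_2)))
```

The DNA comparison loses a third of its matches to changes that are invisible in the protein, which is the extra ‘noise’ described above.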
FOR MORE INFO...
See http://bio.fsu.edu/~stevet/workshop.html and contact me (stevet@bio.fsu.edu) for further bioinformatics assistance.
Conclusions: A comprehensive sequence analysis software suite, such as the Wisconsin Package, expedites bioinformatics, putting a large assortment of tools all under one organizational model with one user interface.
The better you understand the chemical, physical, and biological system under study, the better your chance of success in its analysis. Certain strategies are inherently more appropriate than others. Making these types of subjective, discriminatory decisions is one of the most important ‘take-home’ messages I can offer!
Gunnar von Heijne in his old but quite readable treatise, Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit (1987), provides a very appropriate conclusion:
“Think about what you’re doing; use your knowledge of the molecular system involved to guide both your interpretation of results and your direction of inquiry; use as much information as possible; and do not blindly accept everything the computer offers you.”
References
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990) Basic Local Alignment Search Tool. Journal of Molecular Biology 215, 403-410.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a New Generation of Protein Database Search Programs. Nucleic Acids Research 25, 3389-3402.
Bairoch, A. (1992) PROSITE: A Dictionary of Sites and Patterns in Proteins. Nucleic Acids Research 20, 2013-2018.
Felsenstein, J. (1993) PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Dept. of Genetics, University of Washington, Seattle, Washington, U.S.A.
Genetics Computer Group (GCG), Inc. (Copyright 1982-2000) Program Manual for the Wisconsin Package, Version 10.1, Madison, Wisconsin, USA 53711.
Gribskov, M. and Devereux, J., editors (1992) Sequence Analysis Primer. W.H. Freeman and Company, New York, N.Y., U.S.A.
Gribskov, M., McLachlan, A.D., and Eisenberg, D. (1987) Profile analysis: detection of distantly related proteins. Proceedings of the National Academy of Sciences U.S.A. 84, 4355-4358.
Henikoff, S. and Henikoff, J.G. (1992) Amino Acid Substitution Matrices from Protein Blocks. Proceedings of the National Academy of Sciences U.S.A. 89, 10915-10919.
Needleman, S.B. and Wunsch, C.D. (1970) A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. Journal of Molecular Biology 48, 443-453.
Pearson, P., Francomano, C., Foster, P., Bocchini, C., Li, P., and McKusick, V. (1994) The Status of Online Mendelian Inheritance in Man (OMIM) medio 1994. Nucleic Acids Research 22, 3470-3473.
Pearson, W.R. and Lipman, D.J. (1988) Improved Tools for Biological Sequence Analysis. Proceedings of the National Academy of Sciences U.S.A. 85, 2444-2448.
Rost, B. and Sander, C. (1993) Prediction of Protein Secondary Structure at Better than 70% Accuracy. Journal of Molecular Biology 232, 584-599.
Smith, S.W., Overbeek, R., Woese, C.R., Gilbert, W., and Gillevet, P.M. (1994) The Genetic Data Environment, an Expandable GUI for Multiple Sequence Analysis. CABIOS 10, 671-675.
Schwartz, R.M. and Dayhoff, M.O. (1979) Matrices for Detecting Distant Relationships. In Atlas of Protein Sequence and Structure (M.O. Dayhoff, editor) 5, Suppl. 3, 353-358, National Biomedical Research Foundation, Washington D.C., U.S.A.
Smith, T.F. and Waterman, M.S. (1981) Comparison of Bio-Sequences. Advances in Applied Mathematics 2, 482-489.
Sundaralingam, M., Mizuno, H., Stout, C.D., Rao, S.T., Liedman, M., and Yathindra, N. (1976) Mechanisms of Chain Folding in Nucleic Acids. The Omega Plot and its Correlation to the Nucleotide Geometry in Yeast tRNAPhe1. Nucleic Acids Research 10, 2471-2484.
Swofford, D.L., PAUP (Phylogenetic Analysis Using Parsimony) (1989-1993) Illinois Natural History Survey, (1994) personal copyright, and (1997) Smithsonian Institution, Washington D.C., U.S.A.
Thompson, J.D., Higgins, D.G., and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22, 4673-4680.
von Heijne, G. (1987) Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit. Academic Press, Inc., San Diego, California, U.S.A.
Wilbur, W.J. and Lipman, D.J. (1983) Rapid Similarity Searches of Nucleic Acid and Protein Data Banks. Proceedings of the National Academy of Sciences U.S.A. 80, 726-730.
Zuker, M. (1989) On Finding All Suboptimal Foldings of an RNA Molecule. Science 244, 48-52.