A BioInformatics Survey . . . some taste of theory, and a few practicalities
Steve Thompson
Florida State University School of Computational Science (SCS)
BCH 5405 Molecular Biology & Biotechnology, Dr. Qing-Xiang (Amy) Sang
Mon. & Wed., March 24 & 26, 2008
To begin, some terminology:
What is bioinformatics, genomics, proteomics, sequence analysis, computational molecular biology . . . ?
My definitions, lots of overlap:
Biocomputing and computational biology are synonyms and describe the use of computers and computational techniques to analyze any type of biological system, from individual molecules to organisms to overall ecology.
Bioinformatics describes using computational techniques to access, analyze, and interpret the biological information in any type of biological database.
Sequence analysis is the study of molecular sequence data for the purpose of inferring the function, interactions, evolution, and perhaps structure of biological molecules.
Genomics analyzes the context of genes or complete genomes (the total DNA content of an organism) within the same and/or across different genomes.
Proteomics is the subdivision of genomics concerned with analyzing the complete protein complement, i.e. the proteome, of organisms, both within and between different organisms.
And one way to think about it: the Reverse Biochemistry Analogy
Biochemists no longer have to begin a research project by isolating and purifying massive amounts of a protein from its native organism in order to characterize a particular gene product. Rather, scientists can now amplify a section of some genome based on its similarity to other genomes, sequence that piece of DNA and, using sequence analysis tools, infer all sorts of functional, evolutionary, and, perhaps, structural insight into that stretch of DNA!
The computer and molecular databases are a necessary, integral part of this entire process.
The exponential growth of molecular sequence databases and CPU power: doubling time about a year and a half!

Year    BasePairs      Sequences
1982    680338         606
1983    2274029        2427
1984    3368765        4175
1985    5204420        5700
1986    9615371        9978
1987    15514776       14584
1988    23800000       20579
1989    34762585       28791
1990    49179285       39533
1991    71947426       55627
1992    101008486      78608
1993    157152442      143492
1994    217102462      215273
1995    384939485      555694
1996    651972984      1021211
1997    1160300687     1765847
1998    2008761784     2837897
1999    3841163011     4864570
2000    11101066288    10106023
2001    15849921438    14976310
2002    28507990166    22318883
2003    36553368485    30968418
2004    44575745176    40604319
2005    56037734462    52016762
2006    69019290705    64893747
2007    83874179730    80388382
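The doubling-time claim can be checked against the table itself. The sketch below assumes simple exponential growth between the first and last rows of the base pair column (1982 and 2007); the variable names are illustrative only.

```python
import math

# First and last rows of the growth table (year, base pairs).
first_year, first_bp = 1982, 680_338
last_year, last_bp = 2007, 83_874_179_730

# Exponential growth: bp(t) = bp(0) * 2 ** (t / T), so T = t / log2(ratio).
doublings = math.log2(last_bp / first_bp)
doubling_time = (last_year - first_year) / doublings

print(f"{doublings:.1f} doublings in {last_year - first_year} years")
print(f"average doubling time: {doubling_time:.2f} years")
```

Averaged over the whole period, the result lands close to the "year and a half" quoted on the slide; individual years grow faster or slower than that.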
The International Human Genome Sequencing Consortium announced the completion of the "Working Draft" of the human genome in June 2000; independently that same month, the private company Celera Genomics announced that it had completed the first “Assembly” of the human genome. The classic articles were published mid-February 2001 in the journals Science and Nature.
Genome projects have kept the data coming at an incredible rate. Currently around 50 Archaea, 600 Bacteria, and 20 Eukaryote complete genomes, and 200 Eukaryote assemblies are represented, not counting the almost 3,000 virus and viroid genomes available.
Some neat stuff from the human genome papers:
Homo sapiens aren’t nearly as special as we once thought. Of the 3.2 billion base pairs in our DNA:
Traditional gene number estimates were often in the 100,000 range; it turns out we’ve only got about twice as many as a fruit fly, between 25,000 and 30,000!
The protein coding region of the genome is only about 1% or so; a bunch of the remainder is ‘jumping,’ ‘junk,’ or ‘selfish’ DNA, much of which may be involved in regulation and control.
Some 100-200 genes were transferred from an ancestral bacterial genome to an ancestral vertebrate genome! (Later shown to be false by more extensive analyses, and to be due to gene loss, not transfer.)
NCBI’s Entrez
Sequence databases are an organized way to store exponentially accumulating sequence data. An ‘alphabet soup’ of three major organizations maintains them. They largely ‘mirror’ one another and share accession codes, but NOT proper identifier names:
North America: the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM) at the National Institutes of Health (NIH), maintains the GenBank (& WGS) nucleotide, GenPept amino acid, and RefSeq genome, transcriptome, and proteome databases.
Europe: the European Molecular Biology Laboratory (EMBL), the European Bioinformatics Institute (EBI), and the Swiss Institute of Bioinformatics (SIB) all help maintain the EMBL nucleotide sequence database and the UNIPROT (SWISS-PROT + TrEMBL) amino acid sequence database (with USA PIR/NBRF support also).
Asia: the National Institute of Genetics (NIG) supports the Center for Information Biology’s (CIB) DNA Data Bank of Japan (DDBJ).
Let’s start with sequence databases:
A little history: The first well recognized sequence database was Dr. Margaret Dayhoff’s hardbound Atlas of Protein Sequence and Structure, begun in the mid-sixties. That became PIR. DDBJ began in 1984, GenBank in 1982, and EMBL in 1980. They are all attempts at establishing an organized, reliable, comprehensive, and openly available library of genetic sequences.
Sequence databases have long since outgrown a hardbound atlas that you can pull off a library shelf. They have become gargantuan and have evolved through many, many changes.
What are sequence databases like?
Just what are primary sequences? (Central Dogma: DNA -> RNA -> protein)
Primary refers to one dimension: all of the ‘symbol’ information written in sequential order necessary to specify a particular biological molecular entity, be it polypeptide or nucleotide.
The symbols are the one letter codes for all of the biological nitrogenous bases and amino acid residues and their ambiguity codes. Biological carbohydrates, lipids, and structural and functional information are not sequence data. Not even DNA CDS translations in a DNA database are sequence data!
However, much of this feature and bibliographic type information is available in the reference documentation sections associated with primary sequences in the databases.
Sequence database installations are commonly a complex ASCII/Binary mix, and Web-based ones are often relational or Object Oriented. They usually consist of several very long text files each containing different types of related information, such as all of the sequences themselves, versus all of the title lines, or all of the reference sections. Binary files often help ‘glue together’ all of these other files by providing indexing functions.
Software is required to successfully interact with these databases, and access is most easily handled through various software packages and interfaces, on the World Wide Web or otherwise.
[Figure labels: TrEMBL (with help from PIR); GenPept]
Nucleic acid sequence databases are split into subdivisions based on taxonomy and data type. TrEMBL sequences are merged into SWISS-PROT as they receive increased levels of annotation. Both together comprise UNIPROT. GenPept has minimal annotation.
Important elements associated with each sequence entry:
Name: LOCUS, ENTRY, ID, all are unique identifiers.
Definition: a.k.a. title, a brief textual sequence description.
Accession Number: a constant data identifier.
Source and taxonomy information; complete literature references; comments and keywords; and the all-important FEATURE table!
A summary or checksum line, and the sequence itself.
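To make those entry elements concrete, here is a minimal sketch that pulls the name, definition, accession number, and sequence out of one GenBank-style flat-file entry. The record is a made-up toy (the locus name, accession, and sequence are all hypothetical), the function name is illustrative, and the parser ignores indented continuation lines and the FEATURE table, so it is nowhere near a complete reader.

```python
# Toy record in GenBank-like flat-file layout; every value is made up.
RECORD = """\
LOCUS       TOY0001                   24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical toy sequence for illustration only.
ACCESSION   ZZ000000
KEYWORDS    .
ORIGIN
        1 ctatataagg cgtataatcc gatt
//
"""

def parse_entry(text):
    """Return (header fields, sequence) from one flat-file entry."""
    fields = {}
    seq_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-entry marker
            break
        if in_sequence:
            # Sequence lines: a position number, then blocks of residues.
            seq_parts.extend(line.split()[1:])
        elif line.startswith("ORIGIN"):
            in_sequence = True
        elif line[:1].strip():             # keyword starts in column 1
            keyword, _, value = line.partition(" ")
            fields[keyword] = value.strip()
    return fields, "".join(seq_parts)

fields, sequence = parse_entry(RECORD)
print(fields["LOCUS"].split()[0], fields["ACCESSION"], len(sequence))
```

Other databases use different keywords (ID, ENTRY) and layouts for the same elements, so in practice an established format conversion tool or library is the safer route.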
But: Each major database as well as each major suite of software tools has its own distinct format requirements. Changes over the years are a huge hassle. Standards are argued, e.g. XML, but unfortunately, until all biologists and computer scientists worldwide agree on one standard, and all software is (re)written to that standard, neither of which is likely to happen very quickly, if ever, format issues will remain one of the most confusing and troubling aspects of working with sequence data. Specialized format conversion tools expedite the chore, but becoming familiar with some of the common formats helps a lot.
Parts and problems:
More format complications:
Indels and missing data symbols (i.e. gaps) designation discrepancy headaches: ., -, ~, ?, N, or X . . . Help!
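One common way around the gap-symbol headache is to normalize everything to a single character before passing data between programs. The sketch below is an assumption-laden illustration: it treats '.', '~', and '?' as gaps and deliberately leaves N and X alone, because those are also legitimate ambiguity codes for "any base" and "any residue". Whether '?' should count as a gap or as missing data depends on the program that wrote the file.

```python
# Symbols that different programs use for an alignment gap or missing data.
GAP_SYMBOLS = ".~?"

def normalize_gaps(aligned, gap="-"):
    """Rewrite '.', '~' and '?' as one gap character; leave residues alone."""
    table = str.maketrans({symbol: gap for symbol in GAP_SYMBOLS})
    return aligned.translate(table)

print(normalize_gaps("cg.TAtAaT."))
print(normalize_gaps("~~ACGT??NN"))
```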
Specialized ‘sequence’-type databases:
Databases that contain special types of sequence information, such as patterns, motifs, and profiles. These include: REBASE, EPD, PROSITE, BLOCKS, ProDom, Pfam . . . .
Databases that contain multiple sequence entries aligned, e.g. PopSet, RDP and ALN.
Databases that contain families of sequences ordered functionally, structurally, or phylogenetically, e.g. iProClass and HOVERGEN.
Databases of species specific sequences, e.g. the HIV Database and the Giardia lamblia Genome Project.
And on and on . . . . See Amos Bairoch’s excellent links
What about other types of biological databases?
Three-dimensional structure databases: the Protein Data Bank and Rutgers Nucleic Acid Database. And see Molecules to Go at http://molbio.info.nih.gov/cgi-bin/pdb/.
These databases contain all of the 3D atomic coordinate data necessary to define the tertiary shape of a particular biological molecule. The data is usually experimentally derived, either by X-ray crystallography or by NMR; sometimes it’s hypothetical. The source of the structure and its resolution are always given.
Secondary structure boundaries, sequence data, and reference information are often associated with the coordinate data, but it is the 3D data that really matters, not the annotation.
Molecular visualization or modeling software is required to interact with the data. It has little meaning on its own.
And still other types of bioinfo’ databases:
Consider these ‘non-molecular’ but they often link to molecules:
Reference Databases (all w/ pointers to sequences): e.g. LocusLink/Gene (integrated knowledge base); OMIM (Online Mendelian Inheritance in Man); PubMed/MedLine (over 11 million citations from more than 4 thousand bio/medical scientific journals).
Phylogenetic Tree Databases: e.g. the Tree of Life.
Metabolic Pathway Databases: e.g. WIT (What Is There), Japan’s GenomeNet KEGG (the Kyoto Encyclopedia of Genes and Genomes), and the human Reactome.
Population studies data: which strains, where, etc.
And then databases that many biocomputing people don’t even usually consider: e.g. GIS/GPS/remote sensing data, medical records, census counts, mortality and birth rates . . . .
Enter pairwise alignment, similarity searching, significance, and homology.
OK, given your own experimentally derived nucleotide or amino acid sequence, or one that you’ve found in a database, what more can we learn about its biological function?
First, just what is homology and similarity: are they the same?
Don’t confuse homology with similarity: there is a huge difference! Similarity is a statistic that describes how much two (sub)sequences are alike according to some set scoring criteria. It can be normalized to ascertain statistical significance, but it’s still just a number.
Homology implies an evolutionary relationship, more than just everything evolving from the same primordial ‘ooze.’ Reconstruct the phylogeny of the organisms or genes of interest to demonstrate homology. Better yet, show . . .
. . . match score matrix, no window).
Noise due to random composition effects contributes to confusion. To ‘clean up’ the plot consider a filtered windowing approach. A dot is placed at the middle of a window if some ‘stringency’ is met within that defined window size. Then the window is shifted one position and the entire process is repeated (zero:one match score, window of size three and a stringency level of two out of three).
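The filtered windowing approach just described can be sketched in a few lines. This is an illustrative version only, not any particular program's implementation: it uses the zero:one match score, the window of three, and the stringency of two from the text, and the function names and the test sequences (borrowed from the alignment example later in this transcript, uppercased) are arbitrary choices.

```python
def dotplot(seq1, seq2, window=3, stringency=2):
    """Return a set of (i, j) window-centre positions that pass the filter."""
    dots = set()
    half = window // 2
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            # Zero:one match score: count identities along the diagonal.
            matches = sum(
                seq1[i + k] == seq2[j + k] for k in range(window)
            )
            if matches >= stringency:
                dots.add((i + half, j + half))
    return dots

def render(seq1, seq2, dots):
    """Draw the plot as text, seq1 across the top and seq2 down the side."""
    lines = ["  " + seq1]
    for j, residue in enumerate(seq2):
        row = "".join("*" if (i, j) in dots else "." for i in range(len(seq1)))
        lines.append(residue + " " + row)
    return "\n".join(lines)

a, b = "CTATATAAGG", "CGTATAAT"
print(render(a, b, dotplot(a, b)))
```

Setting window=1 and stringency=1 reproduces the unfiltered plot, with a dot at every identity, which makes the noise reduction easy to compare.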
We can compare one molecule against another by aligning them. However, a ‘brute force’ approach just won’t work. Even without considering the introduction of gaps, the computation required to compare all possible alignments between two sequences requires time proportional to the product of the lengths of the two sequences. Therefore, if the two sequences are approximately the same length (N), this is an $N^2$ problem. To include gaps, we would have to repeat the calculation 2N times to examine the possibility of gaps at each possible position within the sequences, now an $N^{4N}$ problem. There’s no way! We need an algorithm.
Exact alignment: but how can we ‘see’ the correspondence of individual residues?
But: just what the heck is an algorithm?
Merriam-Webster’s says: “A rule of procedure for solving a problem [often mathematical] that frequently involves repetition of an operation.”
So, you could write an algorithm for tying your shoe! It’s just a set of explicit instructions for doing some routine task.
Enter the Dynamic Programming Algorithm!
Computer scientists figured it out long ago; Needleman and Wunsch applied it to the alignment of the full lengths of two sequences in 1970. An optimal alignment is defined as an arrangement of two sequences, 1 of length i and 2 of length j, such that:
1) you maximize the number of matching symbols between 1 and 2;
2) you minimize the number of indels within 1 and 2; and
3) you minimize the number of mismatched symbols between 1 and 2.
Therefore, the actual solution can be represented by:

$$
S_{ij} = s_{ij} + \max
\begin{cases}
S_{i-1,\,j-1} & \text{or} \\
\max\limits_{2 \le x \le i} \left( S_{i-x,\,j-1} + w_{x-1} \right) & \text{or} \\
\max\limits_{2 \le y \le j} \left( S_{i-1,\,j-y} + w_{y-1} \right)
\end{cases}
$$

where $S_{ij}$ is the score for the alignment ending at $i$ in sequence 1 and $j$ in sequence 2, $s_{ij}$ is the score for aligning $i$ with $j$, $w_x$ is the score for making an $x$ long gap in sequence 1, and $w_y$ is the score for making a $y$ long gap in sequence 2, allowing gaps to be any length in either sequence.
An oversimplified path matrix example:
total penalty = gap opening penalty {zero here} + ([length of gap][gap extension penalty {one here}])
Optimum Alignments:
There may be more than one best path through the matrix (and optimum doesn’t guarantee biologically correct). Starting at the top and working down, then tracing back, the two best trace-back routes define the following two alignments:

    cTATAtAagg           cTATAtAagg
    |  |||||      and       |||||
    cg.TAtAaT.           .cgTAtAaT.

With the example’s scoring scheme these alignments have a score of 5, the highest bottom-right score in the trace-back path graph, and the sum of six matches minus one interior gap. This is the number optimized by the algorithm, not any type of a similarity or identity percentage, here 75% and 62% respectively! Software will report only one optimal solution.
This was a Needleman Wunsch global solution. Smith Waterman style local solutions use negative numbers in the match matrix and pick the best diagonal within the overall graph.
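The example score can be reproduced with a short dynamic programming sketch. Several things here are assumptions rather than the slide's own code: the scoring scheme is read off the text above (match 1, mismatch 0, gap opening 0, gap extension 1), end gaps are taken to be free because only the interior gap is charged in the example, and the recurrence is the familiar three-way form that is equivalent to the general formula when the gap cost is linear. Only the score is computed; no traceback is done, and the function name is illustrative.

```python
def align_score(seq1, seq2, match=1, mismatch=0, gap_extend=1):
    """Best alignment score with linear interior gaps and free end gaps."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    # First row and column stay 0: leading end gaps cost nothing.
    S = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            s_ij = match if seq1[i - 1] == seq2[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + s_ij,      # align residue i with residue j
                S[i - 1][j] - gap_extend,    # gap in sequence 2
                S[i][j - 1] - gap_extend,    # gap in sequence 1
            )
    # Trailing end gaps are free too: take the best cell on the last
    # row or last column instead of only the bottom-right corner.
    last_row = max(S[rows - 1])
    last_col = max(row[cols - 1] for row in S)
    return max(last_row, last_col)

print(align_score("CTATATAAGG", "CGTATAAT"))
```

The table has (N+1) x (M+1) cells and each is filled in constant time, which is where the time proportional to the product of the two sequence lengths comes from.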
What about proteins: conservative replacements and similarity as opposed to identity? The nitrogenous bases are either the same or they’re not, but amino acids can be similar, genetically, evolutionarily, and structurally!
The BLOSUM62 table (Henikoff and Henikoff, 1992).
Identity values range from 4 to 11, some similarities are as high as 3, and negative values for those substitutions that rarely occur go as low as -4. The most conserved residue is tryptophan with a score of 11; cysteine is next with a score of 9; both proline and tyrosine get scores of 7 for identity.
[Table residue: BLOSUM62 matrix, rows and columns labeled A B C D E F G H I K L M N P Q R S T V W X Y Z]
. . . actually follows the ‘Extreme Value distribution’ (http://mathworld.wolfram.com/ExtremeValueDistribution.html).
The Expectation Value!
The higher the E value is, the more probable that the observed match is due to chance in a search of the same size database, and the lower its Z score will be, i.e. is NOT significant. Therefore, the smaller the E value, i.e. the closer it is to zero, the more significant it is and the higher its Z score will be! The E value is the number that really matters. In other words, in order to assess whether a given alignment constitutes evidence for homology, it helps to know how strong an alignment can be expected from chance alone.
Rules of thumb for a protein search:
The Z score represents the number of standard deviations some particular alignment is from a distribution of random alignments (often the Normal distribution).
Z scores very roughly correspond to the listed E values (based on the Extreme Value distribution) for a typical protein sequence similarity search through a database with ~250,000 protein entries.
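A minimal sketch of a Monte Carlo style Z score may help here. It assumes a toy scoring function (`identity_score`, a hypothetical stand-in that just counts identical positions in an ungapped overlay) rather than a real alignment score, and it builds the random distribution by shuffling the subject sequence. The function names, the shuffle count, and the seed are illustrative choices, not part of any particular package.

```python
import random
import statistics

def z_score(observed, random_scores):
    """Standard deviations between the observed score and the mean random score."""
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)  # sample standard deviation
    if sd == 0:
        raise ValueError("random scores show no spread; cannot compute a Z score")
    return (observed - mean) / sd

def identity_score(a, b):
    """Toy score (assumption): identical positions in an ungapped overlay."""
    return sum(x == y for x, y in zip(a, b))

def monte_carlo_z(query, subject, score=identity_score, shuffles=100, seed=0):
    """Score the real pair, then rescore against shuffled copies of the subject."""
    rng = random.Random(seed)
    observed = score(query, subject)
    residues = list(subject)
    random_scores = []
    for _ in range(shuffles):
        rng.shuffle(residues)  # same composition, randomized order
        random_scores.append(score(query, "".join(residues)))
    return z_score(observed, random_scores)
```

Shuffling keeps the amino acid composition fixed, so the Z score reflects the order of the residues rather than their frequencies.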
On to the searches:
How can you search the databases for similar sequences, if pairwise alignments take N² time?!
Significance and heuristics . . .
Database searching programs use the two concepts of dynamic programming and substitution scoring matrices; however, dynamic programming takes far too long when used against most sequence databases with a ‘normal’ computer. Remember how big the databases are!
Therefore, the programs use tricks to make things happen faster. These tricks fall into two main categories: hashing and approximation.
Corn beef hash? Huh . . .
Hashing is the process of breaking your sequence into small ‘words’ or ‘k-tuples’ (think all chopped up, just like corn beef hash) of a set size and creating a ‘look-up’ table with those words keyed to position numbers. Computers can deal with numbers way faster than they can deal with strings of letters, and this preprocessing step happens very quickly.
Then, when any of the words matches part of an entry in the database, the position of that match, the ‘offset,’ is saved. In general, hashing reduces the complexity of the search problem from N² for dynamic programming to N, the length of all the sequences in the database.
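The look-up table idea can be sketched in a few lines. This is only an illustration under simple assumptions: exact word matches, a word size of three, and hypothetical function names (`build_lookup`, `find_offsets`); real search programs use far more compact encodings.

```python
from collections import defaultdict

def build_lookup(query, k=3):
    """Key every k-tuple ('word') in the query to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i)
    return table

def find_offsets(table, subject, k=3):
    """Scan a database entry once; save (query_pos, subject_pos) for each word match."""
    hits = []
    for j in range(len(subject) - k + 1):
        for i in table.get(subject[j:j + k], ()):
            hits.append((i, j))
    return hits

table = build_lookup("MKTAYIAK")
print(find_offsets(table, "GTAYIAKGG"))
# prints [(2, 1), (3, 2), (4, 3), (5, 4)]
```

The database entry is read only once, left to right, which is where the drop from N² to N comes from.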
OK. Heuristics . . . What’s that?
Approximation techniques are collectively known as ‘heuristics.’ Webster’s defines heuristic as “serving to guide, discover, or reveal; . . . but unproved or incapable of proof.”
In database similarity searching techniques the heuristic usually restricts the necessary search space by calculating some sort of a statistic that allows the program to decide whether further scrutiny of a particular match should be pursued. This statistic may miss things depending on the parameters set; that’s what makes it heuristic. ‘Worthwhile’ results at the end are compiled and the longest alignment within the program’s restrictions is created.
The exact implementation varies between the different programs, but the basic idea is the same in almost all of them.
Two predominant versions exist: BLAST and Fast
Both return local alignments, and neither is a single program; each is a family of programs with implementations designed to compare a sequence to a database in about every which way imaginable.
These include:
1) a DNA sequence against a DNA database (not recommended unless forced to do so because you are dealing with a non-translated region of the genome; DNA is just too darn noisy, only identity & four bases!),
2) a translated (where the translation is done ‘on-the-fly’ in all six frames) version of a DNA sequence against a translated (‘on-the-fly’ six-frame) version of the DNA database (not available in the Fast package),
3) a translated (‘on-the-fly’ six-frame) version of a DNA sequence against a protein database,
4) a protein sequence against a translated (‘on-the-fly’ six-frame) version of a DNA database,
5) or a protein sequence against a protein database.
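To make ‘on-the-fly’ six-frame translation concrete, here is a small sketch. It assumes the standard genetic code laid out in the usual T, C, A, G order, plain uppercase or lowercase A/C/G/T input, and hypothetical helper names (`translate`, `six_frames`); it is not how BLAST or Fast implement the step internally.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate one frame; a trailing partial codon is dropped, unknown codons become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Three forward frames followed by three frames of the reverse complement."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:])
            for strand in (dna, reverse)
            for offset in range(3)]

print(six_frames("ATGGCCTAA"))
# prints ['MA*', 'WP', 'GL', 'LGH', '*A', 'RP']
```

Each of the six peptide strings can then be compared at the protein level, which is the point of the translated search modes listed above.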
2) Pre-filters repeat and “low complexity” sequence regions;
4) Can find more than one region of gapped similarity;
5) Very fast heuristic and parallel implementation;
6) Restricted to precompiled, specially formatted databases;
FastA, and its family of relatives, developed by Bill Pearson at the University of Virginia.
1) Works well for DNA against DNA searches (within limits of possible sensitivity);
2) Can find only one gapped region of similarity;
3) Relatively slow, should often be run in the background;
4) Does not require specially prepared, preformatted databases.
The algorithms, in brief:
BLAST:
Two word hits on the same diagonal above some similarity threshold trigger ungapped extension until the score isn’t improved enough above another threshold: the HSP.
Initiate gapped extensions using dynamic programming for those HSPs above a third threshold up to the point where the score starts to drop below a fourth threshold: yields alignment.
Fast:
Find all ungapped exact word hits; maximize the ten best continuous regions’ scores: init1.
Combine non-overlapping init regions on different diagonals: initn.
Use dynamic programming ‘in a band’ for all regions with initn scores better than some threshold: opt score.
BLAST, the algorithm in more detail:
1) After BLAST has sorted its lookup table, it tries to find all double word hits along the same diagonal within some specified distance using what NCBI calls a Deterministic Finite Automaton (DFA). These word hits of size W do not have to be identical; rather, they have to be better than some threshold value T. To identify these double word hits, the DFA scans through all strings of words (typically W=3 for peptides) that score at least T (usually 11 for peptides).
2) Each double word hit that passes this step then triggers a process called un-gapped extension in both directions, such that each diagonal is extended as far as it can, until the running score starts to drop below a pre-defined value X within a certain range A. The result of this pass is called a High-Scoring Segment Pair or HSP.
3) Those HSPs that pass this step with a score better than S then begin a gapped extension step utilizing dynamic programming. Those gapped alignments with Expectation values better than the user-specified cutoff are reported. The extreme value distribution of BLAST Expectation values is precomputed against each precompiled database; this is one area that speeds up the algorithm considerably.
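The un-gapped extension in step 2 can be sketched as follows. This is a simplified illustration, not NCBI's implementation: it assumes a flat match/mismatch score (`score_pair`, a hypothetical stand-in for BLOSUM62), starts from a single exact seed word rather than a double hit, and stops extending once the running score has fallen more than `x_drop` below the best score seen so far.

```python
def score_pair(a, b, match=2, mismatch=-3):
    """Toy residue score (assumption); a real search would use BLOSUM62."""
    return match if a == b else mismatch

def extend(query, subject, qi, si, step, x_drop):
    """Extend along one diagonal in one direction; return (best gain, length at best)."""
    best = running = 0
    best_len = length = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        running += score_pair(query[qi], subject[si])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break  # score has dropped too far below the best seen
        qi += step
        si += step
    return best, best_len

def ungapped_hsp(query, subject, q_start, s_start, word=3, x_drop=5):
    """Return (score, query_start, query_end) of the segment grown from a seed word."""
    seed = sum(score_pair(query[q_start + k], subject[s_start + k])
               for k in range(word))
    right, r_len = extend(query, subject, q_start + word, s_start + word, 1, x_drop)
    left, l_len = extend(query, subject, q_start - 1, s_start - 1, -1, x_drop)
    return seed + left + right, q_start - l_len, q_start + word + r_len

print(ungapped_hsp("AAAGGGCCCTTT", "TTTGGGCCCAAA", 3, 3))
# prints (12, 3, 9)
```

In the example the seed `GGG` grows rightward through `CCC` and then stops, because two mismatches in a row push the running score more than `x_drop` below its best value.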
The BLAST algorithm, continued:
The math generalizes thus: for any two sequences of length m and n, local, best alignments are identified as HSPs. HSPs are stretches of sequence pairs that cannot be further improved by extension or trimming, as described above. For ungapped alignments, the number of expected HSPs with a score of at least S is given by the formula:
$$E = K m n \, e^{-\lambda S}$$
This is the E-value for the score S. In a database search n is the size of the database in residues, so N = mn is the search space size. K and λ are supplied by statistical theory, and, as mentioned above, can be calculated by comparison to precomputed, simulated distributions. These two parameters define the statistical significance of an E-value.
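The formula is easy to try out numerically. In the sketch below the default values of `K` and `lam` are placeholders of the size BLAST typically reports for ungapped BLOSUM62 scoring; treat them as assumptions, since the real values depend on the scoring system in use. The function name and the example lengths are likewise hypothetical.

```python
import math

def expect_value(score, query_len, db_len, K=0.134, lam=0.318):
    """E = K * m * n * exp(-lambda * S) for an ungapped HSP of raw score S."""
    return K * query_len * db_len * math.exp(-lam * score)

# A 250-residue query against a database of 100 million residues (made-up sizes).
for s in (40, 60, 80):
    print(s, expect_value(s, 250, 100_000_000))
```

Two properties are worth noticing: E falls off exponentially as the score rises, and it scales linearly with the search space, so the same score is less significant in a larger database.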
The Fast algorithm, in more detail:
Fast is an older algorithm than BLAST. The original Fast paper came out in 1988, based on David Lipman’s work in a 1983 paper; the original BLAST paper was published in 1990. Both algorithms have been upgraded substantially since originally released.
Fast was the first widely used, powerful sequence database searching algorithm. Bill Pearson continually refines the programs such that they remain a viable alternative to BLAST, especially if one is restricted to searching DNA against DNA without translation. They are also very helpful in situations where BLAST finds no significant alignments; arguably, Fast may be more sensitive than BLAST in these situations.
Fast is also a hashing style algorithm and builds words of a set k-tuple size, by default two for peptides. It then identifies all exact word matches between the sequence and the database members. Note that the word matches must be exact for Fast and only similar, above some threshold, for BLAST.
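As a rough sketch of how exact word matches lead to diagonals, the snippet below counts k-tuple hits per diagonal (query position minus subject position). It assumes the peptide default word size of two and simply counts hits; the function name is hypothetical, and the real programs go on to score and trim these regions.

```python
from collections import Counter

def diagonal_hits(query, subject, k=2):
    """Count exact k-tuple matches on each diagonal (query_pos - subject_pos)."""
    positions = {}
    for i in range(len(query) - k + 1):
        positions.setdefault(query[i:i + k], []).append(i)
    diagonals = Counter()
    for j in range(len(subject) - k + 1):
        for i in positions.get(subject[j:j + k], ()):
            diagonals[i - j] += 1
    return diagonals

print(diagonal_hits("MKTAYIAK", "GTAYIAKGG").most_common(1))
# prints [(1, 5)]
```

A diagonal that collects many hits marks a run of ungapped similarity, which is the kind of region the next steps rescore.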
The Fast algorithm, continued:
From these exact word matches:
1) Scores are assigned to each continuous, ungapped diagonal by adding all of the exact match BLOSUM values.
2) The ten highest scoring diagonals for each query-database pair are then rescored using BLOSUM similarities as well as identities, and ends are trimmed to maximize the score. The best of each of these is called the Init1 score.
3) Next the program ‘looks’ around to see if nearby off-diagonal Init1 alignments can be combined by incorporating gaps. If so, a new score, Initn, is calculated by summing up all the contributing Init1 scores, penalizing gaps with a penalty for each.
4) The program then constructs an optimal local alignment for all Initn pairs with scores better than some set threshold using a variation of dynamic programming “in a band.” A sixteen residue band centered at the highest Init1 region is used by default with peptides. The score generated from this step is called opt.
The Fast algorithm, still continued:
5) Next, Fast uses a simple linear regression against the natural log of the search set sequence length to calculate a normalized z-score for the sequence pair. Note that this is not the same Monte Carlo style Z score described earlier, and cannot be directly compared to one.
6) Finally, it compares the distribution of these z-scores to the actual extreme-value distribution of the search. Using this distribution, the program estimates the number of sequences that would be expected to have, purely by chance, a z-score greater than or equal to the z-score obtained in the search. This is reported as the Expectation value.
7) If the user requests pair-wise alignments in the output, then the program uses full Smith-Waterman local dynamic programming, not ‘restricted to a band,’ to produce its final alignments.
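For reference, the full Smith-Waterman local dynamic programming mentioned in step 7 can be sketched as below. This version only returns the best local score, and it assumes flat match/mismatch values and a linear gap penalty instead of BLOSUM scores and affine gaps; those parameter choices are illustrative and are not Fast's defaults.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so no cell goes below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("GATTACA", "GATACA"))
# prints 10
```

Every cell of the matrix is filled here, which is exactly the cost the banded step 4 avoids by only looking near the best diagonal.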
What’s the deal with DNA versus protein for searches and alignment?
All database similarity searching and sequence alignment, regardless of the algorithm used, is far more sensitive at the amino acid level than at the DNA level. This is because proteins have twenty match criteria versus DNA’s four, and those four DNA bases can generally only be identical, not similar, to each other; and many DNA base changes (especially third position changes) do not change the encoded protein.
All of these factors drastically increase the ‘noise’ level of a DNA against DNA search, and give protein searches a much greater ‘look-back’ time, at least doubling it.
Therefore, whenever dealing with coding sequence, it is always prudent to search at the protein level!
So what; why even bother?
More data yields stronger analyses, as long as it is done carefully!
Mosaic ideas and evolutionary ‘importance.’
Applications:
Probe, primer, and motif design;
All right, how do you do it?
On to multiple sequence alignment & analysis:
Dynamic programming’s complexity increases exponentially with the number of sequences being compared:
N-dimensional matrix . . . .
$$\text{complexity} = [\text{sequence length}]^{\text{number of sequences}}$$
i.e. complexity is $O(e^{n})$
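A quick bit of arithmetic shows how fast this grows. The sketch below just evaluates the formula above for a made-up sequence length of 300 residues; the function name is hypothetical, and a real matrix has one extra row and column per dimension, which does not change the picture.

```python
def dp_cells(sequence_length, number_of_sequences):
    """Matrix cells needed by full dynamic programming: length ** number of sequences."""
    return sequence_length ** number_of_sequences

for k in (2, 3, 4, 5):
    print(k, dp_cells(300, k))
```

Two sequences need 90,000 cells and three need 27,000,000; each added sequence multiplies the work by another factor of 300.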
Use different types of ‘tricks.’ See:
MSA (‘global’ within ‘bounding box’) and
incredibly important, especially with sequences that have areas of high and low similarity
Conclusions:
There’s a bewildering assortment of bioinformatics databases and ways to access and manipulate the information within them. The key is to learn how to use the data and the methods in the most efficient manner! The better you understand the chemical, physical, and biological systems involved, the better your chance of success in analyzing them. Certain strategies are inherently more appropriate than others in certain circumstances. Making these types of subjective, discriminatory decisions is one of the most important ‘take-home’ messages I can offer!
Gunnar von Heijne in his old but incredibly readable treatise, Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit (1987), provides a very appropriate conclusion:
“Think about what you’re doing; use your knowledge of the molecular system involved to guide both your interpretation of results and your direction of inquiry; use as much information as possible; and do not blindly accept everything the computer offers you.”
“. . . if any lesson is to be drawn . . . it surely is that to be able to make a useful contribution one must first and foremost be a biologist, and only second a theoretician . . . . We have to develop better algorithms, we have to find ways to cope with the massive amounts of data, and above all we have to become better biologists. But that’s all it takes.”
References:
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990) Basic Local Alignment Search Tool. Journal of Molecular Biology 215, 403–410.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a New Generation of Protein Database Search Programs. Nucleic Acids Research 25, 3389–3402.
Bailey, T.L. and Elkan, C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers, in Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, California, U.S.A. pp. 28–36.
Bairoch, A. (1992) PROSITE: A Dictionary of Sites and Patterns in Proteins. Nucleic Acids Research 20, 2013–2018.
Eddy, S.R. (1996) Hidden Markov models. Current Opinion in Structural Biology 6, 361–365.
Eddy, S.R. (1998) Profile hidden Markov models. Bioinformatics 14, 755–763.
Felsenstein, J. (1993) PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Dept. of Genetics, University of Washington, Seattle, Washington, U.S.A.
Feng, D.F. and Doolittle, R.F. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. Journal of Molecular Evolution 25, 351–360.
Genetics Computer Group (GCG) (Copyright 1982-2007) Program Manual for the Wisconsin Package, Version 10, Accelrys, Inc. A Pharmacopeia Company, San Diego, California, U.S.A.
Gilbert, D.G. (1993 [C release] and 1999 [Java release]) ReadSeq, public domain software distributed by the author. http://iubio.bio.indiana.edu/soft/molbio/readseq/ Bioinformatics Group, Biology Department, Indiana University, Bloomington, Indiana, U.S.A.
Gribskov, M. and Devereux, J., editors (1992) Sequence Analysis Primer. W.H. Freeman and Company, New York, New York, U.S.A.
Gribskov, M., McLachlan, M., and Eisenberg, D. (1987) Profile analysis: detection of distantly related proteins. Proceedings of the National Academy of Sciences U.S.A. 84, 4355–4358.
Gupta, S.K., Kececioglu, J.D., and Schaffer, A.A. (1995) Improving the practical space and time efficiency of the shortest-paths approach to sum-of-pairs multiple sequence alignment. Journal of Computational Biology 2, 459–472.
Henikoff, S. and Henikoff, J.G. (1992) Amino Acid Substitution Matrices from Protein Blocks. Proceedings of the National Academy of Sciences U.S.A. 89, 10915–10919.
Needleman, S.B. and Wunsch, C.D. (1970) A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. Journal of Molecular Biology 48, 443–453.
Pearson, W.R. and Lipman, D.J. (1988) Improved Tools for Biological Sequence Analysis. Proceedings of the National Academy of Sciences U.S.A. 85, 2444–2448.
Schwartz, R.M. and Dayhoff, M.O. (1979) Matrices for Detecting Distant Relationships. In Atlas of Protein Sequences and Structure (M.O. Dayhoff, editor) 5, Suppl. 3, 353–358, National Biomedical Research Foundation, Washington D.C., U.S.A.
Smith, R.F. and Smith, T.F. (1992) Pattern-induced multi-sequence alignment (PIMA) algorithm employing secondary structure-dependent gap penalties for comparative protein modelling. Protein Engineering 5, 35–41.
Smith, T.F. and Waterman, M.S. (1981) Comparison of Bio-Sequences. Advances in Applied Mathematics 2, 482–489.
Swofford, D.L., PAUP* (Phylogenetic Analysis Using Parsimony and other methods) version 4.0+ (1989–2007) Florida State University, Tallahassee, Florida, U.S.A. http://paup.csit.fsu.edu/ distributed through Sinauer Associates, Inc. http://www.sinauer.com/ Sunderland, Massachusetts, U.S.A.
Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F., and Higgins, D.G. (1997) The ClustalX windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Research 25, 4876–4882.
Thompson, J.D., Higgins, D.G., and Gibson, T.J. (1994) CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22, 4673–4680.
Wilbur, W.J. and Lipman, D.J. (1983) Rapid Similarity Searches of Nucleic Acid and Protein Data Banks. Proceedings of the National Academy of