A BioInformatics Survey . . . some taste of theory, and a few practicalities

Steve Thompson

Florida State University School of Computational Science (SCS)

BCH 5405

Molecular Biology & Biotechnology

Dr. Qing-Xiang (Amy) Sang

Mon. & Wed., March 24 & 26, 2008

To begin, some terminology:

What is bioinformatics, genomics, proteomics, sequence analysis, computational molecular biology . . . ?

My definitions, lots of overlap:

Biocomputing and computational biology are synonyms and describe the use of computers and computational techniques to analyze any type of biological system, from individual molecules to organisms to overall ecology.

Bioinformatics describes using computational techniques to access, analyze, and interpret the biological information in any type of biological database.

Sequence analysis is the study of molecular sequence data for the purpose of inferring the function, interactions, evolution, and perhaps structure of biological molecules.

Genomics analyzes the context of genes or complete genomes (the total DNA content of an organism) within the same and/or across different genomes.

Proteomics is the subdivision of genomics concerned with analyzing the complete protein complement, i.e. the proteome, of organisms, both within and between different organisms.

And one way to think about it: the Reverse Biochemistry Analogy

Biochemists no longer have to begin a research project by isolating and purifying massive amounts of a protein from its native organism in order to characterize a particular gene product. Rather, now scientists can amplify a section of some genome based on its similarity to other genomes, sequence that piece of DNA and, using sequence analysis tools, infer all sorts of functional, evolutionary, and, perhaps, structural insight into that stretch of DNA!

The computer and molecular databases are a necessary, integral part of this entire process.

The exponential growth of molecular sequence databases

Year      BasePairs    Sequences
1982         680338          606
1983        2274029         2427
1984        3368765         4175
1985        5204420         5700
1986        9615371         9978
1987       15514776        14584
1988       23800000        20579
1989       34762585        28791
1990       49179285        39533
1991       71947426        55627
1992      101008486        78608
1993      157152442       143492
1994      217102462       215273
1995      384939485       555694
1996      651972984      1021211
1997     1160300687      1765847
1998     2008761784      2837897
1999     3841163011      4864570
2000    11101066288     10106023
2001    15849921438     14976310
2002    28507990166     22318883
2003    36553368485     30968418
2004    44575745176     40604319
2005    56037734462     52016762
2006    69019290705     64893747
2007    83874179730     80388382

& CPU power: doubling time about a year and a half!


http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
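The "year and a half" figure can be checked against the GenBank numbers themselves. The following is a minimal sketch, assuming smooth exponential growth between the first (1982) and last (2007) base-pair counts listed for GenBank; the function name is made up for illustration.

```python
import math

def doubling_time_years(year0, size0, year1, size1):
    """Average doubling time, assuming steady exponential growth."""
    doublings = math.log2(size1 / size0)  # how many times the size doubled
    return (year1 - year0) / doublings

# Base-pair totals for 1982 and 2007 from the GenBank statistics
estimate = doubling_time_years(1982, 680338, 2007, 83874179730)
print(f"doubling time: about {estimate:.1f} years")
```

Using only the two endpoints hides year-to-year variation, but it is enough to show that the long-run average is close to 18 months.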

Sequence database growth (cont.):

The International Human Genome Sequencing Consortium announced the completion of the "Working Draft" of the human genome in June 2000; independently that same month, the private company Celera Genomics announced that it had completed the first “Assembly” of the human genome. The classic articles were published mid-February 2001 in the journals Science and Nature.

Genome projects have kept the data coming at an incredible rate. Currently around 50 Archaea, 600 Bacteria, and 20 Eukaryote complete genomes, and 200 Eukaryote assemblies are represented, not counting the almost 3,000 virus and viroid genomes available.

Some neat stuff from the human genome papers:

Homo sapiens aren’t nearly as special as we once thought. Of the 3.2 billion base pairs in our DNA:

Traditional gene number estimates were often in the 100,000 range; it turns out we’ve only got about twice as many as a fruit fly, between 25,000 and 30,000!

The protein coding region of the genome is only about 1% or so; a bunch of the remainder is ‘jumping,’ ‘junk,’ or ‘selfish DNA,’ much of which may be involved in regulation and control.

Some 100-200 genes were transferred from an ancestral bacterial genome to an ancestral vertebrate genome! (Later shown to be false by more extensive analyses, and to be due to gene loss, not transfer.)

NCBI’s Entrez

Sequence databases are an organized way to store exponentially accumulating sequence data. An ‘alphabet soup’ of three major organizations maintains them. They largely ‘mirror’ one another and share accession codes, but NOT proper identifier names:

North America: the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM), at the National Institutes of Health (NIH), maintains the GenBank (& WGS) nucleotide, GenPept amino acid, and RefSeq genome, transcriptome, and proteome databases.

Europe: the European Molecular Biology Laboratory (EMBL), the European Bioinformatics Institute (EBI), and the Swiss Institute of Bioinformatics (SIB) all help maintain the EMBL nucleotide sequence database, and the UNIPROT (SWISS-PROT + TrEMBL) amino acid sequence database (with USA PIR/NBRF support also).

Asia: The National Institute of Genetics (NIG) supports the Center for Information Biology’s (CIB) DNA Data Bank of Japan (DDBJ).

Let’s start with sequence databases:

A little history:

The first well recognized sequence database was Dr. Margaret Dayhoff’s hardbound Atlas of Protein Sequence and Structure, begun in the mid-sixties. That became PIR. DDBJ began in 1984, GenBank in 1982, and EMBL in 1980. They are all attempts at establishing an organized, reliable, comprehensive, and openly available library of genetic sequences.

Sequence databases have long since outgrown a hardbound atlas that you can pull off of a library shelf. They have become gargantuan and have evolved through many, many changes.

What are sequence databases like?

Just what are primary sequences?

(Central Dogma: DNA -> RNA -> protein)

Primary refers to one dimension: all of the ‘symbol’ information written in sequential order necessary to specify a particular biological molecular entity, be it polypeptide or nucleotide.

The symbols are the one letter codes for all of the biological nitrogenous bases and amino acid residues and their ambiguity codes. Biological carbohydrates, lipids, and structural and functional information are not sequence data. Not even DNA CDS translations in a DNA database are sequence data!

However, much of this feature and bibliographic type information is available in the reference documentation sections associated with primary sequences in the databases.
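To make the idea of a fixed symbol alphabet concrete, here is a small sketch that checks a nucleotide string against the standard IUPAC one-letter and ambiguity codes. The function name is hypothetical, and gap or missing-data symbols are deliberately treated as invalid here.

```python
# IUPAC nucleotide codes: the four bases (plus U for RNA) and the ambiguity
# codes, e.g. R = A or G, Y = C or T, N = any base.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")

def invalid_symbols(seq, alphabet=IUPAC_NUCLEOTIDES):
    """Return the sorted list of distinct symbols in seq outside the alphabet."""
    return sorted({ch for ch in seq.upper() if ch not in alphabet})

print(invalid_symbols("acgtnRY"))  # nothing outside the alphabet
print(invalid_symbols("ACGJ*"))    # J and * are not nucleotide codes
```

The same function works for proteins if a one-letter amino acid alphabet is passed as the `alphabet` argument.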

Sequence database installations are commonly a complex ASCII/Binary mix, and Web-based ones are often relational or Object Oriented. They usually consist of several very long text files each containing different types of related information, such as all of the sequences themselves, versus all of the title lines, or all of the reference sections. Binary files often help ‘glue together’ all of these other files by providing indexing functions.

Software is required to successfully interact with these databases, and access is most easily handled through various software packages and interfaces, on the World Wide Web or otherwise.
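The indexing idea can be illustrated with a toy example. This is only a sketch under loud assumptions: it uses a FASTA-style flat file held in memory, a plain dictionary of byte offsets instead of a binary index file, and made-up entry names. Real database installations are far more elaborate.

```python
import io

def build_index(handle):
    """Map each entry name to the byte offset of its '>' header line."""
    index = {}
    while True:
        position = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            name = line[1:].split()[0].decode()
            index[name] = position
    return index

def fetch(handle, index, name):
    """Jump straight to one entry and return its sequence."""
    handle.seek(index[name])
    handle.readline()  # skip the header line
    parts = []
    for line in handle:
        if line.startswith(b">"):
            break
        parts.append(line.strip().decode())
    return "".join(parts)

# Hypothetical two-entry flat file
flat_file = io.BytesIO(b">seq1 toy entry\nACGT\nAC\n>seq2 toy entry\nGGCC\n")
index = build_index(flat_file)
print(fetch(flat_file, index, "seq2"))
```

The point of the index is that `fetch` never scans the whole file; it seeks directly to the stored offset.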

Sequence database content:

Sequence database organization:

Nucleic Acid DB’s: GenBank/EMBL/DDBJ (all Taxonomic categories + WGS, HTC & HTG + STS, EST, & GSS, a.k.a. “Tags”)

Amino Acid DB’s: UNIPROT = SWISS-PROT + TrEMBL (with help from PIR); GenPept

Nucleic acid sequence databases are split into subdivisions based on taxonomy and data type. TrEMBL sequences are merged into SWISS-PROT as they receive increased levels of annotation. Both together comprise UNIPROT. GenPept has minimal annotation.

Important elements associated with each sequence entry:

Name: LOCUS, ENTRY, ID, all are unique identifiers.

Definition: a.k.a. title, a brief textual sequence description.

Accession Number: a constant data identifier.

Source and taxonomy information; complete literature references; comments and keywords; and the all important FEATURE table!

A summary or checksum line, and the sequence itself.

But: Each major database, as well as each major suite of software tools, has its own distinct format requirements. Changes over the years are a huge hassle. Standards are argued, e.g. XML, but unfortunately, until all biologists and computer scientists worldwide agree on one standard, and all software is (re)written to that standard, neither of which is likely to happen very quickly, if ever, format issues will remain one of the most confusing and troubling aspects of working with sequence data. Specialized format conversion tools expedite the chore, but becoming familiar with some of the common formats helps a lot.
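As an illustration of what a format conversion tool does, here is a minimal sketch that turns a GenBank-style flat-file entry into FASTA. The record is a hypothetical toy, not a real database entry, and the parser assumes single-line DEFINITION and ACCESSION fields; production converters handle many more cases.

```python
# Hypothetical toy record laid out in GenBank flat-file style
TOY_RECORD = """\
LOCUS       TOY0001      12 bp    DNA
DEFINITION  Hypothetical toy record for illustration only.
ACCESSION   TOY0001
ORIGIN
        1 acgtacgtac gt
//
"""

def genbank_to_fasta(text):
    """Convert one GenBank-style entry to a FASTA string (simplified)."""
    accession, definition, bases, in_sequence = "", "", [], False
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-entry marker
            break
        if in_sequence:                    # drop position numbers and spaces
            bases.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ACCESSION"):
            accession = line.split()[1]
        elif line.startswith("DEFINITION"):
            definition = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):    # sequence lines follow
            in_sequence = True
    return f">{accession} {definition}\n{''.join(bases)}\n"

print(genbank_to_fasta(TOY_RECORD), end="")
```

Note how much is thrown away: the FASTA title line keeps only the accession and definition, while references and the feature table have no place in that format.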

Parts and problems:

More format complications:

Indels and missing data symbols (i.e. gaps) designation discrepancy headaches: ., -, ~, ?, N, or X . . . . . Help!
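One common chore is bringing gap symbols to a single convention before handing an alignment to another program. The sketch below is only one possible policy and is an assumption, not a standard: it maps '.' and '~' to '-', and leaves '?', 'N', and 'X' untouched because those can also be legitimate missing-data or ambiguity codes.

```python
# Assumed policy: treat '.' and '~' as aliases for the '-' gap symbol
GAP_ALIASES = {".": "-", "~": "-"}

def normalize_gaps(aligned_sequence):
    """Rewrite gap aliases to '-' and leave every other symbol as-is."""
    return "".join(GAP_ALIASES.get(ch, ch) for ch in aligned_sequence)

print(normalize_gaps("cg.TA~Aa?N"))
```

Which symbols mean "gap" versus "unknown residue" depends on the program that wrote the file, so the mapping should be checked against that program's documentation.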

Specialized ‘sequence’-type databases:

Databases that contain special types of sequence information, such as patterns, motifs, and profiles. These include: REBASE, EPD, PROSITE, BLOCKS, ProDom, Pfam . . . .

Databases that contain multiple sequence entries aligned, e.g. PopSet, RDP and ALN.

Databases that contain families of sequences ordered functionally, structurally, or phylogenetically, e.g. iProClass and HOVERGEN.

Databases of species-specific sequences, e.g. the HIV Database and the Giardia lamblia Genome Project.

And on and on . . . . See Amos Bairoch’s excellent links page: http://us.expasy.org/alinks.html.

What about other types of biological databases? Three-dimensional structure databases —

the Protein Data Bank and Rutgers Nucleic Acid Database.

And see Molecules to Go at http://molbio.info.nih.gov/cgi-bin/pdb/.

These databases contain all of the 3D atomic coordinate data necessary to define the tertiary shape of a particular biological molecule. The data is usually experimentally derived, either by X-ray crystallography or by NMR; sometimes it’s hypothetical. The source of the structure and its resolution are always given.

Secondary structure boundaries, sequence data, and reference information are often associated with the coordinate data, but it is the 3D data that really matters, not the annotation.

Molecular visualization or modeling software is required to interact with the data. It has little meaning on its own.
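To show what "3D atomic coordinate data" looks like in practice, here is a minimal sketch that pulls x, y, z out of one line in the fixed-column PDB text format, where the coordinates occupy columns 31-38, 39-46, and 47-54. The atom line is a made-up toy built to that layout, not taken from a real structure.

```python
def parse_atom_xyz(line):
    """Return (x, y, z) from a PDB ATOM/HETATM line, or None otherwise."""
    if not line.startswith(("ATOM", "HETATM")):
        return None
    # Fixed columns (1-based): x = 31-38, y = 39-46, z = 47-54
    return float(line[30:38]), float(line[38:46]), float(line[46:54])

# Hypothetical alpha-carbon record assembled field by field
toy_line = (
    "ATOM  " + f"{1:5d}" + " " + " CA " + " " + "ALA" + " " + "A"
    + f"{1:4d}" + "    " + f"{1.5:8.3f}{-2.25:8.3f}{10.0:8.3f}"
)

print(parse_atom_xyz(toy_line))
```

A list of such coordinate triples is all a viewer needs to draw the molecule, which is why the numbers have little meaning without visualization software.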

And still other types of bioinfo’ databases:

Consider these ‘non-molecular,’ but they often link to molecules:

Reference Databases (all w/ pointers to sequences): e.g.

LocusLink/Gene: integrated knowledge base

OMIM: Online Mendelian Inheritance in Man

PubMed/MedLine: over 11 million citations from more than 4 thousand bio/medical scientific journals.

Phylogenetic Tree Databases: e.g. the Tree of Life.

Metabolic Pathway Databases: e.g. WIT (What Is There), Japan’s GenomeNet KEGG (the Kyoto Encyclopedia of Genes and Genomes), and the human Reactome.

Population studies data: which strains, where, etc.

And then databases that many biocomputing people don’t even usually consider: e.g. GIS/GPS/remote sensing data, medical records, census counts, mortality and birth rates . . . .

Enter pairwise alignment, similarity searching, significance, and homology.

OK, given your own experimentally derived nucleotide or amino acid sequence, or one that you’ve found in a database, what more can we learn about its biological function?

First, just what is homology and similarity: are they the same?

Don’t confuse homology with similarity: there is a huge difference! Similarity is a statistic that describes how much two (sub)sequences are alike according to some set scoring criteria. It can be normalized to ascertain statistical significance, but it’s still just a number.

Homology, in contrast and by definition, implies an evolutionary relationship, more than just everything evolving from the same primordial ‘ooze.’ Reconstruct the phylogeny of the organisms or genes of interest to demonstrate homology. Better yet, show experimental evidence (structural, morphological, genetic, and/or fossil) that corroborates your claim.

There is no such thing as percent homology; something is either homologous or it is not. Walter Fitch said “homology is like pregnancy: you can’t be 45% pregnant, just like something can’t be 45% homologous. You either are or you are not.” Highly significant similarity can argue for homology, but not the inverse.

OK, so how can we see if two sequences are similar? First, to introduce the concept, a graphical method . . .

One way: dot matrices.

Provide a ‘Gestalt’ of all possible alignments between two sequences.

To begin: a very simple 0, 1 (no match, match) identity scoring function.

Put a dot wherever symbols match.

Identities and insertion/deletion events (indels) identified (zero:one match score matrix, no window).

Noise due to random composition effects contributes to confusion. To ‘clean up’ the plot, consider a filtered windowing approach. A dot is placed at the middle of a window if some ‘stringency’ is met within that defined window size. Then the window is shifted one position and the entire process is repeated (zero:one match score, window of size three and a stringency level of two out of three).
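Both flavors of dot plot described above can be sketched in a few lines of Python. This is an illustrative sketch, not any particular program's method: it assumes an odd window size, uses the zero:one match score, and leaves edge cells (where a full window does not fit) blank. The function names are made up.

```python
def dot_plot(seq_a, seq_b, window=1, stringency=1):
    """Boolean matrix: True where the window centered at (i, j) along the
    diagonal contains at least `stringency` identical positions.
    window=1, stringency=1 gives the plain identity plot."""
    half = window // 2  # assumes an odd window size
    plot = [[False] * len(seq_b) for _ in seq_a]
    for i in range(half, len(seq_a) - half):
        for j in range(half, len(seq_b) - half):
            matches = sum(
                seq_a[i + k] == seq_b[j + k] for k in range(-half, half + 1)
            )
            plot[i][j] = matches >= stringency
    return plot

def render(plot):
    """Draw the matrix as text, '*' for a dot and ' ' for no dot."""
    return "\n".join("".join("*" if cell else " " for cell in row) for row in plot)

a, b = "CTATATAAGG", "CGTATAAT"
print(render(dot_plot(a, b)))                          # unfiltered
print(render(dot_plot(a, b, window=3, stringency=2)))  # 2 out of 3
```

Raising the stringency or widening the window removes more isolated dots, at the cost of also hiding short genuine matches.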

We can compare one molecule against another by aligning them. However, a ‘brute force’ approach just won’t work. Even without considering the introduction of gaps, the computation required to compare all possible alignments between two sequences requires time proportional to the product of the lengths of the two sequences. Therefore, if the two sequences are approximately the same length (N), this is an $N^2$ problem. To include gaps, we would have to repeat the calculation 2N times to examine the possibility of gaps at each possible position within the sequences, now an $N^{4N}$ problem. There’s no way! We need an algorithm.
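The gap-free case can be written out directly, which makes the cost visible: every possible offset of one sequence against the other is tried, and each offset costs a pass over the shorter sequence. This is a minimal sketch with a hypothetical function name, using simple identity counting as the score.

```python
def best_ungapped_offset(seq_a, seq_b):
    """Slide seq_b along seq_a and return (offset, matches) for the
    placement with the most identities. offset is the position in seq_a
    that lines up with the first symbol of seq_b (it may be negative)."""
    best = (0, -1)
    for offset in range(-(len(seq_b) - 1), len(seq_a)):
        matches = sum(
            1
            for j, symbol in enumerate(seq_b)
            if 0 <= j + offset < len(seq_a) and seq_a[j + offset] == symbol
        )
        if matches > best[1]:
            best = (offset, matches)
    return best

print(best_ungapped_offset("CTATATAAGG", "CGTATAAT"))
```

The outer loop runs about len(seq_a) + len(seq_b) times and the inner sum runs len(seq_b) times, so the work grows with the product of the lengths. Allowing gaps at every position on top of this is what makes the naive approach hopeless.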

Exact alignment: but how can we ‘see’ the correspondence of individual residues?

But: just what the heck is an algorithm?

Merriam-Webster’s says: “A rule of procedure for solving a problem [often mathematical] that frequently involves repetition of an operation.”

So, you could write an algorithm for tying your shoe! It’s just a set of explicit instructions for doing some routine task.

Enter the Dynamic Programming Algorithm!

Computer scientists figured it out long ago; Needleman and Wunsch applied it to the alignment of the full lengths of two sequences in 1970. An optimal alignment is defined as an arrangement of two sequences, 1 of length i and 2 of length j, such that:

1) you maximize the number of matching symbols between 1 and 2;

2) you minimize the number of indels within 1 and 2; and

3) you minimize the number of mismatched symbols between 1 and 2.

Therefore, the actual solution can be represented by:

$$
S_{ij} = s_{ij} + \max
\begin{cases}
S_{i-1,\,j-1} & \text{or} \\
\max\limits_{2 \le x \le i} \left( S_{i-x,\,j-1} + w_{x-1} \right) & \text{or} \\
\max\limits_{2 \le y \le j} \left( S_{i-1,\,j-y} + w_{y-1} \right)
\end{cases}
$$

where $S_{ij}$ is the score for the alignment ending at $i$ in sequence 1 and $j$ in sequence 2, $s_{ij}$ is the score for aligning $i$ with $j$, $w_x$ is the score for making an $x$-long gap in sequence 1, and $w_y$ is the score for making a $y$-long gap in sequence 2, allowing gaps to be any length in either sequence.

An oversimplified path matrix example:

total penalty = gap opening penalty {zero here} + ([length of gap] × [gap extension penalty {one here}])
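The penalty formula is simple enough to state as a one-line function. This sketch uses the example's values (opening zero, extension one) as defaults; the function and parameter names are made up for illustration.

```python
def gap_penalty(length, opening=0, extension=1):
    """Total penalty for one gap: opening cost plus length times extension."""
    return opening + length * extension

print(gap_penalty(1))                          # a one-symbol gap in the example
print(gap_penalty(3, opening=2, extension=1))  # a stiffer, made-up scheme
```

With a nonzero opening penalty, one long gap costs less than several short gaps of the same total length, which is usually the more plausible biological event.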

Optimum Alignments:

There may be more than one best path through the matrix (and optimum doesn’t guarantee biologically correct). Starting at the top and working down, then tracing back, the two best trace-back routes define the following two alignments:

    cTATAtAagg          cTATAtAagg
    |  |||||     and       |||||
    cg.TAtAaT.          .cgTAtAaT.

With the example’s scoring scheme these alignments have a score of 5, the highest bottom-right score in the trace-back path graph: six matches minus one interior gap for the first, and five matches with no interior gap for the second. This is the number optimized by the algorithm, not any type of similarity or identity percentage, here 75% and 62% respectively! Software will report only one optimal solution.
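The two identity percentages can be reproduced from the alignments themselves. The slide does not say which denominator was used; the sketch below assumes identities are divided by the ungapped length of the shorter sequence, which gives the quoted figures (62.5% truncated to 62%). The function name is hypothetical.

```python
GAP_SYMBOLS = set(".-~")

def percent_identity(top, bottom):
    """Identical aligned positions as a percentage of the shorter
    sequence's ungapped length (an assumed convention)."""
    matches = sum(
        1
        for a, b in zip(top, bottom)
        if a not in GAP_SYMBOLS and b not in GAP_SYMBOLS and a.upper() == b.upper()
    )

    def ungapped_length(seq):
        return sum(1 for ch in seq if ch not in GAP_SYMBOLS)

    return 100.0 * matches / min(ungapped_length(top), ungapped_length(bottom))

print(percent_identity("cTATAtAagg", "cg.TAtAaT."))  # first alignment
print(percent_identity("cTATAtAagg", ".cgTAtAaT."))  # second alignment
```

Both alignments have the same algorithm score yet different identity percentages, which is exactly the warning in the text: the percentage is a by-product, not the quantity being optimized.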

This was a Needleman-Wunsch global solution. Smith-Waterman style local solutions use negative numbers in the match matrix and pick the best diagonal within the overall graph.
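A compact sketch of the dynamic programming fill step follows. It is not the exact matrix from the slide: it assumes the oversimplified scheme (match 1, mismatch 0, each gap position costs 1, no opening penalty), uses the common simplification for linear gap costs of looking only at the three neighboring cells, and assumes gaps at the ends of either sequence are free. Only the score is computed; the trace-back is omitted.

```python
def alignment_score(seq_a, seq_b, match=1, mismatch=0, gap=-1):
    """Best alignment score with linear gap costs and free end gaps."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # Row 0 and column 0 stay at zero: leading end gaps are not charged
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two symbols
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
    # Trailing end gaps are free: take the best cell in the last row or column
    return max(max(score[-1]), max(row[-1] for row in score))

print(alignment_score("CTATATAAGG", "CGTATAAT"))
```

Each cell is filled once from already-computed neighbors, so the work is proportional to the product of the sequence lengths, in contrast to enumerating every gapped alignment. For the two example sequences the score is 5, matching the text.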

What about proteins: conservative replacements and similarity as opposed to identity? The nitrogenous bases are either the same or they’re not, but amino acids can be similar, genetically, evolutionarily, and structurally!

The BLOSUM62 table (Henikoff and Henikoff, 1992).

Identity values range from 4 to 11, some similarities are as high as 3, and negative values for those substitutions that rarely occur go as low as –4. The most conserved residue is tryptophan with a score of 11; cysteine is next with a score of 9; both proline and tyrosine get scores of 7 for identity.

    A  B  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  X  Y  Z
A   4 -2  0 -2 -1 -2  0 -2 -1 -1 -1 -1 -2 -1 -1 -1  1  0  0 -3 -1 -2 -1
B  -2  6 -3  6  2 -3 -1 -1 -3 -1 -4 -3  1 -1  0 -2  0 -1 -3 -4 -1 -3  2
C   0 -3  9 -3 -4 -2 -3 -3 -1 -3 -1 -1 -3 -3 -3 -3 -1 -1 -1 -2 -1 -2 -4
D  -2  6 -3  6  2 -3 -1 -1 -3 -1 -4 -3  1 -1  0 -2  0 -1 -3 -4 -1 -3  2
E  -1  2 -4  2  5 -3 -2  0 -3  1 -3 -2  0 -1  2  0  0 -1 -2 -3 -1 -2  5
F  -2 -3 -2 -3 -3  6 -3 -1  0 -3  0  0 -3 -4 -3 -3 -2 -2 -1  1 -1  3 -3
G   0 -1 -3 -1 -2 -3  6 -2 -4 -2 -4 -3  0 -2 -2 -2  0 -2 -3 -2 -1 -3 -2
H  -2 -1 -3 -1  0 -1 -2  8 -3 -1 -3 -2  1 -2  0  0 -1 -2 -3 -2 -1  2  0
I  -1 -3 -1 -3 -3  0 -4 -3  4 -3  2  1 -3 -3 -3 -3 -2 -1  3 -3 -1 -1 -3
K  -1 -1 -3 -1  1 -3 -2 -1 -3  5 -2 -1  0 -1  1  2  0 -1 -2 -3 -1 -2  1
L  -1 -4 -1 -4 -3  0 -4 -3  2 -2  4  2 -3 -3 -2 -2 -2 -1  1 -2 -1 -1 -3
M  -1 -3 -1 -3 -2  0 -3 -2  1 -1  2  5 -2 -2  0 -1 -1 -1  1 -1 -1 -1 -2
N  -2  1 -3  1  0 -3  0  1 -3  0 -3 -2  6 -2  0  0  1  0 -3 -4 -1 -2  0
P  -1 -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2  7 -1 -2 -1 -1 -2 -4 -1 -3 -1
Q  -1  0 -3  0  2 -3 -2  0 -3  1 -2  0  0 -1  5  1  0 -1 -2 -2 -1 -1  2
R  -1 -2 -3 -2  0 -3 -2  0 -3  2 -2 -1  0 -2  1  5 -1 -1 -3 -3 -1 -2  0
S   1  0 -1  0  0 -2  0 -1 -2  0 -2 -1  1 -1  0 -1  4  1 -2 -3 -1 -2  0
T   0 -1 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1  0 -1 -1 -1  1  5  0 -2 -1 -2 -1
V   0 -3 -1 -3 -2 -1 -3 -3  3 -2  1  1 -3 -2 -2 -3 -2  0  4 -3 -1 -1 -2
W  -3 -4 -2 -4 -3  1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -2 -3 11 -1  2 -3
X  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
Y  -2 -3 -2 -3 -2  3 -3  2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1  2 -1  7 -2
Z  -1  2 -4  2  5 -3 -2  0 -3  1 -3 -2  0 -1  2  0  0 -1 -2 -3 -1 -2  5
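To see why similarity is not the same thing as identity, here is a small sketch that scores an ungapped peptide pair using four entries copied from the table above (W/W = 11, C/C = 9, I/V = 3, K/R = 2). The two peptides and the names SUBSET, pair_score, ungapped_score and percent_identity are invented for illustration; a real program would load the whole matrix.

```python
# Four entries taken from the table above; everything else is omitted on purpose.
SUBSET = {("W", "W"): 11, ("C", "C"): 9, ("I", "V"): 3, ("K", "R"): 2}


def pair_score(x, y, table=SUBSET):
    """Look up a substitution score; the matrix is symmetric."""
    return table[(x, y)] if (x, y) in table else table[(y, x)]


def ungapped_score(s, t):
    """Sum the substitution scores column by column (no gaps)."""
    return sum(pair_score(x, y) for x, y in zip(s, t))


def percent_identity(s, t):
    """Percentage of columns holding exactly the same residue."""
    return 100.0 * sum(x == y for x, y in zip(s, t)) / len(s)


total = ungapped_score("WCIK", "WCVR")
identity = percent_identity("WCIK", "WCVR")
print(total, identity)
```

Only half of the columns are identical, yet the conservative I/V and K/R replacements still add to the score instead of counting as plain mismatches.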

We can imagine screening databases for sequences similar to ours using the concepts of dynamic programming and substitution scoring matrices and some yet to be described algorithmic tricks. But what do database searches tell us; what can we gain from them?

Why even bother? Inference through homology is a fundamental principle of biology!

When a sequence is found to fall into a preexisting family we may be able to infer function, mechanism, evolution, perhaps even structure, based on homology with its neighbors. If no significant similarity can be found, the very fact that your sequence is new and different could be very important. Granted, its characterization may prove difficult, but it could be well worth it.

So, first, significance: when is any alignment worth anything biologically?

An old statistics trick, Monte Carlo simulations:

Z score = [(actual score) - (mean of randomized scores)] / (standard deviation of randomized score distribution)
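A rough sketch of the Monte Carlo idea: shuffle one sequence many times, rescore each shuffle, and measure how far the real score sits from the shuffled ones. The scoring function used here (most identities on any ungapped diagonal) is a toy stand-in chosen only to keep the example self-contained. The names best_diagonal_matches and monte_carlo_z, the trial count, the seed and the two peptides are all assumptions, not part of any particular package.

```python
import random
import statistics


def best_diagonal_matches(a, b):
    """Toy score: the most identities found on any ungapped diagonal."""
    best = 0
    for offset in range(-(len(b) - 1), len(a)):
        matches = sum(
            1
            for j in range(len(b))
            if 0 <= j + offset < len(a) and a[j + offset] == b[j]
        )
        best = max(best, matches)
    return best


def monte_carlo_z(a, b, score_fn, trials=200, seed=0):
    """Z score of the real alignment score against shuffled versions of b."""
    rng = random.Random(seed)
    actual = score_fn(a, b)
    letters = list(b)
    randomized = []
    for _ in range(trials):
        rng.shuffle(letters)  # same composition, scrambled order
        randomized.append(score_fn(a, "".join(letters)))
    mean = statistics.mean(randomized)
    spread = statistics.pstdev(randomized)
    if spread == 0:
        raise ValueError("randomized scores have zero spread")
    return (actual - mean) / spread


z = monte_carlo_z("MKVLAAGIVGLCAWRTPEHQ", "AAGIVGLCAW", best_diagonal_matches)
print(z)
```

Swapping in a real alignment scorer for score_fn is the only change needed to apply the same recipe to actual alignment scores.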

Independent of all that, what is a ‘good’ alignment?

The Normal distribution:

Many Z scores measure the distance from the mean using this simplistic Monte Carlo model assuming a Gaussian distribution, a.k.a. the Normal distribution (http://mathworld.wolfram.com/NormalDistribution.html), in spite of the fact that ‘sequence-space’ actually follows what is known as the ‘Extreme Value distribution.’

However, the Monte Carlo method does approximate significance estimates fairly well.

Histogram of search scores (columns: score, observed count, expected count; ‘=’ bars plot the observed counts, ‘*’ marks the expected count):

< 20    650      0:==
  22      0      0:
  24      3      0:=
  26     22      8:*
  28     98     87:*
  30    289    528:*
  32   1714   2042:===*
  34   5585   5539:=========*
  36  12495  11375:==================*==
  38  21957  18799:===============================*=====
  40  28875  26223:===========================================*====
  42  34153  32054:=====================================================*===
  44  35427  35359:==========================================================*
  46  36219  36014:===========================================================*
  48  33699  34479:======================================================== *
  50  30727  31462:=================================================== *
  52  27288  27661:=============================================*
  54  22538  23627:====================================== *
  56  18055  19736:============================== *
  58  14617  16203:========================= *
  60  12595  13125:=====================*
  62  10563  10522:=================*
  64   8626   8368:=============*=
  66   6426   6614:==========*
  68   4770   5203:========*
  70   4017   4077:======*
  72   2920   3186:=====*
  74   2448   2484:====*
  76   1696   1933:===*
  78   1178   1503:==*
  80    935   1167:=*
  82    722    893:=*
  84    454    707:=*
  86    438    547:*
  88    322    423:*
  90    257    328:*
  92    175    253:*
  94    210    196:*
  96    102    152:*
  98     63    117:*
 100     58     91:*
 102     40     70:*
 104     30     54:*
 106     17     42:*
 108     14     33:*
 110     14     25:*
 112     12     20:*
 114      9     15:*
 116      6     12:*
 118      8      9:*
>120   1030      7:*=

Based on this known statistical distribution, and robust statistical methodology, a realistic Expectation function, the E Value, can be calculated from database searches.

The ‘take-home’ message is . . .

‘Sequence-space’ (Huh, what’s that?) actually follows the ‘Extreme Value distribution’ (http://mathworld.wolfram.com/ExtremeValueDistribution.html).

The Expectation Value!

The higher the E value is, the more probable that the observed match is due to chance in a search of the same size database, and the lower its Z score will be, i.e. it is NOT significant. Therefore, the smaller the E value, i.e. the closer it is to zero, the more significant it is and the higher its Z score will be! The E value is the number that really matters. In other words, in order to assess whether a given alignment constitutes evidence for homology, it helps to know how strong an alignment can be expected from chance alone.

Rules of thumb for a protein search:

The Z score represents the number of standard deviations some particular alignment is from a distribution of random alignments (often the Normal distribution).

They very roughly correspond to the listed E Values (based on the Extreme Value distribution) for a typical protein sequence similarity search through a database with ~250,000 protein entries.

On to the searches:

How can you search the databases for similar sequences, if pairwise alignments take N² time?!

Significance and heuristics . . .

Database searching programs use the two concepts of dynamic programming and substitution scoring matrices; however, dynamic programming takes far too long when used against most sequence databases with a ‘normal’ computer. Remember how big the databases are!

Therefore, the programs use tricks to make things happen faster. These tricks fall into two main categories, that of hashing, and that of approximation.

Corn beef hash? Huh . . .

Hashing is the process of breaking your sequence into small ‘words’ or ‘k-tuples’ (think all chopped up, just like corn beef hash) of a set size and creating a ‘look-up’ table with those words keyed to position numbers. Computers can deal with numbers way faster than they can deal with strings of letters, and this preprocessing step happens very quickly.

Then when any of the word positions match part of an entry in the database, that match, the ‘offset,’ is saved. In general, hashing reduces the complexity of the search problem from N² for dynamic programming to N, the length of all the sequences in the database.
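The look-up table idea can be sketched in a few lines. This is an illustration only, assuming exact word matches and a plain Python dictionary as the table; the names build_word_table and find_offsets and the toy sequences are made up, and real programs pack words into integers for speed.

```python
from collections import defaultdict


def build_word_table(query, k):
    """Chop the query into overlapping k-tuples keyed to their positions."""
    table = defaultdict(list)
    for pos in range(len(query) - k + 1):
        table[query[pos:pos + k]].append(pos)
    return table


def find_offsets(table, entry, k):
    """Scan a database entry once; record (query_pos, entry_pos, offset) per word hit."""
    hits = []
    for db_pos in range(len(entry) - k + 1):
        for q_pos in table.get(entry[db_pos:db_pos + k], ()):
            hits.append((q_pos, db_pos, q_pos - db_pos))
    return hits


table = build_word_table("ACDEF", 2)
hits = find_offsets(table, "XXCDEX", 2)
print(hits)
```

Hits that share the same offset lie on the same diagonal of the comparison matrix, which is what the later steps of both BLAST and Fast build on. The entry is scanned only once, which is where the drop from N² to N comes from.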

OK. Heuristics . . . What’s that?

Approximation techniques are collectively known as ‘heuristics.’ Webster’s defines heuristic as “serving to guide, discover, or reveal; . . . but unproved or incapable of proof.”

In database similarity searching techniques the heuristic usually restricts the necessary search space by calculating some sort of a statistic that allows the program to decide whether further scrutiny of a particular match should be pursued. This statistic may miss things depending on the parameters set; that’s what makes it heuristic. ‘Worthwhile’ results at the end are compiled and the longest alignment within the program’s restrictions is created.

The exact implementation varies between the different programs, but the basic idea follows in most all of them.

Two predominant versions exist: BLAST and Fast

Both return local alignments, and are not a single program, but rather a family of programs with implementations designed to compare a sequence to a database in about every which way imaginable.

These include:

1) a DNA sequence against a DNA database (not recommended unless forced to do so because you are dealing with a non-translated region of the genome; DNA is just too darn noisy, only identity & four bases!),

2) a translated (where the translation is done ‘on-the-fly’ in all six frames) version of a DNA sequence against a translated (‘on-the-fly’ six-frame) version of the DNA database (not available in the Fast package),

3) a translated (‘on-the-fly’ six-frame) version of a DNA sequence against a protein database,

4) a protein sequence against a translated (‘on-the-fly’ six-frame) version of a DNA database,

5) or a protein sequence against a protein database.

Translated comparisons allow penalty-free frame shifts.

The BLAST and Fast programs: some generalities

BLAST (Basic Local Alignment Search Tool), developed at NCBI.

1) Normally NOT a good idea to use for DNA against DNA searches w/o translation (not optimized);

2) Pre-filters repeat and “low complexity” sequence regions;

4) Can find more than one region of gapped similarity;

5) Very fast heuristic and parallel implementation;

6) Restricted to precompiled, specially formatted databases;

FastA and its family of relatives, developed by Bill Pearson at the University of Virginia.

1) Works well for DNA against DNA searches (within limits of possible sensitivity);

2) Can find only one gapped region of similarity;

3) Relatively slow, should often be run in the background;

4) Does not require specially prepared, preformatted databases.

The algorithms, in brief:

BLAST:

Two word hits on the same diagonal above some similarity threshold trigger ungapped extension until the score isn’t improved enough above another threshold: the HSP.

Initiate gapped extensions using dynamic programming for those HSPs above a third threshold up to the point where the score starts to drop below a fourth threshold: yields alignment.

Fast:

Find all ungapped exact word hits; maximize the ten best continuous regions’ scores: init1.

Combine non-overlapping init regions on different diagonals: initn.

Use dynamic programming ‘in a band’ for all regions with initn scores better than some threshold: opt score.

BLAST: the algorithm in more detail

1) After BLAST has sorted its lookup table, it tries to find all double word hits along the same diagonal within some specified distance using what NCBI calls a Discrete Finite Automaton (DFA). These word hits of size W do not have to be identical; rather, they have to be better than some threshold value T. To identify these double word hits, the DFA scans through all strings of words (typically W=3 for peptides) that score at least T (usually 11 for peptides).

2) Each double word hit that passes this step then triggers a process called un-gapped extension in both directions, such that each diagonal is extended as far as it can, until the running score starts to drop below a pre-defined value X within a certain range A. The result of this pass is called a High-Scoring segment Pair or HSP.

3) Those HSPs that pass this step with a score better than S then begin a gapped extension step utilizing dynamic programming. Those gapped alignments with Expectation values better than the user specified cutoff are reported. The extreme value distribution of BLAST Expectation values is precomputed against each precompiled database; this is one area that speeds up the algorithm considerably.
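The un-gapped extension of step 2 can be sketched as follows. This is a simplified illustration, not NCBI's code: it extends in one direction only, uses match = +1 and mismatch = -1 instead of a substitution matrix, and ignores the range A. The name extend_right, the x_drop default and the toy sequences are all assumptions.

```python
def extend_right(query, subject, q_start, s_start, x_drop=2, match=1, mismatch=-1):
    """Extend an ungapped diagonal to the right from a seed position.

    Stops once the running score has fallen more than x_drop below the
    best score seen so far; returns (best_score, length_at_best_score).
    """
    best = running = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        same = query[q_start + i] == subject[s_start + i]
        running += match if same else mismatch
        i += 1
        if running > best:
            best, best_len = running, i   # new high-water mark
        elif best - running > x_drop:
            break                          # dropped too far: give up
    return best, best_len


hsp_score, hsp_length = extend_right("ACDEFGHKLMNP", "ACDEWGHQQQQQ", 0, 0)
print(hsp_score, hsp_length)
```

The extension survives the single F/W mismatch because the score recovers, then stops once three mismatches in a row pull it more than x_drop below the best value. Extension to the left mirrors the same loop.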

The BLAST algorithm, continued:

The math generalizes thus: for any two sequences of length m and n, local, best alignments are identified as HSPs. HSPs are stretches of sequence pairs that cannot be further improved by extension or trimming, as described above. For ungapped alignments, the number of expected HSPs with a score of at least S is given by the formula:

E = K m n e^(-λS)

This is the E-value for the score S. In a database search n is the size of the database in residues, so N=mn is the search space size. K and λ are supplied by statistical theory, and, as mentioned above, can be calculated by comparison to precomputed, simulated distributions. These two parameters define the statistical significance of an E-value.
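The formula is easy to play with numerically. The sketch below just evaluates E = K m n e^(-λS); the values used for K and λ are placeholders chosen for illustration, not the parameters of any particular scoring matrix or database, and the function name expected_hsps is made up.

```python
import math


def expected_hsps(score, m, n, k_param, lam):
    """Number of HSPs expected by chance with at least this score.

    m: query length, n: database size in residues,
    k_param and lam: the K and lambda statistical parameters.
    """
    return k_param * m * n * math.exp(-lam * score)


# Placeholder parameters, for illustration only.
K_DEMO, LAMBDA_DEMO = 0.1, 0.3

for s in (40, 60, 80):
    print(s, expected_hsps(s, m=250, n=1e8, k_param=K_DEMO, lam=LAMBDA_DEMO))
```

Two things stand out: E falls off exponentially as the score rises, and it scales linearly with the search space, so the same score is less significant in a bigger database.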

The Fast algorithm: in more detail

Fast is an older algorithm than BLAST. The original Fast paper came out in 1988, based on David Lipman’s work in a 1983 paper; the original BLAST paper was published in 1990. Both algorithms have been upgraded substantially since originally released.

Fast was the first widely used, powerful sequence database searching algorithm. Bill Pearson continually refines the programs such that they remain a viable alternative to BLAST, especially if one is restricted to searching DNA against DNA without translation. They are also very helpful in situations where BLAST finds no significant alignments; arguably, Fast may be more sensitive than BLAST in these situations.

Fast is also a hashing style algorithm and builds words of a set k-tuple size, by default two for peptides. It then identifies all exact word matches between the sequence and the database members. Note that the word matches must be exact for Fast and only similar, above some threshold, for BLAST.

The Fast algorithm, continued:

From these exact word matches:

1) Scores are assigned to each continuous, ungapped, diagonal by adding all of the exact match BLOSUM values.

2) The ten highest scoring diagonals for each query-database pair are then rescored using BLOSUM similarities as well as identities and ends are trimmed to maximize the score. The best of each of these is called the Init1 score.

3) Next the program ‘looks’ around to see if nearby off-diagonal Init1 alignments can be combined by incorporating gaps. If so, a new score, Initn, is calculated by summing up all the contributing Init1 scores, penalizing gaps with a penalty for each.

4) The program then constructs an optimal local alignment for all Initn pairs with scores better than some set threshold using a variation of dynamic programming “in a band.” A sixteen residue band centered at the highest Init1 region is used by default with peptides. The score generated from this step is called opt.

The Fast algorithm, still continued:

5) Next, Fast uses a simple linear regression against the natural log of the search set sequence length to calculate a normalized z-score for the sequence pair. Note that this is not the same Monte Carlo style Z score described earlier, and can not be directly compared to one.

6) Finally, it compares the distribution of these z-scores to the actual extreme-value distribution of the search. Using this distribution, the program estimates the number of sequences that would be expected to have, purely by chance, a z-score greater than or equal to the z-score obtained in the search. This is reported as the Expectation value.

7) If the user requests pair-wise alignments in the output, then the program uses full Smith-Waterman local dynamic programming, not ‘restricted to a band,’ to produce its final alignments.
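Step 5 can be pictured with an ordinary least-squares fit. The sketch below regresses scores on the natural log of the database sequence lengths and expresses each score as the number of standard deviations it sits above the fitted line. It is only meant to convey the idea; the actual program's regression details and z-score scaling are not reproduced here, and the name length_normalized_z and the sample numbers are made up.

```python
import math
import statistics


def length_normalized_z(scores, lengths):
    """z-score of each score relative to a least-squares fit on ln(length)."""
    logs = [math.log(n) for n in lengths]
    mean_x, mean_y = statistics.mean(logs), statistics.mean(scores)
    sxx = sum((x - mean_x) ** 2 for x in logs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(logs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual: how far each score lies above or below the fitted line.
    residuals = [y - (intercept + slope * x) for x, y in zip(logs, scores)]
    spread = statistics.pstdev(residuals)
    return [r / spread for r in residuals]


# Invented data: five database entries, one of which scores unusually well.
z_scores = length_normalized_z(
    scores=[20, 23, 60, 29, 32],
    lengths=[100, 200, 400, 800, 1600],
)
print(z_scores)
```

The point of the regression is that longer database sequences score higher by chance alone, so a raw score has to be judged against what is typical for entries of that length.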

What’s the deal with DNA versus protein for searches and alignment?

All database similarity searching and sequence alignment, regardless of the algorithm used, is far more sensitive at the amino acid level than at the DNA level. This is because proteins have twenty match criteria versus DNA’s four, and those four DNA bases can generally only be identical, not similar, to each other; and many DNA base changes (especially third position changes) do not change the encoded protein.

All of these factors drastically increase the ‘noise’ level of a DNA against DNA search, and give protein searches a much greater ‘look-back’ time, at least doubling it.

Therefore, whenever dealing with coding sequence, it is always prudent to search at the protein level!

So what; why even bother?

More data yields stronger analyses, as long as it is done carefully!

Mosaic ideas and evolutionary ‘importance.’

Applications:

Probe, primer, and motif design;

Graphical illustrations;

Comparative ‘homology’ inference;

Molecular evolutionary analysis.

All right, how do you do it?

On to multiple sequence alignment & analysis:

Dynamic programming’s complexity increases exponentially with the number of sequences being compared:

N-dimensional matrix . . . . complexity = [sequence length]^(number of sequences)

i.e. complexity is O(e^n), exponential in the number of sequences n.

Use different types of ‘tricks.’ See:

MSA (‘global’ within ‘bounding box’) and

PIMA (‘local’ portions only);

but both of these programs have severely limiting restrictions!
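To put the growth of the N-dimensional matrix into numbers, here is a minimal sketch. It assumes all sequences share one length (300 residues is an arbitrary illustrative value) and counts only matrix cells and the neighbouring cells each one depends on, ignoring constant factors.

```python
def dp_cells(seq_len, n_seqs):
    """Number of cells in a full N-dimensional dynamic programming matrix."""
    return seq_len ** n_seqs


def predecessors_per_cell(n_seqs):
    """Each cell is reached from 2^N - 1 neighbours (every gap/residue combination
    across the N sequences except the one that advances in no sequence)."""
    return 2 ** n_seqs - 1


for n in (2, 3, 5, 10):
    print(f"{n:>2} sequences: {dp_cells(300, n):.3e} cells, "
          f"{predecessors_per_cell(n)} predecessors per cell")
```

Two sequences of length 300 need 90,000 cells; three already need 27 million, which is why exhaustive alignment is abandoned beyond a handful of sequences.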

‘Global’ heuristic solutions of the N-dimensional matrix:

Therefore, the most common implementation, pairwise, progressive dynamic programming, restricts the solution to the neighborhood of only two sequences at a time.

All sequences are compared, pairwise, and then each is aligned to its most similar partner or group of partners. Each group of partners is then aligned to finish the complete multiple sequence alignment.
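The ordering step described above can be sketched as follows. This is only an illustration under stated assumptions, not how ClustalW or PileUp are implemented: shared 3-mer (Jaccard) similarity stands in for real pairwise alignment scores, groups are joined greedily by average similarity, and the four peptide sequences are invented. The function returns the order in which groups would be aligned; the alignments themselves are not computed.

```python
from itertools import combinations


def kmer_set(seq, k=3):
    """All overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of k-mer sets: a crude proxy for a pairwise alignment score."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    return len(ka & kb) / len(ka | kb)


def join_order(seqs):
    """Return the merge order of groups, most similar first.

    seqs maps names to sequences. Groups are tuples of names, and the
    similarity of two groups is the average over all cross-group pairs.
    """
    pair_sim = {frozenset((x, y)): similarity(seqs[x], seqs[y])
                for x, y in combinations(seqs, 2)}

    def group_sim(g1, g2):
        total = sum(pair_sim[frozenset((x, y))] for x in g1 for y in g2)
        return total / (len(g1) * len(g2))

    groups = [(name,) for name in seqs]
    merges = []
    while len(groups) > 1:
        g1, g2 = max(combinations(groups, 2), key=lambda pair: group_sim(*pair))
        groups.remove(g1)
        groups.remove(g2)
        groups.append(g1 + g2)
        merges.append((g1, g2))
    return merges


# Hypothetical toy sequences: A/B are near-identical, as are C/D.
toy = {
    "A": "MKTAYIAKQRQISFVK",
    "B": "MKTAYIAKQRQLSFVK",
    "C": "MSTPYLAKQWWISGVK",
    "D": "MSTPYLAKQWWISGVR",
}
merges = join_order(toy)
for left, right in merges:
    print("align", left, "with", right)
```

In a real progressive aligner each merge would be a pairwise dynamic programming alignment of two sequences, a sequence and a group, or two groups.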

Multiple Sequence Dynamic Programming:

Web resources for pairwise, progressive multiple alignment in the USA include the Baylor College of Medicine’s Search Launcher:

http://searchlauncher.bcm.tmc.edu/

However, problems with large datasets and huge multiple alignments make doing multiple sequence alignment on the Web impractical after your dataset has reached a certain size. You’ll know it when you’re there!

So, what else is available?

Stand-alone ClustalW is available for all operating systems; its graphical user interface, ClustalX, makes running it very easy.

And dedicated biocomputing server suites, like the GCG Wisconsin Package, which includes PileUp and ClustalW and the SeqLab graphical user interface, are another powerful solution.

Furthermore, newer software such as T-Coffee, MUSCLE, ProbCons, POA, MAFFT, etc. add various tweaks and tricks to make the entire process more accurate and/or faster.

Reliability and the Comparative Approach:

explicit homologous correspondence;

manual adjustments based on knowledge, especially structural, regulatory, and functional sites.

Therefore, editors like SeqLab and the Ribosomal Database Project:

http://rdp.cme.msu.edu/index.jsp

Structural & Functional correspondence in the Wisconsin Package’s SeqLab:

As with pairwise methods, work with proteins, if at all possible!

Twenty match symbols versus four, plus similarity! Way better signal to noise.

Also guarantees no indels are placed within codons. So translate, then align.

Nucleotide sequences will only reliably align if they are very similar to each other. And they will require extensive hand editing and careful consideration.
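A minimal sketch of the ‘translate, then align’ idea: once the protein alignment exists, its gaps can be threaded back onto the coding DNA three bases at a time, so no indel can land inside a codon. The function name `thread_codons` and the example sequences are hypothetical, and the sketch assumes the coding sequence is in frame and exactly three times the length of the ungapped protein.

```python
def thread_codons(aligned_protein, cds, gap="-"):
    """Map gaps from an aligned protein back onto its coding DNA, codon by codon."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for ch in aligned_protein if ch != gap)
    if len(cds) % 3 or n_residues != len(codons):
        raise ValueError("CDS length must be 3x the number of ungapped residues")
    remaining = iter(codons)
    # Each gap column becomes three gap characters; each residue takes the next codon.
    return "".join(gap * 3 if ch == gap else next(remaining)
                   for ch in aligned_protein)


# Hypothetical example: one gap column in the protein alignment.
aligned = "MA-LK"
cds = "ATGGCTCTGAAA"
print(thread_codons(aligned, cds))
```

Every gap in the resulting nucleotide alignment is a multiple of three bases long and sits on a codon boundary.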

Beware of aligning apples and oranges [and grapefruit]!

Receptor versus activator, and so on, ad nauseam;

paralogue versus orthologue;

genomic versus cDNA;

mature versus precursor.

Mask out uncertain areas:

Complications:

Order dependence.

Not that big of a deal.

Substitution matrices and gap penalties.

A very big deal!

Regional ‘realignment’ becomes incredibly important, especially with sequences that have areas of high and low similarity.

There’s a bewildering assortment of bioinformatics databases and ways to access and manipulate the information within them. The key is to learn how to use the data and the methods in the most efficient manner! The better you understand the chemical, physical, and biological systems involved, the better your chance of success in analyzing them. Certain strategies are inherently more appropriate than others in certain circumstances. Making these types of subjective, discriminatory decisions is one of the most important ‘take-home’ messages I can offer!

Gunnar von Heijne, in his old but incredibly readable treatise, Sequence Analysis in Molecular Biology: Treasure Trove or Trivial Pursuit (1987), provides a very appropriate conclusion:

“Think about what you’re doing; use your knowledge of the molecular system involved to guide both your interpretation of results and your direction of inquiry; use as much information as possible; and do not blindly accept everything the computer offers you.”

“. . . if any lesson is to be drawn . . . it surely is that to be able to make a useful contribution one must first and foremost be a biologist, and only second a theoretician . . . . We have to develop better algorithms, we have to find ways to cope with the massive amounts of data, and above all we have to become better biologists. But that’s all it takes.”

Conclusions:

References:

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990) Basic Local Alignment Search Tool. Journal of Molecular Biology 215, 403–410.

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a New Generation of Protein Database Search Programs. Nucleic Acids Research 25, 3389–3402.

Bailey, T.L. and Elkan, C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers, in Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, California, U.S.A. pp. 28–36.

Bairoch, A. (1992) PROSITE: A Dictionary of Sites and Patterns in Proteins. Nucleic Acids Research 20, 2013–2018.

Eddy, S.R. (1996) Hidden Markov models. Current Opinion in Structural Biology 6, 361–365.

Eddy, S.R. (1998) Profile hidden Markov models. Bioinformatics 14, 755–763.

Felsenstein, J. (1993) PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Dept. of Genetics, University of Washington, Seattle, Washington, U.S.A.

Feng, D.F. and Doolittle, R.F. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. Journal of Molecular Evolution 25, 351–360.

Genetics Computer Group (GCG) (Copyright 1982-2007) Program Manual for the Wisconsin Package, Version 10. Accelrys, Inc., A Pharmacopeia Company, San Diego, California, U.S.A.

Gilbert, D.G. (1993 [C release] and 1999 [Java release]) ReadSeq, public domain software distributed by the author. http://iubio.bio.indiana.edu/soft/molbio/readseq/ Bioinformatics Group, Biology Department, Indiana University, Bloomington, Indiana, U.S.A.

Gribskov, M. and Devereux, J., editors (1992) Sequence Analysis Primer. W.H. Freeman and Company, New York, New York, U.S.A.

Gribskov, M., McLachlan, A.D., and Eisenberg, D. (1987) Profile analysis: detection of distantly related proteins. Proceedings of the National Academy of Sciences U.S.A. 84, 4355–4358.

Gupta, S.K., Kececioglu, J.D., and Schaffer, A.A. (1995) Improving the practical space and time efficiency of the shortest-paths approach to sum-of-pairs multiple sequence alignment. Journal of Computational Biology 2, 459–472.

Henikoff, S. and Henikoff, J.G. (1992) Amino Acid Substitution Matrices from Protein Blocks. Proceedings of the National Academy of Sciences U.S.A. 89, 10915–10919.

Needleman, S.B. and Wunsch, C.D. (1970) A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. Journal of Molecular Biology 48, 443–453.

Pearson, W.R. and Lipman, D.J. (1988) Improved Tools for Biological Sequence Analysis. Proceedings of the National Academy of Sciences U.S.A. 85, 2444–2448.

Schwartz, R.M. and Dayhoff, M.O. (1979) Matrices for Detecting Distant Relationships. In Atlas of Protein Sequences and Structure (M.O. Dayhoff, editor) 5, Suppl. 3, 353–358, National Biomedical Research Foundation, Washington D.C., U.S.A.

Smith, R.F. and Smith, T.F. (1992) Pattern-induced multi-sequence alignment (PIMA) algorithm employing secondary structure-dependent gap penalties for comparative protein modelling. Protein Engineering 5, 35–41.

Smith, T.F. and Waterman, M.S. (1981) Comparison of Bio-Sequences. Advances in Applied Mathematics 2, 482–489.

Swofford, D.L. (1989–2007) PAUP* (Phylogenetic Analysis Using Parsimony and other methods) version 4.0+. Florida State University, Tallahassee, Florida, U.S.A. http://paup.csit.fsu.edu/ Distributed through Sinauer Associates, Inc. http://www.sinauer.com/ Sunderland, Massachusetts, U.S.A.

Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F., and Higgins, D.G. (1997) The ClustalX windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Research 25, 4876–4882.

Thompson, J.D., Higgins, D.G., and Gibson, T.J. (1994) CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22, 4673–4680.

Wilbur, W.J. and Lipman, D.J. (1983) Rapid Similarity Searches of Nucleic Acid and Protein Data Banks. Proceedings of the National Academy of Sciences U.S.A. 80, 726–730.
