An Introduction to the GCG SeqLab GUI

. . . some taste of theory, and a few practicalities

Steve Thompson

Florida State University School of Computational Science (SCS)

Fort Valley State University

July 16 & 17, 2008

To begin, some terminology:

What is bioinformatics, genomics, proteomics, sequence analysis, computational molecular biology . . . ?

My definitions, lots of overlap

Biocomputing and computational biology are synonyms and describe the use of computers and computational techniques to analyze any type of biological system, from individual molecules to organisms to overall ecology.

Bioinformatics describes using computational techniques to access, analyze, and interpret the biological information in any type of biological database.

Sequence analysis is the study of molecular sequence data for the purpose of inferring the function, interactions, evolution, and perhaps structure of biological molecules.

Genomics analyzes the context of genes or complete genomes (the total DNA content of an organism) within the same and/or across different genomes.

Proteomics is the subdivision of genomics concerned with analyzing the complete protein complement, i.e. the proteome, of organisms, both within and between different organisms.

And one way to think about it: the Reverse Biochemistry Analogy

Biochemists no longer have to begin a research project by isolating and purifying massive amounts of a protein from its native organism in order to characterize a particular gene product. Rather, scientists can now amplify a section of some genome based on its similarity to other genomes, sequence that piece of DNA and, using sequence analysis tools, infer all sorts of functional, regulatory, evolutionary, and, perhaps, structural insight into that stretch of DNA!

The computer and molecular databases are a necessary, integral part of this entire process.

The exponential growth of molecular sequence databases & cpu power

Year    BasePairs      Sequences
1982    680338         606
1983    2274029        2427
1984    3368765        4175
1985    5204420        5700
1986    9615371        9978
1987    15514776       14584
1988    23800000       20579
1989    34762585       28791
1990    49179285       39533
1991    71947426       55627
1992    101008486      78608
1993    157152442      143492
1994    217102462      215273
1995    384939485      555694
1996    651972984      1021211
1997    1160300687     1765847
1998    2008761784     2837897
1999    3841163011     4864570
2000    11101066288    10106023
2001    15849921438    14976310
2002    28507990166    22318883
2003    36553368485    30968418
2004    44575745176    40604319
2005    56037734462    52016762
2006    69019290705    64893747
2007    83874179730    80388382

Doubling time about a year and a half!

http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
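The doubling-time figure can be checked against the first and last rows of the GenBank table. The sketch below assumes steady exponential growth between 1982 and 2007, which is a simplification; the function name `doubling_time` is made up for this illustration.

```python
import math


def doubling_time(years_elapsed, start_count, end_count):
    """Average doubling time, assuming steady exponential growth."""
    doublings = math.log2(end_count / start_count)
    return years_elapsed / doublings


# Base-pair totals for 1982 and 2007, copied from the GenBank table.
bp_1982 = 680338
bp_2007 = 83874179730

years = doubling_time(2007 - 1982, bp_1982, bp_2007)
print(f"Average doubling time: {years:.2f} years")
```

Using only the two endpoints hides year-to-year variation, so this is a rough average rather than a fitted growth rate.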

Sequence database growth, continued

The International Human Genome Sequencing Consortium announced the completion of the "Working Draft" of the human genome in June 2000; independently that same month, the private company Celera Genomics announced that it had completed the first "Assembly" of the human genome. The classic articles were published mid-February 2001 in the journals Science and Nature.

Genome projects keep the data coming at an incredible rate. Currently around 50 Archaea, 600 Bacteria, and 20 Eukaryote complete genomes, and 200 Eukaryote assemblies are represented, not counting the almost 3,000 virus and viroid genomes available.

Some neat stuff from the human genome papers

Homo sapiens aren't nearly as special as we once thought. Of the 3.2 billion base pairs in our DNA:

Traditional gene number estimates were often in the 100,000 range; it turns out we've only got about twice as many as a fruit fly, between 25,000 and 30,000!

The protein coding region of the genome is only about 1% or so. A bunch of the remainder is 'jumping,' 'selfish DNA,' sometimes called 'junk,' much of which may be involved in regulation and control.

Some 100-200 genes were transferred from an ancestral bacterial genome to an ancestral vertebrate genome! (Later shown to be false by more extensive analyses; the pattern is due to gene loss, not transfer.)

Let's start with sequence databases

Sequence databases are an organized way to store exponentially accumulating sequence data. An 'alphabet soup' of three major organizations maintains them. They largely 'mirror' one another and share accession codes, but NOT proper identifier names:

North America: the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM), at the National Institutes of Health (NIH), maintains the GenBank (& WGS) nucleotide, GenPept amino acid, and RefSeq genome, transcriptome, and proteome databases.

Europe: the European Molecular Biology Laboratory (EMBL), the European Bioinformatics Institute (EBI), and the Swiss Institute of Bioinformatics (SIB) all help maintain the EMBL nucleotide sequence database, and the UNIPROT (SWISS-PROT + TrEMBL) amino acid sequence database (with USA PIR/NBRF support also).

Asia: The National Institute of Genetics (NIG) supports the Center for Information Biology's (CIB) DNA Data Bank of Japan (DDBJ).

NCBI's Entrez

A little history

The first well recognized sequence database was Dr. Margaret Dayhoff's hardbound Atlas of Protein Sequence and Structure, begun in the mid-sixties. That became PIR. DDBJ began in 1984, GenBank in 1982, and EMBL in 1980. They are all attempts at establishing an organized, reliable, comprehensive, and openly available library of genetic sequences.

Sequence databases have long since outgrown a hardbound atlas that you can pull off a library shelf. They have become gargantuan and have evolved through many, many changes.

What are sequence databases like?

Just what are primary sequences? (Central Dogma: DNA -> RNA -> protein)

Primary refers to one dimension: all of the 'symbol' information written in sequential order necessary to specify a particular biological molecular entity, be it polypeptide or nucleotide.

The symbols are the one letter codes for all of the biological nitrogenous bases and amino acid residues and their ambiguity codes. Biological carbohydrates, lipids, and structural and functional information are not sequence data. Not even DNA CDS protein translations in a DNA database are sequence data!

However, much of this feature and bibliographic type information is available in the reference documentation sections associated with primary sequences in the databases.

Software is required to successfully interact with these databases, and access is most easily handled through various software packages and interfaces, on the World Wide Web or otherwise.

Sequence database organization

Nucleic Acid DBs: GenBank/EMBL/DDBJ, covering all taxonomic categories + WGS, HTC & HTG + STS, EST, & GSS, a.k.a. "Tags"

Amino Acid DBs: UNIPROT = SWISS-PROT + TrEMBL (with help from PIR); GenPept

Nucleic acid sequence databases are split into subdivisions based on taxonomy and data type. TrEMBL sequences are merged into SWISS-PROT as they receive increased levels of annotation. Both together comprise UNIPROT. GenPept has minimal annotation.

Parts and problems

Important elements associated with each sequence entry:

Name: LOCUS, ENTRY, ID, all are unique identifiers.
Definition: a.k.a. title, a brief textual sequence description.
Accession Number: a constant data identifier.
Source and taxonomy information; complete literature references; comments and keywords; and the all-important FEATURE table!
A summary or checksum line, and the sequence itself.

However: Each major database, as well as each major suite of software tools, has its own distinct format requirements. Changes over the years are a huge hassle. Standards are argued, e.g. XML, but unfortunately, until all biologists and computer scientists worldwide agree on one standard, and all software is (re)written to that standard, neither of which is likely to happen very quickly, if ever, format issues will remain one of the most confusing and troubling aspects of working with sequence data. Specialized format conversion tools expedite the chore, but becoming familiar with some of the common formats helps a lot.
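To make the entry elements a little more concrete, here is a minimal sketch that pulls the name, definition, and accession out of a GenBank-style flat-file record. It assumes only that the LOCUS, DEFINITION, and ACCESSION keywords start their lines; the record itself is invented for illustration, and `parse_header` is a hypothetical helper, not part of GCG or any NCBI toolkit.

```python
def parse_header(record_text):
    """Collect name, definition, and accession from a GenBank-style record."""
    fields = {}
    for line in record_text.splitlines():
        for keyword in ("LOCUS", "DEFINITION", "ACCESSION"):
            if line.startswith(keyword):
                fields[keyword] = line[len(keyword):].strip()
    # The entry name is the first token on the LOCUS line.
    locus_parts = fields.get("LOCUS", "").split()
    if locus_parts:
        fields["LOCUS"] = locus_parts[0]
    return fields


# Made-up record for illustration only; not a real database entry.
EXAMPLE = """\
LOCUS       EXAMPLE01    24 bp    DNA     linear   SYN 01-JAN-2008
DEFINITION  Hypothetical example sequence.
ACCESSION   XX000000
ORIGIN
        1 ctatataagg cgtataatct atat
//
"""

header = parse_header(EXAMPLE)
print(header)
```

Other formats (EMBL's ID line, for instance) use different keywords, which is exactly the kind of discrepancy that makes format conversion tools necessary.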

More format complications

Indels and missing data symbols (i.e. gaps) designation discrepancy headaches: ., -, ~, ?, N, or X . . . Help!

Specialized 'sequence'-type databases

Databases that contain special types of sequence information, such as patterns, motifs, and profiles. These include: REBASE, EPD, PROSITE, BLOCKS, ProDom, Pfam . . . .

Databases that contain multiple sequence entries aligned, e.g. PopSet, RDP, and ALN.

Databases that contain families of sequences ordered functionally, structurally, or phylogenetically, e.g. iProClass and HOVERGEN.

Databases of species specific sequences, e.g. the HIV Database and the Giardia lamblia Genome Project.

And on and on . . . . See Amos Bairoch's excellent links page: http://us.expasy.org/alinks.html.

Map browsers try to tie much of this information together

Genetic linkage mapping databases for most large genome projects (H. sapiens, Mus, Drosophila, C. elegans, Saccharomyces, Arabidopsis, E. coli . . .) usually link to other databases within the context of a genome browser or map viewer.

Examples include: NCBI's Map Viewer (http://www.ncbi.nlm.nih.gov/mapview/), the Ensembl Project (http://www.ensembl.org/), the UCSC Genome Browser (http://genome.ucsc.edu/), and the Lawrence Livermore National Laboratory ECR Browser (http://www.dcode.org/).

[Screenshot: NCBI's Map Viewer (http://www.ncbi.nlm.nih.gov/mapview/)]

[Screenshot: Sanger Center for BioInformatics Ensembl project (http://www.ensembl.org/)]

[Screenshot: University of California, Santa Cruz Genome Browser (http://genome.ucsc.edu/)]

What about other types of biological databases? Three-dimensional structure databases

The Protein Data Bank and Rutgers Nucleic Acid Database. See Molecules to Go at http://molbio.info.nih.gov/cgi-bin/pdb/.

These databases contain all of the 3D atomic coordinate data necessary to define the tertiary shape of a particular biological molecule. The data is usually experimentally derived, either by X-ray crystallography or by NMR; sometimes it's hypothetical. Secondary structure boundaries, sequence data, source, resolution, and references are given in the annotation.

These databases enable the technique of homology modeling to actually work pretty well, given your sequence is similar enough to solved structures (see the automated Swiss-Model server at http://swissmodel.expasy.org/SWISS-MODEL.html).

Molecular visualization and/or modeling software is required to interact with the data. It has little meaning on its own.

And still other types of bioinfo' databases

Consider these 'non-molecular' but they often link to molecules:

Reference Databases (all w/ pointers to sequences): e.g.
LocusLink/Gene: integrated knowledge base
OMIM: Online Mendelian Inheritance in Man
PubMed/MedLine: over 11 million citations from more than 4 thousand bio/medical scientific journals.

Phylogenetic Tree Databases: e.g. the Tree of Life.

Metabolic Pathway Databases: e.g. WIT (What Is There), Japan's GenomeNet KEGG (the Kyoto Encyclopedia of Genes and Genomes), and the human Reactome.

Population studies data: which strains, where, etc.

And then databases that many biocomputing people don't even usually consider: e.g. GIS/GPS/remote sensing data, medical records, census counts, mortality and birth rates . . . .

Enter pairwise alignment, similarity searching, significance, and homology.

So, given some biological sequence data, what more can we learn about its evolution, structure, function, mechanism, and regulation in life?

First, just what is homology and similarity: are they the same?

Don't confuse homology with similarity: there is a huge difference! Similarity is a statistic that describes how much two (sub)sequences are alike according to some set scoring criteria. It can be normalized to ascertain statistical significance, but it's still just a number.

Homology, in contrast and by definition, implies an evolutionary relationship, more than just everything evolving from the same primordial 'ooze.' Reconstruct the phylogeny of the organisms or genes of interest to demonstrate homology. Better yet, show experimental evidence (structural, morphological, genetic, and/or fossil) that corroborates your claim.

There is no such thing as percent homology; something is either homologous or it is not. Walter Fitch said "homology is like pregnancy: you can't be 45% pregnant, just like something can't be 45% homologous. You either are or you are not." Highly significant similarity can argue for homology, but not the inverse.

OK, so how can we see if two sequences are similar? First, to introduce the concept, a graphical method . . .

One way: dot matrices.

Provide a 'Gestalt' of all possible alignments between two sequences.

To begin, use a very simple 0, 1 (match, no match) identity scoring function.

Put a dot wherever symbols match.

Identities and insertion/deletion events (indels) identified (zero:one match score matrix, no window).

Noise due to random composition effects contributes to confusion. To 'clean up' the plot, consider a filtered windowing approach. A dot is placed at the middle of a window if some 'stringency' is met within that defined window size. Then the window is shifted one position and the entire process is repeated (zero:one match score, window of size three and a stringency level of two out of three).
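The windowed dot matrix can be sketched in a few lines of Python. This is a minimal illustration rather than the GCG implementation: the function names `dot_plot` and `render` are invented here, and the window is assumed to slide along diagonals with the dot placed at its middle position, as described above.

```python
def dot_plot(seq1, seq2, window=1, stringency=1):
    """Return the set of (i, j) positions that receive a dot.

    A window of the given size slides along every diagonal; a dot is
    placed at the window's middle when at least `stringency` of its
    positions are identical (zero:one match scoring).
    """
    dots = set()
    half = window // 2
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            matches = sum(seq1[i + k] == seq2[j + k] for k in range(window))
            if matches >= stringency:
                dots.add((i + half, j + half))
    return dots


def render(seq1, seq2, dots):
    """Draw the plot as text, seq2 across the top and seq1 down the side."""
    lines = ["  " + seq2]
    for i, symbol in enumerate(seq1):
        row = "".join("." if (i, j) in dots else " " for j in range(len(seq2)))
        lines.append(symbol + " " + row)
    return "\n".join(lines)


a, b = "CTATATAAGG", "CGTATAAT"
print(render(a, b, dot_plot(a, b)))                          # no window
print(render(a, b, dot_plot(a, b, window=3, stringency=2)))  # filtered
```

With `window=1` every identity gets a dot, noise included; raising the window size and stringency trades sensitivity for a cleaner plot.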

Exact alignment: but how can we 'see' the correspondence of individual residues?

We can compare one molecule against another by aligning them. However, a 'brute force' approach just won't work. Even without considering the introduction of gaps, the computation required to compare all possible alignments between two sequences requires time proportional to the product of the lengths of the two sequences. Therefore, if the two sequences are approximately the same length (N), this is an $N^2$ problem. To include gaps, we would have to repeat the calculation 2N times to examine the possibility of gaps at each possible position within the sequences, now an $N^{4N}$ problem. There's no way! We need an algorithm.

But . . . just what the heck is an algorithm?

Merriam-Webster's says: "A rule of procedure for solving a problem [often mathematical] that frequently involves repetition of an operation."

So, you could write an algorithm for tying your shoe! It's just a set of explicit instructions for doing some routine task.

Enter the Dynamic Programming Algorithm!

Computer scientists figured it out long ago; Needleman and Wunsch applied it to the alignment of the full lengths of two sequences in 1970. An optimal alignment is defined as an arrangement of two sequences, 1 of length i and 2 of length j, such that:

1) you maximize the number of matching symbols between 1 and 2;
2) you minimize the number of indels within 1 and 2; and
3) you minimize the number of mismatched symbols between 1 and 2.

Therefore, the actual solution can be represented by:

$$
S_{ij} = s_{ij} + \max
\begin{cases}
S_{i-1,\,j-1} & \text{or} \\
\max\limits_{2 \le x \le i} \left( S_{i-x,\,j-1} + w_{x-1} \right) & \text{or} \\
\max\limits_{2 \le y \le j} \left( S_{i-1,\,j-y} + w_{y-1} \right)
\end{cases}
$$

Where $S_{ij}$ is the score for the alignment ending at i in sequence 1 and j in sequence 2,
$s_{ij}$ is the score for aligning i with j,
$w_x$ is the score for making an x long gap in sequence 1,
$w_y$ is the score for making a y long gap in sequence 2,
allowing gaps to be any length in either sequence.

An oversimplified path matrix exampleAn oversimplified path matrix example

total penalty = gap opening penalty {zero here} + ([length of gap][gap extension penalty {one total penalty = gap opening penalty {zero here} + ([length of gap][gap extension penalty {one here}])here}])

Optimum Alignments

There may be more than one best path through the matrix (and optimum doesn’t guarantee biologically correct). Starting at the top and working down, then tracing back, the two best trace-back routes define the following two alignments:

cTATAtAagg     cTATAtAagg
|  |||||   and    |||||
cg.TAtAaT.     .cgTAtAaT.

With the example’s scoring scheme these alignments have a score of 5, the highest bottom-right score in the trace-back path graph, and the sum of six matches minus one interior gap. This is the number optimized by the algorithm, not any type of a similarity or identity percentage, here 75% and 62% respectively! Software will report only one optimal solution.

This was a Needleman Wunsch global solution. Smith Waterman style local solutions use negative numbers in the match matrix and pick the best diagonal within the overall graph.
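The fill-then-trace-back idea can be sketched in a few lines of Python. This is only a minimal illustration, not the slide's exact example: the function name and its default scores (match 1, mismatch 0, one point per gap position) are assumptions, and end gaps are penalized here, which may differ from the scheme used in the path matrix figure.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=0, gap=1):
    """Return (score, aligned1, aligned2) for ONE optimal global alignment.

    Assumed scheme: linear gap cost (no opening penalty), end gaps penalized.
    """
    n, m = len(seq1), len(seq2)
    # score[i][j] = best score for aligning seq1[:i] with seq2[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -gap * i
    for j in range(1, m + 1):
        score[0][j] = -gap * j

    def sub(i, j):
        return match if seq1[i - 1] == seq2[j - 1] else mismatch

    # Fill the matrix from the top-left corner down to the bottom-right.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] - gap,
                              score[i][j - 1] - gap)

    # Trace back one best path; ties are broken in favor of the diagonal.
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            a1.append(seq1[i - 1]); a2.append(seq2[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] - gap:
            a1.append(seq1[i - 1]); a2.append("."); i -= 1
        else:
            a1.append("."); a2.append(seq2[j - 1]); j -= 1
    return score[n][m], "".join(reversed(a1)), "".join(reversed(a2))
```

Because ties are resolved in a fixed order, the sketch returns a single alignment even when several paths share the optimal score, which mirrors the remark that software reports only one optimal solution.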

What about proteins: conservative replacements and similarity as opposed to identity. The nitrogenous bases are either the same or they’re not, but amino acids can be similar, genetically, evolutionarily, and structurally!

The BLOSUM62 table (Henikoff and Henikoff, 1992)

Identity values range from 4 to 11, some similarities are as high as 3, and negative values for those substitutions that rarely occur go as low as -4. The most conserved residue is tryptophan with a score of 11; cysteine is next with a score of 9; both proline and tyrosine get scores of 7 for identity.

   A  B  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  X  Y  Z
A  4 -2  0 -2 -1 -2  0 -2 -1 -1 -1 -1 -2 -1 -1 -1  1  0  0 -3 -1 -2 -1
B -2  6 -3  6  2 -3 -1 -1 -3 -1 -4 -3  1 -1  0 -2  0 -1 -3 -4 -1 -3  2
C  0 -3  9 -3 -4 -2 -3 -3 -1 -3 -1 -1 -3 -3 -3 -3 -1 -1 -1 -2 -1 -2 -4
D -2  6 -3  6  2 -3 -1 -1 -3 -1 -4 -3  1 -1  0 -2  0 -1 -3 -4 -1 -3  2
E -1  2 -4  2  5 -3 -2  0 -3  1 -3 -2  0 -1  2  0  0 -1 -2 -3 -1 -2  5
F -2 -3 -2 -3 -3  6 -3 -1  0 -3  0  0 -3 -4 -3 -3 -2 -2 -1  1 -1  3 -3
G  0 -1 -3 -1 -2 -3  6 -2 -4 -2 -4 -3  0 -2 -2 -2  0 -2 -3 -2 -1 -3 -2
H -2 -1 -3 -1  0 -1 -2  8 -3 -1 -3 -2  1 -2  0  0 -1 -2 -3 -2 -1  2  0
I -1 -3 -1 -3 -3  0 -4 -3  4 -3  2  1 -3 -3 -3 -3 -2 -1  3 -3 -1 -1 -3
K -1 -1 -3 -1  1 -3 -2 -1 -3  5 -2 -1  0 -1  1  2  0 -1 -2 -3 -1 -2  1
L -1 -4 -1 -4 -3  0 -4 -3  2 -2  4  2 -3 -3 -2 -2 -2 -1  1 -2 -1 -1 -3
M -1 -3 -1 -3 -2  0 -3 -2  1 -1  2  5 -2 -2  0 -1 -1 -1  1 -1 -1 -1 -2
N -2  1 -3  1  0 -3  0  1 -3  0 -3 -2  6 -2  0  0  1  0 -3 -4 -1 -2  0
P -1 -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2  7 -1 -2 -1 -1 -2 -4 -1 -3 -1
Q -1  0 -3  0  2 -3 -2  0 -3  1 -2  0  0 -1  5  1  0 -1 -2 -2 -1 -1  2
R -1 -2 -3 -2  0 -3 -2  0 -3  2 -2 -1  0 -2  1  5 -1 -1 -3 -3 -1 -2  0
S  1  0 -1  0  0 -2  0 -1 -2  0 -2 -1  1 -1  0 -1  4  1 -2 -3 -1 -2  0
T  0 -1 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1  0 -1 -1 -1  1  5  0 -2 -1 -2 -1
V  0 -3 -1 -3 -2 -1 -3 -3  3 -2  1  1 -3 -2 -2 -3 -2  0  4 -3 -1 -1 -2
W -3 -4 -2 -4 -3  1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -2 -3 11 -1  2 -3
X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
Y -2 -3 -2 -3 -2  3 -3  2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1  2 -1  7 -2
Z -1  2 -4  2  5 -3 -2  0 -3  1 -3 -2  0 -1  2  0  0 -1 -2 -3 -1 -2  5
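To make the table concrete, here is a small Python sketch that scores an ungapped comparison of two peptides. The dictionary holds only a handful of entries copied from the table above; the names `BLOSUM62_EXCERPT`, `pair_score`, and `ungapped_score` are made up for this illustration, and a real program would load the complete matrix.

```python
# A few entries copied from the table above; the matrix is symmetric,
# so each pair is stored once and looked up in either order.
BLOSUM62_EXCERPT = {
    ("W", "W"): 11, ("C", "C"): 9,
    ("I", "I"): 4, ("V", "V"): 4, ("L", "L"): 4,
    ("I", "V"): 3, ("I", "L"): 2,
    ("W", "C"): -2,
}

def pair_score(a, b, table=BLOSUM62_EXCERPT):
    """Look up the substitution score for residues a and b."""
    if (a, b) in table:
        return table[(a, b)]
    return table[(b, a)]  # raises KeyError for pairs outside the excerpt

def ungapped_score(pep1, pep2):
    """Sum the substitution scores position by position (no gaps allowed)."""
    if len(pep1) != len(pep2):
        raise ValueError("peptides must have the same length")
    return sum(pair_score(a, b) for a, b in zip(pep1, pep2))
```

Note how an isoleucine to valine replacement still earns a positive score: that is the 'similarity as opposed to identity' idea in numbers.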

We can imagine screening databases for sequences similar to ours using these concepts of dynamic programming and substitution scoring matrices and some yet to be described algorithmic tricks. But what do database searches tell us; what can we gain from them?

Why even bother? Inference through homology is a fundamental principle of biology!

When a sequence is found to fall into a preexisting family we may be able to recognize genes, and infer function, regulation, mechanism, evolution, and perhaps even structure, based on homology with its neighbors.

So, first, significance: when is any alignment worth anything biologically?

An old statistics trick, Monte Carlo simulations:

Z score = [ (actual score) - (mean of randomized scores) ] / (standard deviation of randomized score distribution)
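A minimal Python sketch of the Monte Carlo idea follows. It is illustrative only: the scoring function is a deliberately crude count of identical positions rather than a real alignment score, the population standard deviation is an arbitrary choice, and all function names are assumptions.

```python
import random
import statistics

def identity_score(a, b):
    """Crude stand-in for an alignment score: count identical positions."""
    return sum(x == y for x, y in zip(a, b))

def z_score(actual, randomized):
    """(actual - mean of randomized scores) / standard deviation of randomized scores."""
    mean = statistics.mean(randomized)
    sd = statistics.pstdev(randomized)  # population SD; an assumption of this sketch
    if sd == 0:
        raise ValueError("randomized scores have no spread")
    return (actual - mean) / sd

def monte_carlo_z(seq1, seq2, shuffles=100, seed=0, score=identity_score):
    """Shuffle seq2 repeatedly, rescore against seq1, and return the Z score."""
    rng = random.Random(seed)
    actual = score(seq1, seq2)
    letters = list(seq2)
    randomized = []
    for _ in range(shuffles):
        rng.shuffle(letters)  # keeps composition, destroys order
        randomized.append(score(seq1, "".join(letters)))
    return z_score(actual, randomized)
```

Shuffling preserves each sequence's composition, so the randomized scores estimate what that composition alone would produce by chance.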

Independent of all that, what is a ‘good’ alignment?

The Normal distribution

Many Z scores measure the distance from the mean using this simplistic Monte Carlo model assuming a Gaussian distribution, a.k.a. the Normal distribution (http://mathworld.wolfram.com/NormalDistribution.html), in spite of the fact that ‘sequence-space’ actually follows what is known as the ‘Extreme Value distribution.’

However, the Monte Carlo method does approximate significance estimates fairly well.

< 20 650 0:==
22 0 0:
24 3 0:=
26 22 8:*
28 98 87:*
30 289 528:*
32 1714 2042:===*
34 5585 5539:=========*
36 12495 11375:==================*==
38 21957 18799:===============================*=====
40 28875 26223:===========================================*====
42 34153 32054:=====================================================*===
44 35427 35359:==========================================================*
46 36219 36014:===========================================================*
48 33699 34479:======================================================== *
50 30727 31462:=================================================== *
52 27288 27661:=============================================*
54 22538 23627:====================================== *
56 18055 19736:============================== *
58 14617 16203:========================= *
60 12595 13125:=====================*
62 10563 10522:=================*
64 8626 8368:=============*=
66 6426 6614:==========*
68 4770 5203:========*
70 4017 4077:======*
72 2920 3186:=====*
74 2448 2484:====*
76 1696 1933:===*
78 1178 1503:==*
80 935 1167:=*
82 722 893:=*
84 454 707:=*
86 438 547:*
88 322 423:*
90 257 328:*
92 175 253:*
94 210 196:*
96 102 152:*
98 63 117:*
100 58 91:*
102 40 70:*
104 30 54:*
106 17 42:*
108 14 33:*
110 14 25:*
112 12 20:*
114 9 15:*
116 6 12:*
118 8 9:*
>120 1030 7:*=

Based on this known statistical distribution, and robust statistical methodology, a realistic Expectation function, the E Value, can be calculated from database searches.

The ‘take-home’ message is . . .

‘Sequence-space’ (Huh, what’s that?) actually follows the ‘Extreme Value distribution’ (http://mathworld.wolfram.com/ExtremeValueDistribution.html).

The Expectation Value!

The higher the E value is, the more probable that the observed match is due to chance in a search of the same size database, and the lower its Z score will be, i.e. the match is NOT significant. Therefore, the smaller the E value, i.e. the closer it is to zero, the more significant the match is and the higher its Z score will be! The E value is the number that really matters. In other words, in order to assess whether a given alignment constitutes evidence for homology, it helps to know how strong an alignment can be expected from chance alone.
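One widely used way to turn a score into an E value is the Karlin-Altschul form used by BLAST, E = K x m x n x e^(-lambda x S), where m and n are the query and database lengths. The Python sketch below only illustrates that formula: the lambda and K values shown in the usage note are placeholders, not parameters of any real scoring system, and other programs (the Fast family, for instance) estimate E values differently.

```python
import math

def expect_value(raw_score, query_len, db_len, lam, k):
    """Karlin-Altschul style expectation: E = K * m * n * exp(-lambda * S).

    lam and k depend on the scoring matrix and gap penalties; the caller
    must supply them (any values used with this sketch are hypothetical).
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)
```

For example, `expect_value(60, 300, 80_000_000, lam=0.27, k=0.04)` and `expect_value(90, 300, 80_000_000, lam=0.27, k=0.04)` use the same made-up parameters; the higher score gives the much smaller E value, and doubling the database size doubles E.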

Rules of thumb for a protein search

The Z score represents the number of standard deviations some particular alignment is from a distribution of random alignments (often the Normal distribution).

They very roughly correspond to the listed E Values (based on the Extreme Value distribution) for a typical protein sequence similarity search through a database with ~250,000 protein entries.

On to the searches

How can you search the databases for similar sequences, if pairwise alignments take N² time?!

Significance and heuristics . . .

Database searching programs use the two concepts of dynamic programming and substitution scoring matrices; however, dynamic programming takes far too long when used against most sequence databases with a ‘normal’ computer. Remember how big the databases are!

Therefore, the programs use tricks to make things happen faster. These tricks fall into two main categories, that of hashing, and that of approximation.

Corn beef hash? Huh . . .

Hashing is the process of breaking your sequence into small ‘words’ or ‘k-tuples’ (think all chopped up, just like corn beef hash) of a set size and creating a ‘look-up’ table with those words keyed to position numbers. Computers can deal with numbers way faster than they can deal with strings of letters, and this preprocessing step happens very quickly.

Then when any of the word positions match part of an entry in the database, that match, the ‘offset,’ is saved. In general, hashing reduces the complexity of the search problem from N² for dynamic programming to N, the length of all the sequences in the database.
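A small Python sketch of the look-up table idea is shown here. It is a toy under stated assumptions: the function names are invented, words must match exactly, and the 'offset' is taken to be the diagonal (subject position minus query position).

```python
from collections import defaultdict

def build_word_table(seq, k):
    """Map every k-letter word in seq to the list of positions where it starts."""
    table = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        table[seq[pos:pos + k]].append(pos)
    return dict(table)

def word_hit_offsets(query, subject, k):
    """Count exact word hits per diagonal offset (subject pos - query pos)."""
    table = build_word_table(query, k)
    hits = defaultdict(int)
    # One pass over the subject: each word is a single dictionary look-up.
    for s_pos in range(len(subject) - k + 1):
        for q_pos in table.get(subject[s_pos:s_pos + k], []):
            hits[s_pos - q_pos] += 1
    return dict(hits)
```

Several hits piling up on the same offset suggest a run of similarity along one diagonal, which is the kind of region a search program would go on to examine more closely.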

OK. Heuristics . . . What’s that?

Approximation techniques are collectively known as ‘heuristics.’ Webster’s defines heuristic as “serving to guide, discover, or reveal; . . . but unproved or incapable of proof.”

In database similarity searching techniques the heuristic usually restricts the necessary search space by calculating some sort of a statistic that allows the program to decide whether further scrutiny of a particular match should be pursued. This statistic may miss things depending on the parameters set; that’s what makes it heuristic. ‘Worthwhile’ results at the end are compiled and the longest alignment within the program’s restrictions is created.

The exact implementation varies between the different programs, but the basic idea follows in most all of them.

Two predominant versions exist: BLAST and Fast

Both return local alignments, and are not a single program, but rather a family of programs with implementations designed to compare a sequence to a database every which way.

These include:

1) a DNA sequence against a DNA database (not recommended unless forced to do so because you are dealing with a non-translated region of the genome; DNA is just too darn noisy, only identity & four bases!),
2) a translated (where the translation is done ‘on-the-fly’ in all six frames) version of a DNA sequence against a translated (‘on-the-fly’ six-frame) version of the DNA database (not available in the Fast package),
3) a translated (‘on-the-fly’ six-frame) version of a DNA sequence against a protein database,
4) a protein sequence against a translated (‘on-the-fly’ six-frame) version of a DNA database,
5) or a protein sequence against a protein database.

Translated comparisons allow penalty-free frame shifts.
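The ‘on-the-fly’ six-frame translation mentioned in the list above can be sketched in Python as follows. This is an illustration only: it assumes the standard genetic code, uppercase-able ACGT input, and invented function names; it reports any codon containing other symbols as X and ignores trailing partial codons.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Return translations of forward frames 1-3, then reverse-complement frames 1-3."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:])
            for strand in (forward, reverse)
            for offset in range(3)]
```

A translated search then compares each of the six peptide strings as if it were a protein sequence.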

The BLAST and Fast programs: some generalities

BLAST (Basic Local Alignment Search Tool), developed at NCBI.

1) Normally NOT a good idea to use for DNA against DNA searches w/o translation (not optimized);
2) Pre-filters repeat and “low complexity” sequence regions;
4) Can find more than one region of gapped similarity;
5) Very fast heuristic and parallel implementation;
6) Restricted to precompiled, specially formatted databases;

FastA and its family of relatives, developed by Bill Pearson at the University of Virginia.

1) Works well for DNA against DNA searches (within limits of possible sensitivity);
2) Can find only one gapped region of similarity;
3) Relatively slow, should often be run in the background;
4) Does not require specially prepared, preformatted databases.

The algorithms, very briefly

BLAST:

1) Two word hits on the same diagonal above some similarity threshold trigger ungapped extension until the score isn’t improved enough above another threshold: the HSP.
2) Initiate gapped extensions using dynamic programming for those HSPs above a third threshold up to the point where the score starts to drop below a fourth threshold: yields alignment.

Fast:

1) Find all ungapped exact word hits; maximize the ten best continuous regions’ scores: init1.
2) Combine non-overlapping init regions on different diagonals: initn.
3) Use dynamic programming ‘in a band’ for all regions with initn scores better than some threshold: opt score.
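The ungapped extension step can be pictured with a short Python sketch. It is a simplification built on assumptions: only rightward extension is shown, identities score +1 and mismatches -1 instead of using a substitution matrix, and the `x_drop` cutoff and function name are hypothetical stand-ins for the thresholds mentioned above.

```python
def extend_ungapped(query, subject, q_start, s_start, word_len,
                    x_drop=3, match=1, mismatch=-1):
    """Extend an exact word hit to the right along its diagonal.

    Stops when the running score falls x_drop or more below the best score
    seen so far. Returns (best_score, length_of_best_segment).
    """
    score = best = word_len * match
    length = best_len = word_len
    i, j = q_start + word_len, s_start + word_len
    while i < len(query) and j < len(subject):
        score += match if query[i] == subject[j] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= x_drop:
            break  # not improving enough: give up on this diagonal
        i += 1
        j += 1
    return best, best_len
```

A real program would extend in both directions and keep only segments whose best score clears a further threshold before attempting any gapped alignment.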

What’s the deal with DNA versus protein for searches and alignment?

All database similarity searching and sequence alignment, regardless of the algorithm used, is far more sensitive at the amino acid level than at the DNA level. This is because proteins have twenty match criteria versus DNA’s four, and those four DNA bases can generally only be identical, not similar, to each other; and many DNA base changes (especially third position changes) do not change the encoded protein.

All of these factors drastically increase the ‘noise’ level of a DNA against DNA search, and give protein searches a much greater ‘look-back’ time, at least doubling it.

Therefore, whenever dealing with coding sequence, it is always prudent to search at the protein level!

More data yields stronger analyses, as long as it is done carefully!

Mosaic ideas and evolutionary ‘importance.’

Applications:

Probe, primer, and motif design;
Graphical illustrations;
Comparative ‘homology’ inference;
Molecular evolutionary analysis.

All right, how do you do it?

What can we do with the significant results of database searching? Multiple sequence alignment & analysis: why even bother?

Dynamic programming’s complexity increases exponentially with the number of sequences being compared:

N-dimensional matrix . . . .

complexity = [sequence length]^[number of sequences]

i.e. complexity is O(e^n)
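A quick back-of-the-envelope calculation in Python shows how fast that matrix grows. The sequence length of 300 is just an example value, and the function name is made up for this sketch.

```python
def dp_cells(sequence_length, number_of_sequences):
    """Cells in a full N-dimensional dynamic programming matrix."""
    return sequence_length ** number_of_sequences

# Example: proteins of about 300 residues (an arbitrary choice).
for n in (2, 3, 5):
    print(n, "sequences:", dp_cells(300, n), "cells")
```

Two sequences need 90,000 cells, but five already need about 2.4 trillion, which is why full multidimensional dynamic programming is rarely attempted.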

Therefore, the most common implementation, pairwise, progressive dynamic programming, restricts the solution to the neighborhood of only two sequences at a time.

All sequences are compared, pairwise, and then each is aligned to its most similar partner or group of partners. Each group of partners is then aligned to finish the complete multiple sequence alignment.

Multiple Sequence Dynamic Programming

Web resources for pairwise, progressive multiple alignment in the USA include the Baylor College of Medicine’s Search Launcher:

http://searchlauncher.bcm.tmc.edu/

However, problems with large datasets and huge multiple alignments make doing multiple sequence alignment on the Web impractical after your dataset has reached a certain size. You’ll know it when you’re there!

So, what else is available?

Stand-alone ClustalW is available for all operating systems; its graphical user interface, ClustalX, makes running it very easy.

And dedicated biocomputing server suites, like the GCG Wisconsin Package, which includes PileUp and ClustalW and the SeqLab graphical user interface, are another powerful solution.

Furthermore, newer software such as TCoffee, MUSCLE, ProbCons, POA, MAFFT, etc. add various tweaks and tricks to make the entire process more accurate and/or faster.

Reliability and the Comparative Approach

explicit homologous correspondence;
manual adjustments based on knowledge, especially structural, regulatory, and functional sites.

Therefore, editors like SeqLab and structure based databases like the Ribosomal Database Project:

http://rdp.cme.msu.edu/index.jsp

Structural & Functional correspondence in the Wisconsin Package’s SeqLab

As with pairwise methods, work with proteins! If at all possible

Twenty match symbols versus four, plus similarity! Way better signal to noise.

Also guarantees no indels are placed within codons. So translate, then align.

Nucleotide sequences will only reliably align if they are very similar to each other. And they will require extensive hand editing and careful consideration.
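The ‘translate, then align’ advice has a useful corollary: once the proteins are aligned, the original coding DNA can be threaded back onto the protein alignment so that every gap spans a whole codon. The Python sketch below illustrates this under simple assumptions (the DNA is exactly three bases per aligned residue, with no stop codon or untranslated flank); the function name is invented.

```python
def thread_codons(dna, aligned_protein, gap_chars="-."):
    """Lay coding DNA onto one gapped row of a protein alignment.

    Each residue consumes three bases; each protein gap becomes '---',
    so no indel can fall inside a codon.
    """
    out = []
    pos = 0
    for residue in aligned_protein:
        if residue in gap_chars:
            out.append("---")
        else:
            out.append(dna[pos:pos + 3])
            pos += 3
    if pos != len(dna):
        raise ValueError("DNA length does not match the ungapped protein")
    return "".join(out)
```

For instance, `thread_codons("ATGGCCAAA", "M-AK")` gives `ATG---GCCAAA`.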

Beware of aligning apples and oranges [and grapefruit]!

Receptor versus activator, and so on ad nauseam;
paralogue versus orthologue;
genomic versus cDNA;
mature versus precursor.

Mask out uncertain areas

Complications:

Order dependence. Not that big of a deal.

Substitution matrices and gap penalties. A very big deal!

Regional ‘realignment’ becomes incredibly important, especially with sequences that have areas of high and low similarity

Homology inference is especially powerful for finding genes and functional and regulatory domains within them!

The information within a multiple sequence alignment can dramatically point to evolutionarily constrained elements in the sequences. Furthermore, often functions can experimentally be ascribed to them. Therefore, we can search for those elements in unknown sequences to attempt to identify the unknown’s function. How does this work?

The consensus and motifs

Conserved regions in alignments can be visualized with a sliding window approach and appear as peaks. Refer to the peak seen here in a SRY/SOX alignment.

HMG box

The HMG box DNA binding domain of SRY/SOX
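A sliding window over per-column conservation can be sketched in Python like this. The conservation measure used here (fraction of rows sharing the most common residue) is one simple choice among many and is an assumption of this sketch, as are the function names.

```python
from collections import Counter

def column_conservation(alignment, gap_chars="-."):
    """For each column, the fraction of rows that share the most common residue."""
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c not in gap_chars]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / len(column))
    return scores

def sliding_window(values, window):
    """Average the values over every run of `window` consecutive columns."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]
```

Plotting the windowed averages against alignment position gives the kind of curve in which a conserved domain stands out as a peak.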

A consensus isn’t necessarily the biologically “correct” combination.

A simple consensus throws much information away!

Therefore, motif definition.

consensus KRPMNAFMVYXKXXRRKIXXXXPXXHNXEISKRLGXXWKXLXXXEKXPYIXEAXR
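How much a simple consensus discards is easy to see in code. This Python sketch writes X wherever a column falls below an agreement threshold; the threshold value and the function name are assumptions, since the rule used to build the consensus line above is not stated.

```python
from collections import Counter

def consensus(alignment, threshold=1.0, unknown="X", gap_chars="-."):
    """Most common symbol per column, or `unknown` if agreement is below threshold."""
    out = []
    for column in zip(*alignment):
        symbol, count = Counter(column).most_common(1)[0]
        agreed = symbol not in gap_chars and count / len(column) >= threshold
        out.append(symbol if agreed else unknown)
    return "".join(out)
```

With `["KRPM", "KRAM", "KRPM"]` the strict default returns `KRXM`: the fact that two of three sequences carry P at the third position is lost entirely.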

PROSITE, a simple fast approach

The trick is to define a motif such that it minimizes false positives and maximizes true positives; it needs to be just discriminatory enough. Development is largely empirical; a pattern is made, tested against the database, then refined, over and over, although when experimental evidence is available, it is always incorporated. This is known as motif definition and Amos Bairoch has done it a bunch!

His database of catalogued structural, regulatory, and enzymatic consensus patterns or ‘signatures’ is the PROSITE Database of protein families and domains and contains 1,510 documentation entries that describe 2,877 different patterns, rules, and profiles/matrices (Release 20.77, Feb. 26, 2008). Pattern descriptions for these characteristic local sequence areas are variously and confusingly known as motifs, templates, signatures, patterns, and even fingerprints.

The HMG box

Defined as: [FI]-S-[KR]-K-C-x-[EK]-R-W-K-T-M.

A one-dimensional ‘regular-expression’ of a conserved site. Not necessarily biologically meaningful though, and motifs are limited in their ability to discriminate a residue’s ‘importance.’
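Because a PROSITE pattern is essentially a regular expression, it can be translated mechanically. The following is a minimal sketch, not PROSITE's own scanner: the function name `prosite_to_regex` and the test sequence are made up for illustration, and only a subset of the syntax is handled (alternatives in brackets, exclusions in braces, `x` wildcards, and repeat counts; the `<` and `>` terminus anchors are ignored).

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a subset of PROSITE pattern syntax into a Python regex."""
    pattern = pattern.rstrip(".")          # patterns end with a period
    parts = []
    for element in pattern.split("-"):     # elements are separated by hyphens
        # Split off an optional repeat count such as x(2) or x(2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                      # x means any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # {P} means anything but P
        # [FI] style alternatives and single letters pass through unchanged.
        if lo and hi:
            core += "{%s,%s}" % (lo, hi)
        elif lo:
            core += "{%s}" % lo
        parts.append(core)
    return "".join(parts)

HMG_BOX = "[FI]-S-[KR]-K-C-x-[EK]-R-W-K-T-M."
regex = prosite_to_regex(HMG_BOX)

# Hypothetical test sequence containing one occurrence of the motif.
hit = re.search(regex, "AAFSKKCAERWKTMGG")
print(regex, hit.group(), hit.start())
```

Every position either matches or it does not, which is exactly the limitation noted above: the regular expression cannot say that one residue matters more than another.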



Enter two-dimensional techniques for homology searching: the PSSM (position-specific scoring matrix) and the ‘profile’ algorithms, including PSI-BLAST, MEME, and HMMer . . .

To do that we need to include ‘all’ of the information from the multiple sequence alignment, or of some region within the alignment, in a description that doesn’t throw anything away!

How do these work?

And to extend the 2D PSSM concept even further . . .

Michael Gribskov envisioned special weight matrices in which conserved areas of the alignment receive the most importance, variable regions hardly matter, and gaps are variably weighted depending on where they are! These are often called “profiles.”

A simple PSSM describing the TATA “Hogness” box

A small piece of a profile:

Cons    A    B    C    D    E    F    G    H    I    K    L    M    N    P    Q    R    S    T    V    W    X    Y    Z    *  Gap  Len
S      45   -3  -41   -8   -6  -84   -4  -42  -78   -7  -78  -43   38  -43   -6  -38  135   40  -71 -123  -28  -81   -6 -163  100  100
K     -49   -7 -146  -52   45 -145 -102  -50 -139  223  -92  -43   -5  -53   55   91   -4  -49  -92 -146  -44  -95   48 -199  100  100
R     -28  -41  -68  -57  -16  -61  -77  -38  -31   31  -20    2  -25  -56    8   37  -22  -14  -26  -80  -33  -49  -10 -123  100  100
L     -66 -279  -69 -279 -209   -2 -278 -210  140 -141  275  137 -210 -209 -141 -141 -138  -69   71 -142 -108  -71 -209 -281  100  100
G       6  -63 -185  -62 -118 -187  360 -124 -246 -122 -246 -185   -4 -123 -121 -123    2 -122 -183 -129 -108 -186 -119 -252  100  100
K       2  -14  -75  -19   37  -76  -47  -23  -72   48  -58  -36  -13  -39   20   27    8  -21  -51  -87  -27  -53   30 -123  100  100
R     -22  -39  -66  -41    2  -55  -70  -33  -34    7  -14   14  -27  -54   14   20  -17  -25  -29  -74  -31  -48    4 -120  100  100
W    -300 -400 -200 -400 -300  100 -200 -200 -300 -300 -200 -100 -400 -400 -200 -300 -300 -200 -300 1100 -188  200 -300 -400  100  100
K     -42   14 -105  -25   24 -100  -58    4 -106  116  -79  -43   38  -47   30   59    4  -31  -82 -109  -31  -59   25 -142  100  100
L      -6  -41  -47  -48  -30  -46  -54  -49  -15  -25    7    4  -16  -59  -21  -33   -7  -12  -14  -80  -34  -49  -30 -122  100  100

The greatest conservation is the invariant tryptophan. It’s the only residue absolutely conserved, so it gets the highest score, 1100! The -400 scores are from substituting that tryptophan with an aspartate, asparagine, or proline. In the BLOSUM series tryptophan has the highest identity score of any residue, and the most negative substitution scores include those from tryptophan to aspartate, asparagine, and proline. Those substitution scores, times the highest conservation in the region, equal the most negative scores in the profile.
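That arithmetic can be checked against the W row of the profile excerpt. The sketch below assumes the BLOSUM62 matrix (the slide names only "the BLOSUM series") and a conservation weight of 100 for the invariant column; both are assumptions chosen because they reproduce the numbers shown, not values the source states.

```python
# BLOSUM62 scores for tryptophan (W) against a few residues.
# Assumption: BLOSUM62 is the matrix in use; the source does not name one.
BLOSUM62_W = {"W": 11, "D": -4, "N": -4, "P": -4, "F": 1, "Y": 2}

# The same columns taken from the W row of the profile excerpt.
PROFILE_W_ROW = {"W": 1100, "D": -400, "N": -400, "P": -400, "F": 100, "Y": 200}

# Assumed weight for an absolutely conserved alignment column.
CONSERVATION = 100

# Substitution score times conservation gives the profile entry.
scaled = {aa: score * CONSERVATION for aa, score in BLOSUM62_W.items()}

for aa in sorted(scaled):
    print(aa, scaled[aa], PROFILE_W_ROW[aa])
```

For a column where several residues are observed, the profile entry is a blend over all of them, which is why the other rows of the excerpt do not reduce to a single matrix row.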

The basic idea is to tabulate how often every possible character occurs at each position, scale conserved positions up, variable positions down, and store the whole thing in a matrix. With protein data it’ll be twenty residues wide, with nucleic acids four bases wide, by the length of your pattern either way.
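The tabulation step can be sketched in a few lines. This is a simple log-odds weight matrix with pseudocounts and a uniform background, swapped in for illustration; it is not Gribskov's profile method, and the toy sites, the function names `build_pssm` and `score`, and the pseudocount value are all hypothetical.

```python
import math
from collections import Counter

def build_pssm(alignment, alphabet="ACGT", pseudocount=1.0):
    """Return one {symbol: log-odds score} dict per alignment column."""
    background = 1.0 / len(alphabet)          # assumed uniform background
    pssm = []
    for column in zip(*alignment):            # walk the alignment column-wise
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        scores = {}
        for symbol in alphabet:
            freq = (counts[symbol] + pseudocount) / total
            scores[symbol] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

def score(pssm, window):
    """Sum the per-position scores of a window the same length as the PSSM."""
    return sum(col[ch] for col, ch in zip(pssm, window))

# Hypothetical toy alignment of TATA-like sites (not data from the talk).
sites = ["TATAAA", "TATATA", "TATAAG", "CATAAA"]
pssm = build_pssm(sites)

print(round(score(pssm, "TATAAA"), 2), round(score(pssm, "GCGCGC"), 2))
```

The matrix is four symbols wide by six positions long here; passing a twenty-letter amino acid alphabet gives the protein case. Fully conserved columns end up with the largest positive score for their residue, so they dominate the total.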

Some profile variations

As powerful as ‘traditional’ Gribskov style profiles are, they require a lot of time and skill to prepare and validate, and they are heuristics based. Excess subjectivity and a lack of formal statistical rigor contribute as drawbacks. Sean Eddy developed the HMMer package, which uses Hidden Markov modeling, with a formal probabilistic basis and consistent gap insertion theory, to build and manipulate HMMer profiles and profile databases, to search sequences against HMMer profile databases and vice versa, and to easily create multiple sequence alignments using HMMer profiles as a ‘seed.’



Profile variations, continued

Bailey and Elkan’s MEME (Multiple EM for Motif Elicitation), an expectation maximization method, uses Bayesian probabilities and unsupervised learning to find, de novo, unknown conserved motifs among a group of unaligned, ungapped sequences. The motifs do not have to be in congruent order among the different sequences; i.e. it has the power to discover ‘unalignable’ motifs between sequences. This characteristic differentiates MEME from the other profile building techniques. It can be particularly effective in discovering regulatory elements in common between co-regulated genes.



If large datasets become intractable for analysis on the Web, what other resources are available?

Desktop software solutions: public domain programs are available, but . . . complicated to install, configure, and maintain. User must be pretty computer savvy. So, commercial software packages are available, e.g. MacVector, DS Gene, DNAsis, DNAStar, etc., but . . . license hassles, big expense per machine, and Internet and/or CD database access all complicate matters!

Therefore, UNIX server-based solutions

Public domain solutions also exist, but now a very cooperative systems manager needs to maintain everything for users, so, commercial products, e.g. the Accelrys GCG Wisconsin Package and the SeqLab Graphical User Interface, simplify matters for administrators and users. One format, one ‘look-and-feel.’ One license fee for an entire institution and very fast, convenient database access on local server disks. Connections from any networked terminal or workstation anywhere!

Operating system: UNIX command line operation hassles; communications software (telnet, ssh, and terminal emulation); X graphics; file transfer (ftp, and scp/sftp); and editors (vi, emacs, pico/nano, or desktop word processing followed by file transfer [save as "text only!"]). See my supplement pdf file.

The Genetics Computer Group: the Accelrys Wisconsin Package for Sequence Analysis

GCG began in 1982 in Oliver Smithies’ Genetics Dept. lab at the University of Wisconsin, Madison. Starting in 1990 it became a private company, which was acquired by the Oxford Molecular Group, U.K., in 1997, and then by Pharmacopeia Inc., U.S.A., in 2000. In 2004 Accelrys, San Diego, California, left Pharmacopeia to become an independent entity. Tragically Accelrys has decided to ‘retire’ the product and concentrate more on ‘big-buck’ drug-design software.

The suite contains around 150 programs designed to work in a “toolbox” fashion. Several simple programs used in succession can lead to very sophisticated results.

Also ‘internal compatibility,’ i.e. once you learn to use one program, all programs can be run similarly, and the output from many programs can be used as input for other programs.

To answer the always perplexing GCG question: “What sequence(s)? . . . .”

Specifying sequences, GCG style; in order of increasing power and complexity:

The sequence is in a local GCG format single sequence file in your UNIX account. (GCG Reformat and SeqConv+ programs)

The sequence is in a local GCG database, in which case you ‘point’ to it by using any of the GCG database logical names. A colon, “:,” always sets the logical name apart from either an accession number or a proper identifier name or a wildcard expression, and they are case insensitive.

The sequence is in a GCG format multiple sequence file, either an MSF (multiple sequence format) file or an RSF (rich sequence format) file. To specify sequences contained in a GCG multiple sequence file, supply the file name followed by a pair of braces, “{},” containing the sequence specification, e.g. a wildcard: {*}.

Finally, the most powerful method of specifying sequences is in a GCG “list” file. It is merely a list of other sequence specifications and can even contain other list files within it. The convention to use a GCG list file in a program is to precede it with an at sign, “@.” Furthermore, you can supply attribute information within list files to specify something special about the sequence such as begin and end constraints.
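The four ways of specifying a sequence differ only in a few marker characters, so they can be told apart mechanically. The sketch below is an illustration of that syntax, not part of GCG: the function name `classify_spec` and the returned dictionary layout are hypothetical, and it assumes plain file names contain no colon.

```python
import re

def classify_spec(spec: str) -> dict:
    """Classify a GCG-style sequence specification by its marker characters."""
    spec = spec.strip()
    if spec.startswith("@"):
        # A leading at sign marks a list file.
        return {"kind": "list", "file": spec[1:]}
    m = re.fullmatch(r"(.+)\{(.*)\}", spec)
    if m:
        # Braces select entries inside an MSF or RSF multiple sequence file.
        return {"kind": "multiple", "file": m.group(1), "select": m.group(2)}
    if ":" in spec:
        # A colon separates a database logical name from the entry specifier.
        db, entry = spec.split(":", 1)
        return {"kind": "database", "db": db.lower(), "entry": entry}
    # Anything else is treated as a single sequence file.
    return {"kind": "file", "file": spec}

for example in ["example.seq", "gb:*", "Ef1a-Tu.msf{*}", "@another.list"]:
    print(example, "->", classify_spec(example)["kind"])
```

Lower-casing the logical name reflects the case insensitivity noted above; the entry part is left as typed.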

!!NA_SEQUENCE 1.0

This is a small example of GCG single sequence format.
Always put some documentation on top, so in the future
you can figure out what it is you're dealing with! The
line with the two periods is converted to the checksum line.

example.seq  Length: 77  July 21, 1999 09:30  Type: N  Check: 4099  ..

       1  ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA
      51  GATTTAATAG CATGCGATCC CATGGGA

‘Clean’ GCG format single sequence file after ‘reformat’ (or the SeqConv+ program)
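The layout is simple enough to read with a short script: documentation, then a header line ending in two periods, then numbered sequence lines. The following is a rough sketch under that reading of the example above; `parse_gcg` is a hypothetical helper, not a GCG program, and it does not verify the checksum.

```python
import re

def parse_gcg(text: str) -> dict:
    """Split a GCG single sequence file into documentation, header, and sequence."""
    lines = text.splitlines()
    # The header ("checksum") line is the first line ending in two periods.
    for i, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            break
    else:
        raise ValueError("no '..' divider line found")
    fields = dict(re.findall(r"(Length|Type|Check):\s*(\S+)", lines[i]))
    # Everything after the divider is sequence; drop position numbers and spaces.
    sequence = re.sub(r"[\d\s]", "", "\n".join(lines[i + 1:]))
    return {
        "documentation": "\n".join(lines[:i]).strip(),
        "length": int(fields["Length"]),
        "type": fields["Type"],
        "check": int(fields["Check"]),
        "sequence": sequence,
    }

EXAMPLE = """!!NA_SEQUENCE 1.0
This is a small example of GCG single sequence format.

example.seq  Length: 77  July 21, 1999 09:30  Type: N  Check: 4099  ..

       1  ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA
      51  GATTTAATAG CATGCGATCC CATGGGA
"""

record = parse_gcg(EXAMPLE)
print(record["type"], record["length"], len(record["sequence"]))
```

Comparing the declared length with the number of residues actually read is a cheap sanity check before handing a file to another program.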

SeqLab’s Editor mode can also “Import” native GenBank format and ABI/LI-COR style binary trace files!

Logical terms for the Wisconsin Package

Sequence databases, nucleic acids:

GENBANKPLUS     all of GenBank plus EST, HTC & GSS subdivisions
GBP             all of GenBank plus EST, HTC & GSS subdivisions
GENBANK         all of GenBank except EST, HTC & GSS subdivisions
GB              all of GenBank except EST, HTC & GSS subdivisions
BA              GenBank bacterial subdivision
BACTERIAL       GenBank bacterial subdivision
EST             GenBank EST (Expressed Sequence Tags) subdivision
GSS             GenBank GSS (Genome Survey Sequences) subdivision
HTC             GenBank High Throughput cDNA
HTG             GenBank High Throughput Genomic
IN              GenBank invertebrate subdivision
INVERTEBRATE    GenBank invertebrate subdivision
OM              GenBank other mammalian subdivision
OTHERMAMM       GenBank other mammalian subdivision
OV              GenBank other vertebrate subdivision
OTHERVERT       GenBank other vertebrate subdivision
PAT             GenBank patent subdivision
PATENT          GenBank patent subdivision
PH              GenBank phage subdivision
PHAGE           GenBank phage subdivision
PL              GenBank plant subdivision
PLANT           GenBank plant subdivision
PR              GenBank primate subdivision
PRIMATE         GenBank primate subdivision
RO              GenBank rodent subdivision
RODENT          GenBank rodent subdivision
STS             GenBank (Sequence Tagged Sites) subdivision
SY              GenBank synthetic subdivision
SYNTHETIC       GenBank synthetic subdivision
TAGS            GenBank EST, HTC & GSS subdivisions
UN              GenBank unannotated subdivision
UNANNOTATED     GenBank unannotated subdivision
VI              GenBank viral subdivision
VIRAL           GenBank viral subdivision

Sequence databases, amino acids:

GENPEPT         GenBank CDS translations
GP              GenBank CDS translations
UNIPROT or UNI  all of Swiss-Prot and all of SPTrEMBL
SWISSPROTPLUS   all of Swiss-Prot and all of SPTrEMBL
SWP             all of Swiss-Prot and all of SPTrEMBL
UNISPROT        all of Swiss-Prot (fully annotated)
SWISSPROT       all of Swiss-Prot (fully annotated)
SWISS           all of Swiss-Prot (fully annotated)
SW              all of Swiss-Prot (fully annotated)
UNITREMBL       Swiss-Prot preliminary EMBL translations
SPTREMBL        Swiss-Prot preliminary EMBL translations
SPT             Swiss-Prot preliminary EMBL translations

These are easy: they make sense and you’ll have a vested interest. Just remember to use the colon/specifier syntax (e.g. gb:* for all of GenBank less Tags).

GCG MSF & RSF format

The trick is to not forget the braces and ‘wild card,’ e.g. filename{*}, when specifying!

!!RICH_SEQUENCE 1.0
..
{
name ef1a_giala
descrip PileUp of: @/users1/thompson/.seqlab-mendel/pileup_28.list
type PROTEIN
longname /users1/thompson/seqlab/EF1A_primitive.orig.msf{ef1a_giala}
sequence-ID Q08046
checksum 7342
offset 23
creation-date 07/11/2001 16:51:19
strand 1
comments ////////////////////////////////////////////////////////////

This is SeqLab’s native format

!!AA_MULTIPLE_ALIGNMENT 1.0

small.pfs.msf  MSF: 735  Type: P  July 20, 2001 14:53  Check: 6619 ..

 Name: a49171  Len: 425  Check:  537  Weight: 1.00
 Name: e70827  Len: 577  Check:   21  Weight: 1.00
 Name: g83052  Len: 718  Check: 9535  Weight: 1.00
 Name: f70556  Len: 534  Check: 3494  Weight: 1.00
 Name: t17237  Len: 229  Check: 9552  Weight: 1.00
 Name: s65758  Len: 735  Check:  111  Weight: 1.00
 Name: a46241  Len: 274  Check: 3514  Weight: 1.00

//
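The names that can go inside the braces are the ones on the `Name:` lines of the MSF header, so listing them is a handy first step. This is a small sketch assuming the header layout shown in the example; `msf_entries` and the shortened `HEADER` text are illustrative only.

```python
import re

# One Name line per sequence: name, length, checksum, and weight.
NAME_LINE = re.compile(
    r"Name:\s*(\S+)\s+Len:\s*(\d+)\s+Check:\s*(\d+)\s+Weight:\s*([\d.]+)"
)

def msf_entries(header_text):
    """Return the per-sequence records found in an MSF header."""
    return [
        {"name": n, "len": int(l), "check": int(c), "weight": float(w)}
        for n, l, c, w in NAME_LINE.findall(header_text)
    ]

# Shortened copy of the header above, with two of the seven Name lines.
HEADER = """small.pfs.msf  MSF: 735  Type: P  July 20, 2001 14:53  Check: 6619 ..

 Name: a49171  Len: 425  Check:  537  Weight: 1.00
 Name: s65758  Len: 735  Check:  111  Weight: 1.00

//
"""

entries = msf_entries(HEADER)
for entry in entries:
    # Build the brace-style specification for each sequence in the file.
    print("small.pfs.msf{%s}" % entry["name"], entry["len"])
```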

The List File Format

!!SEQUENCE_LIST 1.0

An example GCG list file of many elongation 1a and Tu factors follows. As with all GCG data files, two periods separate documentation from data. ..

my-special.pep begin:24 end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
/usr/accounts/test/another.rsf{ef1a_*}
@another.list

The ‘way’ SeqLab works! Remember the @ sign!
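A list file can be read the same way as the other GCG formats: skip the documentation up to the two periods, then take one specification per line with any attributes after it. The sketch below assumes attributes are whitespace-separated `key:value` tokens as in the `begin:24 end:134` example; `parse_list_file` is a hypothetical helper and does not follow nested list files.

```python
def parse_list_file(text):
    """Return the sequence specifications and attributes in a GCG list file."""
    lines = text.splitlines()
    # Data start after the first line that ends with the ".." divider.
    start = next(i for i, ln in enumerate(lines) if ln.rstrip().endswith("..")) + 1
    entries = []
    for line in lines[start:]:
        tokens = line.split()
        if not tokens:
            continue                      # skip blank lines
        # Tokens after the specification are key:value attributes.
        attrs = dict(tok.split(":", 1) for tok in tokens[1:] if ":" in tok)
        entries.append({"spec": tokens[0], "attributes": attrs})
    return entries

LIST_FILE = """!!SEQUENCE_LIST 1.0
An example GCG list file of many elongation 1a and Tu factors follows. ..

my-special.pep  begin:24  end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
/usr/accounts/test/another.rsf{ef1a_*}
@another.list
"""

entries = parse_list_file(LIST_FILE)
for entry in entries:
    print(entry["spec"], entry["attributes"])
```

Only the first token of each line is treated as the specification, so the colon inside `SwissProt:EfTu_Ecoli` is not mistaken for an attribute.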

SeqLab: GCG’s X-based GUI!

SeqLab is the merger of Steve Smith’s Genetic Data Environment and GCG’s Wisconsin Package Interface:

GDE + WPI = SeqLab

Requires an X Window System environment, either native on UNIX computers (including Linux; it is not installed by default on Mac OS X [v.10+] systems, but see Apple’s free X11 package or XDarwin) or through X-server emulation software on MS Windows computers.

Conclusions

There’s a bewildering assortment of bioinformatics databases and ways to access and manipulate the information within them. The key is to learn how to use the data and the methods in the most efficient manner! The better you understand the chemical, physical, and biological systems involved, the better your chance of success in analyzing them. Certain strategies are inherently more appropriate than others in certain circumstances. Making these types of subjective, discriminatory decisions is one of the most important ‘take-home’ messages I can offer!

Gunnar von Heijne in his old but incredibly readable treatise, Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit (1987), provides a very appropriate conclusion:

“Think about what you’re doing; use your knowledge of the molecular system involved to guide both your interpretation of results and your direction of inquiry; use as much information as possible; and do not blindly accept everything the computer offers you.”

“. . . if any lesson is to be drawn . . . it surely is that to be able to make a useful contribution one must first and foremost be a biologist, and only second a theoretician . . . . We have to develop better algorithms, we have to find ways to cope with the massive amounts of data, and above all we have to become better biologists. But that’s all it takes.”

Selected references

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990) Basic Local Alignment Search Tool. Journal of Molecular Biology 215, 403-410.

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a New Generation of Protein Database Search Programs. Nucleic Acids Research 25, 3389-3402.

Bailey, T.L. and Elkan, C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers, in Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, California, U.S.A. pp. 28-36.

Bairoch, A. (1992) PROSITE: A Dictionary of Sites and Patterns in Proteins. Nucleic Acids Research 20, 2013-2018.

Bucher, P. (1990) Weight Matrix Descriptions of Four Eukaryotic RNA Polymerase II Promoter Elements Derived from 502 Unrelated Promoter Sequences. Journal of Molecular Biology 212, 563-578; and Bucher, P. (1995) The Eukaryotic Promoter Database EPD. EMBL Nucleotide Sequence Data Library Release 42, Postfach 10.2209, D-6900 Heidelberg.

Eddy, S.R. (1996) Hidden Markov models. Current Opinion in Structural Biology 6, 361-365; and (1998) Profile hidden Markov models. Bioinformatics 14, 755-763.

Felsenstein, J. (1993) PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Dept. of Genetics, University of Washington, Seattle, Washington, U.S.A.

Feng, D.F. and Doolittle, R.F. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. Journal of Molecular Evolution 25, 351-360.

Genetics Computer Group (GCG) (Copyright 1982-2007) Program Manual for the Wisconsin Package, Version 11, Accelrys, Inc. San Diego, California, U.S.A.

Ghosh, D. (1990) A Relational Database of Transcription Factors. Nucleic Acids Research 18, 1749-1756.

Gilbert, D.G. (1993 [C release] and 1999 [Java release]) ReadSeq, public domain software distributed by the author. http://iubio.bio.indiana.edu/soft/molbio/readseq/ Bioinformatics Group, Biology Department, Indiana University, Bloomington, Indiana, U.S.A.

Gribskov, M. and Devereux, J., editors (1992) Sequence Analysis Primer. W.H. Freeman and Company, New York, New York, U.S.A.

Gribskov, M., McLachlan, M., and Eisenberg, D. (1987) Profile analysis: detection of distantly related proteins. Proceedings of the National Academy of Sciences U.S.A. 84, 4355-4358.

Hawley, D.K. and McClure, W.R. (1983) Compilation and Analysis of Escherichia coli promoter sequences. Nucleic Acids Research 11, 2237-2255.

Henikoff, S. and Henikoff, J.G. (1992) Amino Acid Substitution Matrices from Protein Blocks. Proceedings of the National Academy of Sciences U.S.A. 89, 10915-10919.

Kozak, M. (1984) Compilation and Analysis of Sequences Upstream from the Translational Start Site in Eukaryotic mRNAs. Nucleic Acids Research 12, 857-872.

McLauchen, J., Gaffrey, D., Whitton, J. and Clements, J. (1985) The Consensus Sequences YGTGTTYY Located Downstream from the AATAAA Signal is Required for Efficient Formation of mRNA 3’ Termini. Nucleic Acids Research 13, 1347-1368.

Needleman, S.B. and Wunsch, C.D. (1970) A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. Journal of Molecular Biology 48, 443-453.

Pearson, W.R. and Lipman, D.J. (1988) Improved Tools for Biological Sequence Analysis. Proceedings of the National Academy of Sciences U.S.A. 85, 2444-2448.

Proudfoot, N.J. and Brownlee, G.G. (1976) 3’ Noncoding Region in Eukaryotic Messenger RNA. Nature 263, 211-214.

Schwartz, R.M. and Dayhoff, M.O. (1979) Matrices for Detecting Distant Relationships. In Atlas of Protein Sequences and Structure (M.O. Dayhoff, editor) 5, Suppl. 3, 353-358, National Biomedical Research Foundation, Washington D.C., U.S.A.

Smith, T.F. and Waterman, M.S. (1981) Comparison of Bio-Sequences. Advances in Applied Mathematics 2, 482-489.

Stormo, G.D., Schneider, T.D. and Gold, L.M. (1982) Characterization of Translational Initiation Sites in E. coli. Nucleic Acids Research 10, 2971-2996.

Swofford, D.L. (1989-2007) PAUP* (Phylogenetic Analysis Using Parsimony and other methods) version 4.0+. Florida State University, Tallahassee, Florida, U.S.A. http://paup.csit.fsu.edu/ Distributed through Sinauer Associates, Inc., http://www.sinauer.com/ Sunderland, Massachusetts, U.S.A.

Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F. and Higgins, D.G. (1997) The ClustalX windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Research 24, 4876-4882.

Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22, 4673-4680.

von Heijne, G. (1987a) Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit. Academic Press, Inc., San Diego, CA.

von Heijne, G. (1987b) SIGPEP: A Sequence Database for Secretory Signal Peptides. Protein Sequences & Data Analysis 1, 41-42.

Wilbur, W.J. and Lipman, D.J. (1983) Rapid Similarity Searches of Nucleic Acid and Protein Data Banks. Proceedings of the National Academy of Sciences U.S.A. 80, 726-730.
