Special Topics BSC4933/5936: An Introduction to Bioinformatics. Florida State University, The Department of Biological Science, www.bio.fsu.edu

BioInformatics Databases

Jan 13, 2016


Suki

Special Topics BSC4933/5936: An Introduction to Bioinformatics. Florida State University, The Department of Biological Science, www.bio.fsu.edu. BioInformatics Databases. Steven M. Thompson, Florida State University School of Computational Science (SCS). So many databases? NCBI's Entrez.
Page 1: BioInformatics Databases

Special Topics BSC4933/5936:

An Introduction to Bioinformatics.

Florida State University

The Department of Biological Science

www.bio.fsu.edu

Page 2: BioInformatics Databases

BioInformatics Databases

Steven M. Thompson

Florida State University School of Computational Science (SCS)

Page 3: BioInformatics Databases

NCBI’s Entrez

Page 4: BioInformatics Databases

But first some of my definitions, lots of overlap:

Biocomputing and computational biology are synonyms and describe the use of computers and computational techniques to analyze any type of a biological system, from individual molecules to organisms to overall ecology.

Bioinformatics describes using computational techniques to access, analyze, and interpret the biological information in any type of biological database.

Sequence analysis is the study of molecular sequence data for the purpose of inferring the function, interactions, evolution, and perhaps structure of biological molecules.

Genomics analyzes the context of genes or complete genomes (the total DNA content of an organism) within the same and/or across different genomes.

Proteomics is the subdivision of genomics concerned with analyzing the complete protein complement, i.e. the proteome, of organisms, both within and between different organisms.

Page 5: BioInformatics Databases

One way to think about the field: The Reverse Biochemistry Analogy.

Biochemists no longer have to begin a research project by isolating and purifying massive amounts of a protein from its native organism in order to characterize a particular gene product. Rather, now scientists can amplify a section of some genome based on its similarity to other genomes, sequence that piece of DNA and, using sequence analysis tools, infer all sorts of functional, evolutionary, and, perhaps, structural insight into that stretch of DNA!

The computer and molecular databases are a necessary, integral part of this entire process.

Page 6: BioInformatics Databases

The exponential growth of molecular sequence databases & cpu power:

Year   BasePairs      Sequences
1982   680338         606
1983   2274029        2427
1984   3368765        4175
1985   5204420        5700
1986   9615371        9978
1987   15514776       14584
1988   23800000       20579
1989   34762585       28791
1990   49179285       39533
1991   71947426       55627
1992   101008486      78608
1993   157152442      143492
1994   217102462      215273
1995   384939485      555694
1996   651972984      1021211
1997   1160300687     1765847
1998   2008761784     2837897
1999   3841163011     4864570
2000   11101066288    10106023
2001   15849921438    14976310
2002   28507990166    22318883
2003   36553368485    30968418

http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

doubling time ~one year
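The doubling time quoted on this slide can be checked roughly against the table. The following is a minimal sketch, assuming steady exponential growth between the first (1982) and last (2003) rows; the function name is illustrative and not part of the original slides.

```python
import math

def doubling_time_years(start_year, start_bp, end_year, end_bp):
    """Average doubling time, assuming steady exponential growth."""
    doublings = math.log2(end_bp / start_bp)  # how many times the total doubled
    return (end_year - start_year) / doublings

# First and last rows of the GenBank base-pair table on this slide.
estimate = doubling_time_years(1982, 680338, 2003, 36553368485)
print(f"average doubling time: {estimate:.2f} years")
```

Averaged over the whole period the result comes out somewhat above one year; growth was faster in some years than others, so the slide's "~one year" is best read as a rule of thumb.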

Page 7: BioInformatics Databases

Database Growth (cont.):

The Human Genome Project and numerous smaller genome projects have kept the data coming at alarming rates. As of December 2004, almost 240 complete genomes are publicly available for analysis, not counting all the virus and viroid genomes available.

The International Human Genome Sequencing Consortium announced the completion of the "Working Draft" of the human genome in June 2000; independently that same month, the private company Celera Genomics announced that it had completed the first “Assembly” of the human genome. Both articles were published mid-February 2001 in the journals Science and Nature.

Page 8: BioInformatics Databases

Some neat stuff from the papers:

We, Homo sapiens, aren’t nearly as special as we had hoped we were. Of the 3.2 billion base pairs in our DNA:

Traditional, text-book estimates of the number of genes were often in the 100,000 range; turns out we’ve only got about twice as many as a fruit fly, between 25,000 and 35,000!

The protein coding region of the genome is only about 1% or so; a bunch of the remainder is ‘jumping’ ‘selfish DNA’, of which much may be involved in regulation and control.

Between 100 and 200 genes were transferred from an ancestral bacterial genome to an ancestral vertebrate genome! (Later shown to be not true by more extensive analyses, and to be due to gene loss rather than transfer.)

Page 9: BioInformatics Databases

What are sequence databases?

These databases are an organized way to store the tremendous amount of sequence information accumulating worldwide. Most have their own specific format. An ‘alphabet soup’ of three major database organizations around the world is responsible for maintaining most of this data. They largely ‘mirror’ one another and share accession codes, but NOT proper identifier names:

North America: the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM), at the National Institutes of Health (NIH), has GenBank & GenPept. Also Georgetown University’s National Biomedical Research Foundation (NBRF) Protein Identification Resource (PIR) & NRL_3D (Naval Research Lab sequences of known three-dimensional structure).

Europe: the European Molecular Biology Laboratory (EMBL), the European Bioinformatics Institute (EBI), and the Swiss Institute of Bioinformatics’ (SIB) Expert Protein Analysis System (ExPASy) all help maintain the EMBL Nucleotide Sequence Database, and the SWISS-PROT & TrEMBL amino acid sequence databases.

Asia: The National Institute of Genetics (NIG) supports the Center for Information Biology’s (CIB) DNA Data Bank of Japan (DDBJ).

Page 10: BioInformatics Databases

A little history:
Developments that affect software and the end user:

The first well recognized sequence database was Dr. Margaret Dayhoff’s hardbound Atlas of Protein Sequence and Structure, begun in the mid-sixties. DDBJ began in 1984, GenBank in 1982, and EMBL in 1980. They are all attempts at establishing an organized, reliable, comprehensive and openly available library of genetic sequences. Databases have long since outgrown a hardbound atlas. They have become huge and have evolved through many changes with many more yet to come.

Changes in format over the years are a major source of grief for software designers and program users. Each program needs to be able to recognize particular aspects of the sequence files; whenever they change it throws a wrench in the works. NCBI’s ASN.1 format and its Entrez interface attempt to circumvent some of these frustrations. However, database format is much debated as many bioinformaticians argue for relational or object-oriented standards. Unfortunately, until all biologists and computer scientists worldwide agree on one standard and all software is (re)written to that standard, neither of which is likely to happen very quickly, format issues will remain probably the most confusing and troubling aspect of working with primary sequence data.

Page 11: BioInformatics Databases

So what are these databases like?
Just what are primary sequences?

(Central Dogma: DNA -> RNA -> protein)

Primary refers to one dimension: all of the ‘symbol’ information written in sequential order necessary to specify a particular biological molecular entity, be it polypeptide or nucleotide.

The symbols are the one letter codes for all of the biological nitrogenous bases and amino acid residues and their ambiguity codes. Biological carbohydrates, lipids, and structural and functional information are not sequence data. Not even DNA translations in a DNA database!

However, much of this feature and bibliographic type information is available in the reference documentation sections associated with primary sequences in the databases.
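To make the idea of one-letter symbols and ambiguity codes concrete, here is a small sketch that checks a nucleotide string against the standard IUPAC nucleotide alphabet. The table holds the usual IUPAC codes; the function name is an illustrative assumption and not something from the slides.

```python
# IUPAC nucleotide codes: the bases plus the standard ambiguity symbols,
# each mapped to the set of bases it can stand for.
IUPAC_NUCLEOTIDE = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def invalid_symbols(sequence):
    """Return the set of characters that are not IUPAC nucleotide codes."""
    return {ch for ch in sequence.upper() if ch not in IUPAC_NUCLEOTIDE}

print(invalid_symbols("acgggtttgcRYn"))  # every symbol is a valid code
print(invalid_symbols("acgt-x"))         # '-' and 'X' are not nucleotide codes
```

Gap characters and protein-only letters are reported as invalid here on purpose; a real tool would decide per format whether gaps are allowed.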

Page 12: BioInformatics Databases

Content & Organization:

Sequence database installations are commonly a complex ASCII/Binary mix, usually not relational or Object Oriented (but proprietary ones often are). They’ll contain several very long text files each containing different types of information all related to particular sequences, such as all of the sequences themselves, versus all of the title lines, or all of the reference sections. Binary files often help ‘glue together’ all of these other files by providing indexing functions.

Software is usually required to successfully interact with these databases and access is most easily handled through various software packages and interfaces, either on the World Wide Web or otherwise.

Page 13: BioInformatics Databases

More organization stuff:

Nucleic Acid DB’s
    GenBank/EMBL/DDBJ
        all Taxonomic categories + HTC’s, HTG’s, & STS’s
    “Tags”
        EST’s
        GSS’s

Amino Acid DB’s
    SWISS-PROT
    TrEMBL
    PIR
        PIR1
        PIR2
        PIR3
        PIR4
    NRL_3D
    Genpept

Nucleic acid sequence databases (and TrEMBL) are split into subdivisions based on taxonomy (historical rankings: the Fungi warning!). PIR is split into subdivisions based on level of annotation. TrEMBL sequences are merged into SWISS-PROT as they receive increased levels of annotation.

Page 14: BioInformatics Databases

Parts and problems:

All sequence databases contain these elements:

Name: LOCUS, ENTRY, and ID are all unique identifiers.

Definition: A brief, one-line, textual sequence description.

Accession Number: A constant data identifier.

Source and taxonomy information.

Complete literature references.

Comments and keywords.

The all important FEATURE table!

A summary or checksum line.

The sequence itself.

But:

Each major database as well as each major suite of software tools that you are likely to use has its own distinct format requirements. This can be a huge problem and an enormous time sink, even with helpful tools such as Don Gilbert’s ReadSeq. Therefore, becoming familiar with some of the common formats is a big help. Look for key features of each type of entry:
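The advice to "look for key features of each type of entry" can be turned into a rough format sniffer. This is only a sketch under stated assumptions: it inspects just the first non-blank line, it uses the marker strings named on the format slides in this deck (LOCUS, ID, ENTRY, ">P1;", ">", "!!"), and the function name and returned labels are made up for illustration. It is not how ReadSeq or any other real tool works.

```python
def guess_format(text):
    """Guess a sequence file format from its first non-blank line."""
    first = next((ln for ln in text.splitlines() if ln.strip()), None)
    if first is None:
        return "unknown"
    if first.startswith("LOCUS"):
        return "GenBank/GenPept"
    if first.startswith("ID "):
        return "EMBL/SWISS-PROT"
    if first.startswith("ENTRY"):
        return "PIR CODATA"
    if first.startswith("!!"):
        return "GCG"
    # NBRF headers look like '>P1;NAME': a two-character type code, then ';'.
    if first.startswith(">") and len(first) > 3 and first[3] == ";":
        return "NBRF"
    if first.startswith(">"):
        return "FastA"
    return "unknown"

print(guess_format(">P1;EFHU1\npir1:efhu1 => EFHU1\n"))
print(guess_format(">EFHU1 PIR1 release 71.01\nMGKEKTHINI\n"))
```

The heuristic is deliberately crude: a FastA name that happens to have a semicolon as its fourth character would be misread as NBRF, which is exactly the kind of format ambiguity the slide warns about.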

Page 15: BioInformatics Databases

GenBank and GenPept format:

LOCUS       HSEF1AR      1506 bp    mRNA    linear   PRI 12-SEP-1993
DEFINITION  Human mRNA for elongation factor 1 alpha subunit (EF-1 alpha).
ACCESSION   X03558
VERSION     X03558.1  GI:31097
KEYWORDS    elongation factor; elongation factor 1.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1506)
  AUTHORS   Brands,J.H., Maassen,J.A., van Hemert,F.J., Amons,R. and Moller,W.
  TITLE     The primary structure of the alpha subunit of human elongation……
  JOURNAL   Eur. J. Biochem. 155 (1), 167-171 (1986)
  MEDLINE   86136120
FEATURES             Location/Qualifiers
     source          1..1506
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
     CDS             54..1442
                     /note="EF-1 alpha (aa 1-463)"
                     /codon_start=1
                     /protein_id="CAA27245.1"
                     /db_xref="GI:31098"
                     /db_xref="SWISS-PROT:P04720"
                     /translation="MGKEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDKRTIEKFEK
                     EAAEMGKGSFKYAWVLDKLKAERERGITIDISLWKFETSKYYVTIIDAPGHRDFIKNM
                     ……VTKSAQKAQKAK"
BASE COUNT      412 a    337 c    387 g    370 t
ORIGIN
        1 acgggtttgc cgccagaaca caggtgtcgt gaaaactacc cctaaaagcc aaaatgggaa
       61 aggaaaagac tcatatcaac attgtcgtca ttggacacgt agattcgggc aagtccacca……….
     1501 aactgt
//

Look for “LOCUS,” “FEATURES,” “ORIGIN,” the sequence itself, and then “//.”
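The landmarks listed on this slide (LOCUS, ORIGIN, the sequence, then //) are enough to pull a name and sequence out of a GenBank-style record. The sketch below assumes a single well-formed record held in a string; the function name is hypothetical and the example record is made up for illustration, not a real database entry.

```python
def read_genbank_sequence(record_text):
    """Return (locus_name, sequence) from one GenBank flat-file record."""
    locus_name = None
    in_sequence = False
    chunks = []
    for line in record_text.splitlines():
        if line.startswith("LOCUS"):
            locus_name = line.split()[1]      # name is the second field
        elif line.startswith("ORIGIN"):
            in_sequence = True                # sequence lines follow
        elif line.startswith("//"):
            break                             # end-of-record marker
        elif in_sequence:
            # Drop the leading position number, keep the blocks of bases.
            chunks.extend(line.split()[1:])
    return locus_name, "".join(chunks)

# Made-up record, for illustration only.
example = """LOCUS       DEMO1          16 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Made-up record for illustration only.
FEATURES             Location/Qualifiers
ORIGIN
        1 acgggtttgc cgccag
//
"""
name, seq = read_genbank_sequence(example)
print(name, len(seq), seq)
```

Everything between FEATURES and ORIGIN is skipped here; reading the feature table properly takes considerably more care than this sketch suggests.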

Page 16: BioInformatics Databases

EMBL and SWISS-PROT format:

ID   EF11_HUMAN     STANDARD;      PRT;   462 AA.
AC   P04720; P04719;
DT   13-AUG-1987 (Rel. 05, Created)……
DE   Elongation factor 1-alpha 1 (EF-1-alpha-1) (Elongation factor 1 A-1)
DE   (eEF1A-1) (Elongation factor Tu) (EF-Tu).
GN   EEF1A1 OR EEF1A OR EF1A.
OS   Homo sapiens (Human),
OS   Bos taurus (Bovine), and
OS   Oryctolagus cuniculus (Rabbit).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606, 9913, 9986;
RN   [1]
RP   SEQUENCE FROM N.A.
RC   SPECIES=Human;
RX   MEDLINE=86136120; PubMed=3512269;
RA   Brands J.H.G.M., Maassen J.A., van Hemert F.J., Amons R., Moeller W.;
RT   "The primary structure of the alpha subunit of human elongation …. -binding sites.";
RL   Eur. J. Biochem. 155:167-171(1986).……
CC   -!- FUNCTION: THIS PROTEIN PROMOTES THE GTP-DEPENDENT BINDING OF
CC       AMINOACYL-TRNA TO THE A-SITE OF RIBOSOMES DURING PROTEIN
CC       BIOSYNTHESIS.
CC   -!- SUBCELLULAR LOCATION: Cytoplasmic.
CC   -!- TISSUE SPECIFICITY: BRAIN, PLACENTA, LUNG, LIVER, KIDNEY,
CC       PANCREAS BUT BARELY DETECTABLE IN HEART AND SKELETAL MUSCLE.
CC   -!- SIMILARITY: BELONGS TO THE GTP-BINDING ELONGATION FACTOR FAMILY.
CC       EF-TU/EF-1A SUBFAMILY……
DR   EMBL; X03558; CAA27245.1; -……
DR   PIR; S18054; EFRB1……
DR   HSSP; Q01698; 1TUI……
DR   InterPro; IPR004160; GTP_EFTU_D3.
DR   Pfam; PF00009; GTP_EFTU; 1……
DR   PROSITE; PS00301; EFACTOR_GTP; 1.
KW   Elongation factor; Protein biosynthesis; GTP-binding; Methylation;
KW   Multigene family.
FT   NP_BIND      14     21       GTP (BY SIMILARITY).
FT   NP_BIND      91     95       GTP (BY SIMILARITY).
FT   NP_BIND     153    156       GTP (BY SIMILARITY).
FT   MOD_RES      36     36       METHYLATION (TRI-).
FT   MOD_RES      55     55       METHYLATION (DI-).
FT   MOD_RES      79     79       METHYLATION (TRI-).
FT   MOD_RES     165    165       METHYLATION (DI-).
FT   MOD_RES     318    318       METHYLATION (TRI-).
FT   BINDING     301    301       ETHANOLAMINE-PHOSPHOGLYCEROL.
FT   BINDING     374    374       ETHANOLAMINE-PHOSPHOGLYCEROL.
FT   CONFLICT     83     83       S -> A (IN REF. 2).
FT   CONFLICT    232    232       L -> V (IN REF. 3).
SQ   SEQUENCE   462 AA;  50141 MW;  D465615545AF686A CRC64;
     MGKEKTHINI VVIGHVDSGK STTTGHLIYK CGGIDKRTIE KFEKEAAEMG KGSFKYAWVL
     DKLKAERERG …… VTKSAQKAQK AK
//

Look for “ID,” “FT,” “SQ,” the sequence, and then “//.”
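Because every annotation line in this format begins with a two-letter line code (ID, AC, FT, SQ, and so on), a record can be split into fields with very little logic. The sketch below assumes one well-formed entry in a string; the function name is illustrative, and the example entry is invented so that no real database content is implied.

```python
from collections import defaultdict

def group_by_line_code(record_text):
    """Group the lines of one EMBL/SWISS-PROT style entry by line code."""
    fields = defaultdict(list)
    sequence = []
    in_sequence = False
    for line in record_text.splitlines():
        if not line.strip():
            continue
        if line.startswith("//"):
            break                                    # end of entry
        if in_sequence:
            sequence.append("".join(line.split()))   # strip the block spacing
            continue
        code, _, rest = line.partition(" ")
        fields[code].append(rest.strip())
        if code == "SQ":
            in_sequence = True                       # residues follow the SQ line
    return dict(fields), "".join(sequence)

# Invented entry, for illustration only.
example = """ID   DEMO_HUMAN     STANDARD;      PRT;    12 AA.
AC   X00000;
FT   MOD_RES       3      3       Hypothetical feature.
SQ   SEQUENCE   12 AA;
     MGKEKTHINI VV
//
"""
fields, seq = group_by_line_code(example)
print(fields["ID"], fields["AC"], seq)
```

Multi-line fields such as DE or CC simply accumulate as several list items under the same code, which is usually enough for a first look at an entry.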

Page 17: BioInformatics Databases

PIR CODATA and NBRF formats:

ENTRY            EFHU1  #type complete  iProClass View of EFHU1
TITLE            translation elongation factor eEF-1 alpha-1 chain - human

(Annotation abridged here)

FEATURE
   1-223             #domain eEF-1 alpha domain I, GTP-binding #status predicted #label EF1\
   8-156             #domain translation elongation factor Tu homology #label ETU\
   14-21             #region nucleotide-binding motif A (P-loop)\
   153-156           #region GTP-binding NKXD motif\
   245-330           #domain eEF-1 alpha domain II, tRNA-binding #status predicted #label EF2\
   332-462           #domain eEF-1 alpha domain III, tRNA-binding #status predicted #label EF3\
   36,55,79,165,318  #modified_site N6,N6,N6-trimethyllysine (Lys) #status predicted\
   301,374           #binding_site glycerylphosphorylethanolamine (Glu) (covalent) #status predicted
SUMMARY          #length 462  #molecular_weight 50141

SEQUENCE
                5        10        15        20        25        30
      1 M G K E K T H I N I V V I G H V D S G K S T T T G H L I Y K
     31 C G G I D K R T I E K F E K E A A E M G K G S F K Y A W V L
     61 D K L K A E R E R …... Q K A Q K A K

>P1;EFHU1
pir1:efhu1 => EFHU1
 MGKEKTHINI VVIGHVDSGK STTTGHLIYK CGGIDKRTIE KFEKEAAEMG
 KGSFKYAWVL DKLKAERERG ITIDISLWKF ETSKYYVTII DAPGHRDFIK
 NMITGTSQAD CAVLIVAAGV GEFEAGISKN GQTREHALLA YTLGVKQLIV
 GVNKMDSTEP PYSQKRYEEI VKEVSTYIKK IGYNPDTVAF VPISGWNGDN
 MLEPSANMPW FKGWKVTRKD GNASGTTLLE ALDCILPPTR PTDKPLRLPL
 QDVYKIGGIG TVPVGRVETG VLKPGMVVTF APVNVTTEVK SVEMHHEALS
 EALPGDNVGF NVKNVSVKDV RRGNVAGDSK NDPPMEAAGF TAQVIILNHP
 GQISAGYAPV LDCHTAHIAC KFAELKEKID RRSGKKLEDG PKFLKSGDAA
 IVDMVPGKPM CVESFSDYPP LGRFAVRDMR QTVAVGVIKA VDKKAAGAGK
 VTKSAQKAQK AK*
C;P1;EFHU1 - translation elongation factor eEF-1 alpha-1 chain - human
C;N;Alternate names: translation elongation factor Tu
C;Species: Homo sapiens (man)
C;Date: 30-Jun-1988 #sequence_revision 05-Apr-1995 #text_change 19-Jan-2001
C;Accession: B24977; A25409; A29946; A32863; I37339
C;R;Rao, T.R.; Slobin, L.I. . . .

Look for “ENTRY” and “SEQUENCE” with numbers for CODATA; “>P1;” name, then definition line, then sequence, then annotation “C;” for NBRF protein format.
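The NBRF layout described on this slide (">P1;" plus name, one definition line, the sequence ending in "*", then "C;" annotation) can be read with a few lines of code. This is a sketch only, assuming a single entry held in a string; the function name is hypothetical and the shortened example sequence is for illustration.

```python
def read_nbrf(record_text):
    """Return (type_code, name, definition, sequence) from one NBRF entry."""
    lines = [ln for ln in record_text.splitlines() if ln.strip()]
    header, definition = lines[0], lines[1]
    if len(header) < 4 or not header.startswith(">") or header[3] != ";":
        raise ValueError("not an NBRF header line")
    type_code, name = header[1:3], header[4:].strip()
    residues = []
    for line in lines[2:]:
        residues.append("".join(line.split()))   # remove the block spacing
        if line.rstrip().endswith("*"):          # '*' terminates the sequence
            break
    return type_code, name, definition.strip(), "".join(residues).rstrip("*")

# Shortened sequence, for illustration only.
example = """>P1;EFHU1
pir1:efhu1 => EFHU1
 MGKEKTHINI VVIGHVDSGK
 AK*
C;Species: Homo sapiens (man)
"""
print(read_nbrf(example))
```

Reading stops at the "*" terminator, so the trailing "C;" annotation lines are ignored rather than parsed.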

Page 18: BioInformatics Databases

Pe

arso

n F

as

tAP

ears

on

Fa

stA

form

at —

forma

t —

>EFHU1>EFHU1 PIR1 release 71.01 PIR1 release 71.01

MGKEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDKRTIEKFEKEAAEMGMGKEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDKRTIEKFEKEAAEMG

KGSFKYAWVLDKLKAERERGITIDISLWKFETSKYYVTIIDAPGHRDFIKKGSFKYAWVLDKLKAERERGITIDISLWKFETSKYYVTIIDAPGHRDFIK

NMITGTSQADCAVLIVAAGVGEFEAGISKNGQTREHALLAYTLGVKQLIVNMITGTSQADCAVLIVAAGVGEFEAGISKNGQTREHALLAYTLGVKQLIV

GVNKMDSTEPPYSQKRYEEIVKEVSTYIKKIGYNPDTVAFVPISGWNGDNGVNKMDSTEPPYSQKRYEEIVKEVSTYIKKIGYNPDTVAFVPISGWNGDN

MLEPSANMPWFKGWKVTRKDGNASGTTLLEALDCILPPTRPTDKPLRLPLMLEPSANMPWFKGWKVTRKDGNASGTTLLEALDCILPPTRPTDKPLRLPL

QDVYKIGGIGTVPVGRVETGVLKPGMVVTFAPVNVTTEVKSVEMHHEALSQDVYKIGGIGTVPVGRVETGVLKPGMVVTFAPVNVTTEVKSVEMHHEALS

EALPGDNVGFNVKNVSVKDVRRGNVAGDSKNDPPMEAAGFTAQVIILNHPEALPGDNVGFNVKNVSVKDVRRGNVAGDSKNDPPMEAAGFTAQVIILNHP

GQISAGYAPVLDCHTAHIACKFAELKEKIDRRSGKKLEDGPKFLKSGDAAGQISAGYAPVLDCHTAHIACKFAELKEKIDRRSGKKLEDGPKFLKSGDAA

IVDMVPGKPMCVESFSDYPPLGRFAVRDMRQTVAVGVIKAVDKKAAGAGKIVDMVPGKPMCVESFSDYPPLGRFAVRDMRQTVAVGVIKAVDKKAAGAGK

VTKSAQKAQKAKVTKSAQKAQKAK

GC

G sin

gle

seq

ue

nc

eG

CG

sing

le s

equ

en

ce

form

at —

forma

t —

!!AA_SEQUENCE 1.0
P1;EFHU1 - translation elongation factor eEF-1 alpha-1 chain - human
N;Alternate names: translation elongation factor Tu……
F;1-223/Domain: eEF-1 alpha domain I, GTP-binding #status predicted <EF1>
F;8-156/Domain: translation elongation factor Tu homology <ETU>
F;14-21/Region: nucleotide-binding motif A (P-loop)
F;153-156/Region: GTP-binding NKXD motif
EFHU1 Length: 462 January 14, 2002 19:49 Type: P Check: 5308 ..

1 MGKEKTHINI VVIGHVDSGK STTTGHLIYK CGGIDKRTIE KFEKE……
351 GQISAGYAPV LDCHTAHIAC KFAELKEKID RRSGKKLEDG PKFLKSGDAA
401 IVDMVPGKPM CVESFSDYPP LGRFAVRDMR QTVAVGVIKA VDKKAAGAGK
451 VTKSAQKAQK AK

FastA: look for “>” and the name at the start of the definition line. Only one annotation line is allowed!

GCG: look for the “!!” sequence type, then the annotation, then the sequence identifier name on the checksum line, then the sequence itself.
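The reading rules above can be sketched in a few lines of Python. This is only an illustration, not part of the Wisconsin Package: the function names `parse_fasta` and `parse_gcg_single` are hypothetical, and the GCG reader assumes that the first line ending in ".." is the checksum line and that every purely numeric token after it is a position label.

```python
def parse_fasta(text):
    """Return (name, description, sequence) for one FastA record."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FastA record must start with '>'")
    # The name runs from '>' to the first space; the rest of this single
    # definition line is the only annotation the format allows.
    name, _, description = lines[0][1:].partition(" ")
    return name, description, "".join(lines[1:])


def parse_gcg_single(text):
    """Return (name, length, sequence) for one GCG single sequence record."""
    lines = text.strip().splitlines()
    if not lines[0].startswith("!!"):
        raise ValueError("expected a '!!' sequence type line first")
    # Assumption: the checksum line is the first line ending in '..'.
    for i, ln in enumerate(lines):
        if ln.rstrip().endswith(".."):
            break
    else:
        raise ValueError("no '..' divider found")
    fields = lines[i].split()
    name = fields[0]
    length = int(fields[fields.index("Length:") + 1])
    # Sequence lines: a leading position number, then blocks of residues.
    residues = []
    for ln in lines[i + 1:]:
        residues.extend(tok for tok in ln.split() if not tok.isdigit())
    return name, length, "".join(residues)
```

Comparing the declared `Length:` with the number of residues actually read is a cheap sanity check after parsing.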

Page 19: BioInformatics Databases

GCG MSF & RSF format:

!!RICH_SEQUENCE 1.0
..
{
name ef1a_giala
descrip PileUp of: @/users1/thompson/.seqlab-mendel/pileup_28.list
type PROTEIN
longname /users1/thompson/seqlab/EF1A_primitive.orig.msf{ef1a_giala}
sequence-ID Q08046
checksum 7342
offset 23
creation-date 07/11/2001 16:51:19
strand 1
comments …………….

!!AA_MULTIPLE_ALIGNMENT 1.0

small.pfs.msf MSF: 735 Type: P July 20, 2001 14:53 Check: 6619 ..

Name: a49171 Len: 425 Check: 537 Weight: 1.00
Name: e70827 Len: 577 Check: 21 Weight: 1.00
Name: g83052 Len: 718 Check: 9535 Weight: 1.00
Name: f70556 Len: 534 Check: 3494 Weight: 1.00
Name: t17237 Len: 229 Check: 9552 Weight: 1.00
Name: s65758 Len: 735 Check: 111 Weight: 1.00
Name: a46241 Len: 274 Check: 3514 Weight: 1.00

// ……………

RSF is SeqLab’s native format.

MSF and RSF are the other GCG formats; unlike the single sequence format, these hold more than one sequence at a time.
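As a rough illustration of how the MSF header lists its member sequences, the sketch below pulls the "Name: … Len: … Check: … Weight: …" lines out of a header. The function name `msf_entries` is hypothetical, and the sketch assumes that "//" appears only as the marker that ends the header.

```python
import re

# One header entry: Name, Len, Check and Weight fields in that order.
NAME_LINE = re.compile(
    r"Name:\s*(\S+)\s+Len:\s*(\d+)\s+Check:\s*(\d+)\s+Weight:\s*([\d.]+)"
)


def msf_entries(text):
    """Return [(name, length, check, weight), ...] from an MSF header."""
    header = text.split("//", 1)[0]  # assumption: '//' closes the header
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)), float(m.group(4)))
        for m in NAME_LINE.finditer(header)
    ]
```

The names returned here are the ones that can later be selected inside braces, e.g. file.msf{a49171}.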

Page 20: BioInformatics Databases

Specialized ‘sequence’-type DB’s:

Databases that contain special types of sequence information, such as patterns, motifs, and profiles. These include: REBASE, EPD, PROSITE, BLOCKS, ProDom, Pfam . . . .

Databases that contain multiple sequence entries aligned, e.g. RDP and ALN.

Databases that contain families of sequences ordered functionally, structurally, or phylogenetically, e.g. iProClass and HOVERGEN.

Databases of species-specific sequences, e.g. the HIV Database and the Giardia lamblia Genome Project.

And on and on . . . . See Amos Bairoch’s excellent links page, http://us.expasy.org/alinks.html, and the wonderful Ensembl human genome project at http://www.ensembl.org/ that tries to tie it all together.

Page 21: BioInformatics Databases

What about other types of biological databases?

Three-dimensional structure databases: the Protein Data Bank and Rutgers Nucleic Acid Database.

These databases contain all of the 3D atomic coordinate data necessary to define the tertiary shape of a particular biological molecule. The data is usually experimentally derived, either by X-ray crystallography or with NMR, but sometimes it is a hypothetical model. In all cases the source of the structure and its resolution is clearly indicated.

Secondary structure boundaries, sequence data, and reference information are often associated with the coordinate data, but it is the 3D data that really matters, not the annotation.

Molecular visualization or modeling software is required to interact with the data. It has little meaning on its own. See Molecules to Go at http://molbio.info.nih.gov/cgi-bin/pdb/ .

Page 22: BioInformatics Databases

Other types of Biological DB’s:

Still more; these can be considered ‘non-molecular’:

Genomic linkage mapping databases for most large genome projects (w/ pointers to sequences): H. sapiens, Mus, Drosophila, C. elegans, Saccharomyces, Arabidopsis, E. coli, . . . .

Reference Databases (also w/ pointers to sequences), e.g.

OMIM: Online Mendelian Inheritance in Man.

PubMed/MedLine: over 11 million citations from more than 4 thousand bio/medical scientific journals.

Phylogenetic Tree Databases: e.g. the Tree of Life.

Metabolic Pathway Databases: e.g. WIT (What Is There) and Japan’s GenomeNet KEGG (the Kyoto Encyclopedia of Genes and Genomes).

Population studies data: which strains, where, etc.

And then databases that many biocomputing people don’t even usually consider: e.g. GIS/GPS/remote sensing data, medical records, census counts, mortality and birth rates . . . .

Page 23: BioInformatics Databases

So how do you access and manipulate all this data?

Often on the InterNet over the World Wide Web:

Site                         URL (Uniform Resource Locator)             Content
Nat’l Center Biotech' Info'  http://www.ncbi.nlm.nih.gov/               databases/analysis/software
PIR/NBRF                     http://www-nbrf.georgetown.edu/            protein sequence database
IUBIO Biology Archive        http://iubio.bio.indiana.edu/              database/software archive
Univ. of Montreal            http://megasun.bch.umontreal.ca/           database/software archive
Japan's GenomeNet            http://www.genome.ad.jp/                   databases/analysis/software
European Mol' Bio' Lab'      http://www.embl-heidelberg.de/             databases/analysis/software
European Bioinformatics      http://www.ebi.ac.uk/                      databases/analysis/software
The Sanger Institute         http://www.sanger.ac.uk/                   databases/analysis/software
Univ. of Geneva BioWeb       http://www.expasy.ch/                      databases/analysis/software
ProteinDataBank              http://www.rcsb.org/pdb/                   3D mol' structure database
Molecules to Go              http://molbio.info.nih.gov/cgi-bin/pdb/    3D protein/nuc' visualization
The Genome DataBase          http://www.gdb.org/                        The Human Genome Project
Stanford Genomics            http://genome-www.stanford.edu/            various genome projects
Inst. for Genomic Res’rch    http://www.tigr.org/                       esp. microbial genome projects
HIV Sequence Database        http://hiv-web.lanl.gov/                   HIV epidemiology seq' DB
The Tree of Life             http://tolweb.org/tree/phylogeny.html      overview of all phylogeny
Ribosomal Database Proj’     http://rdp.cme.msu.edu/index.jsp           databases/analysis/software
PUMA2 at Argonne             http://compbio.mcs.anl.gov/puma2/cgi-bin/  metabolic reconstruction
Harvard Bio' Laboratories    http://golgi.harvard.edu/                  nice bioinformatics links list

With a World Wide Web browser and tools like NCBI’s Entrez & EMBL’s SRS.

Page 24: BioInformatics Databases

Advantage: Can access the very latest updates. It’s fun and very fast. It can be very powerful and efficient, if you know what you’re doing.

Disadvantage: Can be very inefficient, if you don’t know what you’re doing. Also format hassles, and . . . very easy to get lost and/or distracted in cyberspace!

Additionally, problems sometimes arise with the Net, like bad connections. So what are some of the alternatives . . . ?

Desktop software solutions: public domain programs are available, but . . . complicated to install, configure, and maintain. User must be pretty computer savvy. So, commercial software packages are available, e.g. Sequencher, MacVector, DNAsis, DNAStar, etc., but . . . license hassles, big expense per machine, and Internet and/or CD database access all complicate matters!

Page 25: BioInformatics Databases

Therefore, server-based solutions: we’re talking UNIX server computers here.

Again public domain programs exist. But now a VERY cooperative systems manager needs to install, configure, and maintain the system. Therefore a commercial package, e.g. the Wisconsin Package, is often used to simplify matters.

One commercial license fee for an entire institution and very fast, convenient database access on local server disks. Connections from any networked terminal or workstation anywhere!

Within the GCG suite, LookUp is an SRS derivative used to find a sequence of interest from local GCG server databases.

Advantage: Search output is a legitimate GCG list file, appropriate input to other GCG programs; no need to reformat, it is all GCG.

Disadvantage: DB’s only as new as administrator maintains them.

Page 26: BioInformatics Databases

The Genetics Computer Group: the Wisconsin Package for Sequence Analysis.

Begun in 1982 in Oliver Smithies’ lab at the Genetics Dept. at the University of Wisconsin, Madison, then a private company for over 10 years, then acquired by the Oxford Molecular Group U.K., and now owned by Pharmacopeia U.S.A. under the new name Accelrys, Inc.

The suite contains almost 150 programs designed to work in a "toolbox" fashion. Several simple programs used in succession can lead to sophisticated results.

Also 'internal compatibility,' i.e. once you learn to use one program, all programs can be run similarly, and the output from many programs can be used as input for other programs.

Used all over the world by more than 30,000 scientists at over 530 institutions in 35 countries, so learning it here will most likely be useful anywhere else you may end up.

Page 27: BioInformatics Databases

Specifying sequences, GCG style, in order of increasing power and complexity:

To answer the always perplexing GCG question, “What sequence(s)? . . . .”

The sequence is in a local GCG format single sequence file in your UNIX account. (GCG Reformat and all From & To programs)

The sequence is in a local GCG database, in which case you ‘point’ to it by using any of the GCG database logical names. A colon, “:”, always sets the logical name apart from either an accession number or a proper identifier name or a wildcard expression, and they are case insensitive.

The sequence is in a GCG format multiple sequence file, either an MSF (multiple sequence format) file or an RSF (rich sequence format) file. To specify sequences contained in a GCG multiple sequence file, supply the file name followed by a pair of braces, “{}”, containing the sequence specification, e.g. a wildcard: {*}.

Finally, the most powerful method of specifying sequences is in a GCG “list” file. It is merely a list of other sequence specifications and can even contain other list files within it. The convention to use a GCG list file in a program is to precede it with an at sign, “@”. Furthermore, one can supply attribute information within list files to specify something special about the sequence.
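The four kinds of sequence specification can be told apart from their punctuation alone. The sketch below is a hypothetical helper (`classify_spec` is not a GCG program) that applies the conventions just described: "@" for list files, braces for multiple sequence files, a colon for database entries, and anything else as a single sequence file. Upper-casing the logical name is simply one way to honour the case insensitivity.

```python
def classify_spec(spec):
    """Return (kind, container, entry) for one GCG-style specification."""
    spec = spec.strip()
    if spec.startswith("@"):
        # List file: the rest of the string is the file name.
        return ("list file", spec[1:], None)
    if spec.endswith("}") and "{" in spec:
        # MSF or RSF file: file name, then the selection inside braces.
        filename, _, inner = spec[:-1].partition("{")
        return ("multiple sequence file", filename, inner)
    if ":" in spec:
        # Database entry: logical name, colon, then name/accession/wildcard.
        logical, _, entry = spec.partition(":")
        return ("database", logical.upper(), entry)
    return ("single sequence file", spec, None)
```

For example, `classify_spec("SwissProt:EfTu_Ecoli")` reports a database entry under the logical name SWISSPROT, and `classify_spec("Ef1a-Tu.msf{*}")` reports every sequence in that MSF file.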

Page 28: BioInformatics Databases

Logical terms for the Wisconsin Package:

Sequence databases, nucleic acids:

GENBANKPLUS, GBP      all of GenBank plus EST and GSS subdivisions
GENBANK, GB           all of GenBank except EST and GSS subdivisions
BACTERIAL, BA         GenBank bacterial subdivision
EST                   GenBank EST (Expressed Sequence Tags) subdivision
GSS                   GenBank GSS (Genome Survey Sequences) subdivision
HTC                   GenBank High Throughput cDNA
HTG                   GenBank High Throughput Genomic
INVERTEBRATE, IN      GenBank invertebrate subdivision
OTHERMAMM, OM         GenBank other mammalian subdivision
OTHERVERT, OV         GenBank other vertebrate subdivision
PATENT, PAT           GenBank patent subdivision
PHAGE, PH             GenBank phage subdivision
PLANT, PL             GenBank plant subdivision
PRIMATE, PR           GenBank primate subdivision
RODENT, RO            GenBank rodent subdivision
STS                   GenBank (sequence tagged sites) subdivision
SYNTHETIC, SY         GenBank synthetic subdivision
TAGS                  GenBank EST and GSS subdivisions
UNANNOTATED, UN       GenBank unannotated subdivision
VIRAL, VI             GenBank viral subdivision

Sequence databases, amino acids:

GENPEPT, GP           GenBank CDS translations
SWISSPROTPLUS, SWP    all of Swiss-Prot and all of SPTrEMBL
SWISSPROT, SW         all of Swiss-Prot (fully annotated)
SPTREMBL, SPT         Swiss-Prot preliminary EMBL translations
PIR, P                all of PIR Protein
PIR1, PROTEIN         PIR fully annotated subdivision
PIR2                  PIR preliminary subdivision
PIR3                  PIR unverified subdivision
PIR4                  PIR unencoded subdivision
NRL_3D, NRL           PDB 3D protein sequences

General data files:

GENMOREDATA           path to GCG optional data files
GENRUNDATA            path to GCG default data files

These are easy: they make sense and you’ll have a vested interest.
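Because each database has a long and a short logical name, a lookup table is a natural way to think about them. The sketch below covers only a handful of the protein names from the table; the dictionaries and the `describe` function are illustrative assumptions, not how the Wisconsin Package stores its logical names.

```python
# Long logical names and their descriptions (a subset of the table above).
LOGICAL_NAMES = {
    "GENPEPT": "GenBank CDS translations",
    "SWISSPROTPLUS": "all of Swiss-Prot and all of SPTrEMBL",
    "SWISSPROT": "all of Swiss-Prot (fully annotated)",
    "SPTREMBL": "Swiss-Prot preliminary EMBL translations",
    "PIR": "all of PIR Protein",
}

# Short forms, modelled here as aliases of the long forms.
ALIASES = {
    "GP": "GENPEPT",
    "SWP": "SWISSPROTPLUS",
    "SW": "SWISSPROT",
    "SPT": "SPTREMBL",
    "P": "PIR",
}


def describe(spec):
    """Return (long logical name, description, entry) for 'NAME:entry'."""
    logical, _, entry = spec.partition(":")
    key = logical.upper()            # logical names are case insensitive
    key = ALIASES.get(key, key)
    if key not in LOGICAL_NAMES:
        raise KeyError(f"unknown logical name: {logical}")
    return key, LOGICAL_NAMES[key], entry
```

With this table, "sw:EfTu_Ecoli" and "SwissProt:EfTu_Ecoli" resolve to the same database.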

Page 29: BioInformatics Databases

The List File Format:

An example GCG list file of many elongation 1a and Tu factors follows. As with all GCG data files, two periods separate documentation from data. ..

my-special.pep begin:24 end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
/usr/accounts/test/another.rsf{ef1a_*}
@[email protected]

The ‘way’ SeqLab works!
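A list file like the one above is simple enough to read with a few lines of Python. This is a minimal sketch, assuming that the first ".." in the file is the divider, that each data line holds one specification, and that attributes are written as key:value tokens after it; `parse_list_file` is a hypothetical name.

```python
def parse_list_file(text):
    """Return [(specification, attributes), ...] from a GCG-style list file."""
    # Everything before the first '..' is documentation and is ignored.
    _, divider, data = text.partition("..")
    if not divider:
        raise ValueError("no '..' divider between documentation and data")
    entries = []
    for line in data.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        spec, attrs = tokens[0], {}
        for tok in tokens[1:]:
            # Attributes such as begin:24 or end:134 follow the specification.
            key, _, value = tok.partition(":")
            attrs[key] = int(value) if value.isdigit() else value
        entries.append((spec, attrs))
    return entries
```

Each specification returned here may itself be a file, a database entry, a multiple sequence file selection, or another list file, so a fuller reader would recurse on entries that start with "@".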

Page 30: BioInformatics Databases

Conclusions:

There’s a bewildering assortment of different databases and ways to access and manipulate the information within them. The key is to learn how to use that information in the most efficient manner. A comprehensive sequence analysis software suite, such as the Wisconsin Package, expedites the chore, putting a large assortment of tools all under one organizational model with one user interface.

FOR EVEN MORE INFO...

Contact me (stevet@[email protected]) for specific bioinformatics assistance and/or collaboration.