Special Topics BSC4933/5936:
An Introduction to Bioinformatics.
Florida State University
The Department of Biological Science
www.bio.fsu.edu
BioInformatics Databases
Steven M. Thompson
Florida State University School of Computational Science (SCS)
NCBI’s Entrez
But first some of my definitions, with lots of overlap:
Biocomputing and computational biology are synonyms and describe the use of computers and computational techniques to analyze any type of biological system, from individual molecules to organisms to overall ecology.
Bioinformatics describes using computational techniques to access, analyze, and interpret the biological information in any type of biological database.
Sequence analysis is the study of molecular sequence data for the purpose of inferring the function, interactions, evolution, and perhaps structure of biological molecules.
Genomics analyzes the context of genes or complete genomes (the total DNA content of an organism) within the same and/or across different genomes.
Proteomics is the subdivision of genomics concerned with analyzing the complete protein complement, i.e. the proteome, of organisms, both within and between different organisms.
One way to think about the field: The Reverse Biochemistry Analogy.
Biochemists no longer have to begin a research project by isolating and purifying massive amounts of a protein from its native organism in order to characterize a particular gene product. Rather, now scientists can amplify a section of some genome based on its similarity to other genomes, sequence that piece of DNA and, using sequence analysis tools, infer all sorts of functional, evolutionary, and, perhaps, structural insight into that stretch of DNA!
The computer and molecular databases are a necessary, integral part of this entire process.
The exponential growth of molecular sequence databases & CPU power:
[Growth table not recovered in this transcript; column headings: Year, BasePairs, Sequences]
The Human Genome Project and numerous smaller genome projects have kept the data coming at alarming rates. As of December 2004, almost 240 complete genomes are publicly available for analysis, not counting all the virus and viroid genomes available.
The International Human Genome Sequencing Consortium announced the completion of the "Working Draft" of the human genome in June 2000; independently that same month, the private company Celera Genomics announced that it had completed the first “Assembly” of the human genome. Both articles were published mid-February 2001 in the journals Science and Nature.
Some neat stuff from the papers:
We, Homo sapiens, aren’t nearly as special as we had hoped we were. Of the 3.2 billion base pairs in our DNA:
Traditional, textbook estimates of the number of genes were often in the 100,000 range; it turns out we’ve only got about twice as many as a fruit fly, between 25,000 and 35,000!
The protein-coding region of the genome is only about 1% or so; a large share of the remainder is ‘jumping’, ‘selfish’ DNA, much of which may be involved in regulation and control.
Between 100 and 200 genes were reported to have been transferred from an ancestral bacterial genome to an ancestral vertebrate genome! (Later shown to be not true by more extensive analyses, and to be due to gene loss rather than transfer.)
What are sequence databases?
These databases are an organized way to store the tremendous amount of sequence information accumulating worldwide. Most have their own specific format. An ‘alphabet soup’ of three major database organizations around the world is responsible for maintaining most of this data. They largely ‘mirror’ one another and share accession codes, but NOT proper identifier names:
North America: the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM) at the National Institutes of Health (NIH), has GenBank & GenPept. Also Georgetown University’s National Biomedical Research Foundation (NBRF) Protein Identification Resource (PIR) & NRL_3D (Naval Research Lab sequences of known three-dimensional structure).
Europe: the European Molecular Biology Laboratory (EMBL), the European Bioinformatics Institute (EBI), and the Swiss Institute of Bioinformatics’ (SIB) Expert Protein Analysis System (ExPASy) all help maintain the EMBL Nucleotide Sequence Database and the SWISS-PROT & TrEMBL amino acid sequence databases.
Asia: The National Institute of Genetics (NIG) supports the Center for Information Biology’s (CIB) DNA Data Bank of Japan (DDBJ).
A little history:
Developments that affect software and the end user:
The first well recognized sequence database was Dr. Margaret Dayhoff’s hardbound Atlas of Protein Sequence and Structure, begun in the mid-sixties. DDBJ began in 1984, GenBank in 1982, and EMBL in 1980. They are all attempts at establishing an organized, reliable, comprehensive and openly available library of genetic sequences. Databases have long since outgrown a hardbound atlas. They have become huge and have evolved through many changes with many more yet to come.
Changes in format over the years are a major source of grief for software designers and program users. Each program needs to be able to recognize particular aspects of the sequence files; whenever they change it throws a wrench in the works. NCBI’s ASN.1 format and its Entrez interface attempt to circumvent some of these frustrations. However, database format is much debated as many bioinformaticians argue for relational or object-oriented standards. Unfortunately, until all biologists and computer scientists worldwide agree on one standard and all software is (re)written to that standard, neither of which is likely to happen very quickly, format issues will remain probably the most confusing and troubling aspect of working with primary sequence data.
So what are these databases like?
Just what are primary sequences?
(Central Dogma: DNA -> RNA -> protein)
Primary refers to one dimension: all of the ‘symbol’ information written in sequential order necessary to specify a particular biological molecular entity, be it polypeptide or nucleotide.
The symbols are the one-letter codes for all of the biological nitrogenous bases and amino acid residues and their ambiguity codes. Biological carbohydrates, lipids, and structural and functional information are not sequence data. Not even DNA translations in a DNA database!
However, much of this feature and bibliographic type information is available in the reference documentation sections associated with primary sequences in the databases.
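The idea of a primary sequence as a string of one-letter symbols plus ambiguity codes can be sketched in a few lines of Python. This is only an illustration: the function name `invalid_symbols` and the two alphabet constants are hypothetical, and the alphabets are the standard IUPAC one-letter codes rather than anything specific to one database.

```python
# IUPAC one-letter codes, including ambiguity codes.
# Nucleotides: A C G T U plus R Y S W K M B D H V N.
NUCLEOTIDE_CODES = set("ACGTURYSWKMBDHVN")
# Amino acids: the 20 standard residues plus B (Asx), Z (Glx), X (unknown).
AMINO_ACID_CODES = set("ACDEFGHIKLMNPQRSTVWYBZX")


def invalid_symbols(sequence, alphabet):
    """Return (position, symbol) pairs for symbols outside the alphabet.

    Positions are 1-based, the way sequence databases count residues.
    """
    return [
        (position, symbol)
        for position, symbol in enumerate(sequence.upper(), start=1)
        if symbol not in alphabet
    ]


print(invalid_symbols("ATGGCNRY", NUCLEOTIDE_CODES))  # ambiguity codes are accepted
print(invalid_symbols("ATG-GCJ", NUCLEOTIDE_CODES))   # '-' and 'J' are reported
```

Anything that is not one of these symbols (gaps, digits, annotation) is by this definition not primary sequence data and has to live elsewhere in the entry.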
Content & Organization:
Sequence database installations are commonly a complex ASCII/Binary mix, usually not relational or Object Oriented (but proprietary ones often are). They’ll contain several very long text files, each containing different types of information all related to particular sequences, such as all of the sequences themselves, versus all of the title lines, or all of the reference sections. Binary files often help ‘glue together’ all of these other files by providing indexing functions.
Software is usually required to successfully interact with these databases, and access is most easily handled through various software packages and interfaces, either on the World Wide Web or otherwise.
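To make the indexing idea concrete, here is a minimal Python sketch of how an index can ‘glue’ a long flat file together: it records the byte offset of each entry so that one record can be read without scanning the whole file. It assumes GenBank-style records (a LOCUS line at the start, a `//` line at the end); the function names and the two toy records are hypothetical, and a real installation would store the index in its own binary file rather than in memory.

```python
import io


def build_offset_index(handle):
    """Map each entry name to the byte offset where its record starts.

    Assumes every record begins with a LOCUS line whose second field
    is the entry name.
    """
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b"LOCUS"):
            name = line.split()[1].decode("ascii")
            index[name] = offset
    return index


def fetch_record(handle, index, name):
    """Seek straight to one record and read it up to its '//' line."""
    handle.seek(index[name])
    lines = []
    for line in iter(handle.readline, b""):
        lines.append(line)
        if line.startswith(b"//"):
            break
    return b"".join(lines).decode("ascii")


# Two toy records standing in for a very long flat file.
flat_file = io.BytesIO(
    b"LOCUS       TOY0001  12 bp\nORIGIN\n        1 acgtacgtac gt\n//\n"
    b"LOCUS       TOY0002   8 bp\nORIGIN\n        1 ttgacgta\n//\n"
)
index = build_offset_index(flat_file)
print(fetch_record(flat_file, index, "TOY0002"))
```

The same two functions work on a file opened with `open(path, "rb")`; the in-memory `BytesIO` is used here only so the sketch runs on its own.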
More organization stuff:
Nucleic acid sequence databases (and TrEMBL) are split into subdivisions based on taxonomy (historical rankings; the Fungi warning!). PIR is split into subdivisions based on level of annotation. TrEMBL sequences are merged into SWISS-PROT as they receive increased levels of annotation.
All sequence databases contain these elements:
Name: LOCUS, ENTRY, ID all are unique identifiers.
Definition: A brief, one-line, textual sequence description.
Accession Number: A constant data identifier.
Source and taxonomy information.
Complete literature references.
Comments and keywords.
The all-important FEATURE table!
A summary or checksum line.
The sequence itself.
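Because the name line starts with a different keyword in each database family, that keyword is a convenient first clue to what kind of entry you are looking at. The sketch below is an illustration only: `guess_format` is a hypothetical helper, the entry names are placeholders, and the keyword-to-format pairing (LOCUS for GenBank/GenPept, ID for EMBL and SWISS-PROT, ENTRY for PIR) is the conventional one rather than something spelled out on the slide.

```python
# First keyword of an entry -> flat-file family.
FIRST_KEYWORDS = {
    "LOCUS": "GenBank/GenPept",
    "ID": "EMBL/SWISS-PROT",
    "ENTRY": "PIR",
}


def guess_format(entry_text):
    """Guess the flat-file family from the first non-blank line of an entry."""
    for line in entry_text.splitlines():
        if line.strip():
            keyword = line.split()[0]
            return FIRST_KEYWORDS.get(keyword, "unknown")
    return "unknown"


# Placeholder entry names, used only to exercise the function.
print(guess_format("LOCUS       TOY0001  12 bp"))
print(guess_format("ID   TOY_ENTRY"))
print(guess_format(">toy sequence in some other format"))
```

A real converter has to look much further than the first line, but this is the kind of test that format-recognition code starts with.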
Parts and problems:
But: Each major database as well as each major suite of software tools that you are likely to use has its own distinct format requirements. This can be a huge problem and an enormous time sink, even with helpful tools such as Don Gilbert’s ReadSeq. Therefore, becoming familiar with some of the common formats is a big help. Look for key features of each type of entry:
GenBank and GenPept format:
LOCUS       HSEF1AR      1506 bp    mRNA    linear   PRI 12-SEP-1993
DEFINITION  Human mRNA for elongation factor 1 alpha subunit (EF-1 alpha).
ACCESSION   X03558
VERSION     X03558.1  GI:31097

SEQUENCE
                5        10        15        20        25        30
      1 M G K E K T H I N I V V I G H V D S G K S T T T G H L I Y K
     31 C G G I D K R T I E K F E K E A A E M G K G S F K Y A W V L
     61 D K L K A E R E R …... Q K A Q K A K
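The keyword lines of the GenBank entry shown above (LOCUS, DEFINITION, ACCESSION, VERSION) follow a simple pattern: a keyword starting in the first column, then its value, with indented lines continuing the previous value. A minimal Python sketch of reading them follows; `parse_genbank_header` is a hypothetical helper, it handles only this top-of-entry block, and it is not a substitute for a full parser.

```python
def parse_genbank_header(text):
    """Collect keyword/value pairs from the top of a GenBank flat-file entry.

    Keywords start in column 1; indented lines are treated as
    continuations of the previous keyword's value.
    """
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] != " ":
            current, _, value = line.partition(" ")
            fields[current] = value.strip()
        elif current is not None:
            fields[current] += " " + line.strip()
    return fields


header = """\
LOCUS       HSEF1AR      1506 bp    mRNA    linear   PRI 12-SEP-1993
DEFINITION  Human mRNA for elongation factor 1 alpha subunit (EF-1 alpha).
ACCESSION   X03558
VERSION     X03558.1  GI:31097
"""
fields = parse_genbank_header(header)
locus_name = fields["LOCUS"].split()[0]
print(locus_name, fields["ACCESSION"], fields["VERSION"].split()[0])
```

Note how the name (HSEF1AR) and the accession number (X03558) come from different lines; only the accession number is shared across the mirrored databases.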
What about other types of biological databases?
Databases that contain special types of sequence information, such as patterns, motifs, and profiles. These include: REBASE, EPD, PROSITE, BLOCKS, ProDom, Pfam . . . .
Databases that contain multiple sequence entries aligned, e.g. RDP and ALN.
Databases that contain families of sequences ordered functionally, structurally, or phylogenetically, e.g. iProClass and HOVERGEN.
Databases of species-specific sequences, e.g. the HIV Database and the Giardia lamblia Genome Project.
And on and on . . . . See Amos Bairoch’s excellent links page, http://us.expasy.org/alinks.html, and the wonderful Ensembl human genome project at http://www.ensembl.org/ that tries to tie it all together.
Three-dimensional structure databases: the Protein Data Bank and Rutgers Nucleic Acid Database.
These databases contain all of the 3D atomic coordinate data necessary to define the tertiary shape of a particular biological molecule. The data is usually experimentally derived, either by X-ray crystallography or with NMR, but sometimes it is a hypothetical model. In all cases the source of the structure and its resolution are clearly indicated.
Secondary structure boundaries, sequence data, and reference information are often associated with the coordinate data, but it is the 3D data that really matters, not the annotation.
Molecular visualization or modeling software is required to interact with the data. It has little meaning on its own. See Molecules to Go at http://molbio.info.nih.gov/cgi-bin/pdb/ .
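To show what “3D atomic coordinate data” looks like to a program, here is a small Python sketch that reads one ATOM line of a PDB-format file using the format's fixed column positions. The `Atom` type and `parse_atom_record` are hypothetical names, and the sample line carries made-up coordinates laid out in the right columns; it is not taken from a real structure.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float


def parse_atom_record(line):
    """Parse one fixed-column ATOM/HETATM line of a PDB-format file."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return Atom(
        name=line[12:16].strip(),          # columns 13-16: atom name
        residue=line[17:20].strip(),       # columns 18-20: residue name
        chain=line[21],                    # column 22: chain identifier
        residue_number=int(line[22:26]),   # columns 23-26: residue number
        x=float(line[30:38]),              # columns 31-38: x in Angstroms
        y=float(line[38:46]),              # columns 39-46: y in Angstroms
        z=float(line[46:54]),              # columns 47-54: z in Angstroms
    )


# Hypothetical coordinates, laid out in the PDB column positions.
sample = "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67"
atom = parse_atom_record(sample)
print(atom.name, atom.residue, atom.residue_number, (atom.x, atom.y, atom.z))
```

A list of such tuples is all the numbers a viewer needs to draw the molecule, which is why the coordinates mean little without visualization software.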
Other types of Biological DB’s:
Still more; these can be considered ‘non-molecular’:
Genomic linkage mapping databases for most large genome projects (w/ pointers to sequences): H. sapiens, Mus, Drosophila, C. elegans, Saccharomyces, Arabidopsis, E. coli, . . . .
Reference Databases (also w/ pointers to sequences): e.g.
OMIM: Online Mendelian Inheritance in Man
PubMed/MedLine: over 11 million citations from more than 4 thousand bio/medical scientific journals.
Phylogenetic Tree Databases: e.g. the Tree of Life.
Metabolic Pathway Databases: e.g. WIT (What Is There) and Japan’s GenomeNet KEGG (the Kyoto Encyclopedia of Genes and Genomes).
Population studies data: which strains, where, etc.
And then databases that many biocomputing people don’t even usually consider:
e.g. GIS/GPS/remote sensing data, medical records, census counts, mortality and birth rates . . . .
So how do you access and manipulate all this data?
Often on the InterNet over the World Wide Web:
Site                         URL                                         Content
Nat’l Center Biotech' Info'  http://www.ncbi.nlm.nih.gov/                databases/analysis/software
PIR/NBRF                     http://www-nbrf.georgetown.edu/             protein sequence database
European Mol' Bio' Lab'      http://www.embl-heidelberg.de/              databases/analysis/software
European Bioinformatics      http://www.ebi.ac.uk/                       databases/analysis/software
The Sanger Institute         http://www.sanger.ac.uk/                    databases/analysis/software
Univ. of Geneva BioWeb       http://www.expasy.ch/                       databases/analysis/software
ProteinDataBank              http://www.rcsb.org/pdb/                    3D mol' structure database
Molecules to Go              http://molbio.info.nih.gov/cgi-bin/pdb/     3D protein/nuc' visualization
The Genome DataBase          http://www.gdb.org/                         The Human Genome Project
Stanford Genomics            http://genome-www.stanford.edu/             various genome projects
Inst. for Genomic Res’rch    http://www.tigr.org/                        esp. microbial genome projects
HIV Sequence Database        http://hiv-web.lanl.gov/                    HIV epidemiology seq' DB
The Tree of Life             http://tolweb.org/tree/phylogeny.html       overview of all phylogeny
PUMA2 at Argonne             http://compbio.mcs.anl.gov/puma2/cgi-bin/   metabolic reconstruction
UNIX server computers here.
Again public domain programs exist. But now a VERY cooperative systems manager needs to install, configure, and maintain the system. Therefore a commercial package, e.g. the Wisconsin Package, is often used to simplify matters.
One commercial license fee for an entire institution and very fast, convenient database access on local server disks. Connections from any networked terminal or workstation anywhere!
Within the GCG suite, LookUp is an SRS derivative used to find a sequence of interest from local GCG server databases.
Advantage: Search output is a legitimate GCG list file, appropriate input to other GCG programs; no need to reformat, it is all GCG.
Disadvantage: DB’s only as new as administrator maintains them.
The Genetics Computer Group: the Wisconsin Package for Sequence Analysis.
Begun in 1982 in Oliver Smithies’ lab at the Genetics Dept. at the University of Wisconsin, Madison, then a private company for over 10 years, then acquired by the Oxford Molecular Group U.K., and now owned by Pharmacopeia U.S.A. under the new name Accelrys, Inc.
The suite contains almost 150 programs designed to work in a "toolbox" fashion. Several simple programs used in succession can lead to sophisticated results.
Also 'internal compatibility,' i.e. once you learn to use one program, all programs can be run similarly, and the output from many programs can be used as input for other programs.
Used all over the world by more than 30,000 scientists at over 530 institutions in 35 countries, so learning it here will most likely be useful anywhere else you may end up.
Specifying sequences, GCG style; in order of increasing power and complexity:
To answer the always perplexing GCG question, “What sequence(s)? . . . .”
The sequence is in a local GCG format single sequence file in your UNIX account. (GCG Reformat and all From & To programs)
The sequence is in a local GCG database, in which case you ‘point’ to it by using any of the GCG database logical names. A colon, “:,” always sets the logical name apart from either an accession number or a proper identifier name or a wildcard expression, and they are case insensitive.
The sequence is in a GCG format multiple sequence file, either an MSF (multiple sequence format) file or an RSF (rich sequence format) file. To specify sequences contained in a GCG multiple sequence file, supply the file name followed by a pair of braces, “{},” containing the sequence specification, e.g. a wildcard: {*}.
Finally, the most powerful method of specifying sequences is in a GCG “list” file. It is merely a list of other sequence specifications and can even contain other list files within it. The convention to use a GCG list file in a program is to precede it with an at sign, “@.” Furthermore, one can supply attribute information within list files to specify something special about the sequence.
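The four ways of specifying a sequence described above differ only in punctuation, which a short Python sketch can make explicit. This is an illustration of the naming convention, not GCG's own code: `classify_gcg_spec` and the file names in the examples are hypothetical.

```python
def classify_gcg_spec(spec):
    """Classify a GCG-style sequence specification string.

    @listfile       -> list file
    file{pattern}   -> entries in an MSF/RSF multiple sequence file
    logical:entry   -> entry in a local database (case insensitive)
    anything else   -> single sequence file
    """
    if spec.startswith("@"):
        return {"kind": "list file", "file": spec[1:]}
    if spec.endswith("}") and "{" in spec:
        filename, _, pattern = spec[:-1].partition("{")
        return {"kind": "multiple sequence file", "file": filename, "entries": pattern}
    if ":" in spec:
        logical, _, entry = spec.partition(":")
        # Logical names and entry names are case insensitive, so normalize them.
        return {"kind": "database", "database": logical.upper(), "entry": entry.upper()}
    return {"kind": "sequence file", "file": spec}


# Hypothetical file names; gb:x03558 uses the accession number shown earlier.
for spec in ["toy.seq", "gb:x03558", "toy.msf{*}", "@toy.list"]:
    print(spec, "->", classify_gcg_spec(spec))
```

The check order matters: the at sign and the braces are tested before the colon so that a list file or multiple sequence file is never mistaken for a database reference.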
Logical terms for the Wisconsin Package:

Sequence databases, nucleic acids:
GENBANKPLUS    all of GenBank plus EST and GSS subdivisions
GBP            all of GenBank plus EST and GSS subdivisions
GENBANK        all of GenBank except EST and GSS subdivisions
GB             all of GenBank except EST and GSS subdivisions
BA             GenBank bacterial subdivision
BACTERIAL      GenBank bacterial subdivision
EST            GenBank EST (Expressed Sequence Tags) subdivision
OM             GenBank other mammalian subdivision
OTHERMAMM      GenBank other mammalian subdivision
OV             GenBank other vertebrate subdivision
OTHERVERT      GenBank other vertebrate subdivision
PAT            GenBank patent subdivision

Sequence databases, amino acids:
GENPEPT        GenBank CDS translations
GP             GenBank CDS translations
SWISSPROTPLUS  all of Swiss-Prot and all of SPTrEMBL
SWP            all of Swiss-Prot and all of SPTrEMBL
SWISSPROT      all of Swiss-Prot (fully annotated)
SW             all of Swiss-Prot (fully annotated)
SPTREMBL       Swiss-Prot preliminary EMBL translations
PIR2           PIR preliminary subdivision
PIR3           PIR unverified subdivision
PIR4           PIR unencoded subdivision
NRL_3D         PDB 3D protein sequences
NRL            PDB 3D protein sequences
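Several of the logical names listed above are just short and long spellings of the same database (they carry identical descriptions). A small Python sketch of resolving either spelling to one canonical name is shown below; the `ALIASES` table is built from those pairs, and `canonical_logical_name` is a hypothetical helper, not part of the Wisconsin Package.

```python
# Short logical name -> long logical name, paired by identical descriptions.
ALIASES = {
    # nucleic acid databases
    "GBP": "GENBANKPLUS",
    "GB": "GENBANK",
    "BA": "BACTERIAL",
    "OM": "OTHERMAMM",
    "OV": "OTHERVERT",
    # amino acid databases
    "GP": "GENPEPT",
    "SWP": "SWISSPROTPLUS",
    "SW": "SWISSPROT",
    "NRL": "NRL_3D",
}


def canonical_logical_name(name):
    """Return the long form of a logical name; the lookup is case insensitive."""
    key = name.upper()
    return ALIASES.get(key, key)


print(canonical_logical_name("gb"))    # GENBANK
print(canonical_logical_name("sw"))    # SWISSPROT
print(canonical_logical_name("pir2"))  # PIR2 (no alias in the list)
```

Names without a listed alias, such as EST, PAT, or the PIR subdivisions, simply pass through in upper case.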
Contact me (stevet@[email protected]) for specific bioinformatics assistance and/or collaboration.
There’s a bewildering assortment of different databases and ways to access and manipulate the information within them. The key is to learn how to use that information in the most efficient manner. A