
Sequence Databases – 21 June 2007

Learning objectives

• Be able to describe how information is stored in GenBank.
• Be able to read a GenBank flat file.
• Be able to search GenBank for information.
• Be able to explain the content difference between a header, features and sequence.
• Be able to say what distinguishes a primary database from a secondary database.
• Be able to access and navigate the ENTREZ platform for biological data analysis.

BIOSEQs – entry common to all sequence databases

BIOSEQ = Biological sequence

Central element in the NCBI database model. Found in both the nucleotide and protein databases.

Comprises the sequence of a single continuous molecule of nucleic acid or protein.

Each entry must have:
• At least one sequence identifier (Seq-id)
• Information on the physical type of molecule (DNA, RNA, or protein)
• Descriptors, which describe the entire Bioseq
• Annotations, which provide information regarding specific locations within the Bioseq
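The four required parts of a Bioseq can be sketched as a small data structure. This is only an illustration: the class and field names below are hypothetical and are not the NCBI toolkit's own types, and the example values are made up apart from the identifier.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical, simplified model of a Bioseq (names are illustrative only).
@dataclass
class Annotation:
    key: str      # kind of annotation, e.g. "CDS"
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive

@dataclass
class Bioseq:
    seq_ids: List[str]        # at least one sequence identifier is required
    mol_type: str             # "DNA", "RNA", or "protein"
    sequence: str
    descriptors: Dict[str, str] = field(default_factory=dict)    # describe the whole Bioseq
    annotations: List[Annotation] = field(default_factory=list)  # describe specific locations

    def __post_init__(self):
        if not self.seq_ids:
            raise ValueError("a Bioseq needs at least one sequence identifier")
        if self.mol_type not in {"DNA", "RNA", "protein"}:
            raise ValueError(f"unknown molecule type: {self.mol_type}")

# Illustrative values: a 10-base fragment with one made-up annotation.
example = Bioseq(
    seq_ids=["U49845.1"],
    mol_type="DNA",
    sequence="gatcctccat",
    descriptors={"organism": "Saccharomyces cerevisiae"},
    annotations=[Annotation("CDS", 1, 10)],
)
```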

What is GenBank?

The NIH genetic sequence database, an annotated collection of all publicly available DNA sequences

Each record represents a single contiguous stretch of DNA or RNA

DNA stretches may have more than one coding region (gene).

RNA sequences are presented with T, not U
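As a small illustration of the T-for-U convention, an RNA sequence written with U can be rewritten in the GenBank alphabet as follows. The function name is hypothetical.

```python
def to_genbank_alphabet(rna: str) -> str:
    """Rewrite an RNA sequence using T in place of U, preserving case."""
    return rna.replace("U", "T").replace("u", "t")

print(to_genbank_alphabet("auggcuuaa"))  # atggcttaa
```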

Records are generated from direct submissions to the DNA sequence databases from the investigators (authors).

GenBank is part of the International Nucleotide Sequence Database Collaboration.

General Comments on GBFF

Three sections:
1) Header - information about the whole record
2) Features - description of annotations, each represented by a key
3) Nucleotide sequence - each record ends with // on its last line
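A minimal sketch of how the three sections could be separated in code, assuming the record is available as plain text and that the FEATURES and ORIGIN keywords start at the beginning of a line. The function name and the shortened sample record are illustrative, not an official parser.

```python
def split_flat_file(record: str) -> dict:
    """Split one GenBank flat-file record into header, features, and sequence lines."""
    header, features, sequence = [], [], []
    current = header
    for line in record.splitlines():
        if line.startswith("FEATURES"):
            current = features
        elif line.startswith(("BASE COUNT", "ORIGIN")):
            current = sequence
        elif line.startswith("//"):
            break  # "//" marks the end of the record
        current.append(line)
    return {"header": header, "features": features, "sequence": sequence}

# Shortened sample for illustration only.
sample = """LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
ACCESSION   U49845
FEATURES             Location/Qualifiers
     source          1..5028
ORIGIN
        1 gatcctccat atacaacggt
//"""

parts = split_flat_file(sample)
print(len(parts["header"]), len(parts["features"]), len(parts["sequence"]))  # 2 2 2
```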

The amino acid translation of a nucleic acid (DNA or RNA (cDNA)) sequence is recorded as a “feature”.

GenBank Flat File (MyoD1 as an example)

Feature Keys

Purpose:
1) Indicates biological nature of sequence
2) Supplies information about changes to sequences

Feature Key     Description
conflict        Separate determinations of the same sequence differ
rep_origin      Origin of replication
protein_bind    Protein binding site on DNA
CDS             Protein coding sequence

Feature Keys-Terminology

Feature Key     Location/Qualifiers
CDS             23..400
                /product="alcohol dehydro."
                /gene="adhI"

The feature CDS is a coding sequence beginning at base 23 and ending at base 400, has a product called “alcohol dehydrogenase” and corresponds to the gene called “adhI”.
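A simple location such as 23..400 is 1-based and includes both end points. The sketch below shows how that maps onto a Python slice; the function name and the toy sequence are made up for illustration.

```python
def extract_span(sequence: str, location: str) -> str:
    """Return the bases covered by a simple 'start..end' location.

    Locations are 1-based and inclusive, so 23..400 covers
    400 - 23 + 1 = 378 bases.
    """
    start_text, end_text = location.split("..")
    start, end = int(start_text), int(end_text)
    return sequence[start - 1:end]

# Made-up sequence: 22 flanking bases, 378 "coding" bases, 10 flanking bases.
toy = "a" * 22 + "c" * 378 + "g" * 10
cds = extract_span(toy, "23..400")
print(len(cds), set(cds))  # 378 {'c'}
```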

Feature Keys-Terminology (Cont.)

Feature Key     Location/Qualifiers
CDS             join(544..589,688..1032)
                /product="T-cell recep. B-ch."
                /partial

The feature CDS is a partial coding sequence formed by joining the indicated elements to form one contiguous sequence encoding a product called T-cell receptor beta-chain.
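Joining the indicated elements can be sketched as follows. This handles only the plain join(a..b,c..d) form; the function name and the toy sequence are assumptions for illustration.

```python
import re

def join_spans(sequence: str, location: str) -> str:
    """Concatenate the segments named in a join(a..b,c..d) location."""
    inner = re.fullmatch(r"join\s*\((.*)\)", location.strip())
    if inner is None:
        raise ValueError(f"not a join location: {location}")
    pieces = []
    for segment in inner.group(1).split(","):
        start, end = (int(x) for x in segment.strip().split(".."))
        pieces.append(sequence[start - 1:end])  # 1-based, inclusive
    return "".join(pieces)

toy = "n" * 1100  # made-up sequence, long enough to cover base 1032
joined = join_spans(toy, "join(544..589,688..1032)")
print(len(joined))  # 46 + 345 = 391
```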

(For MyoD1 – Accession number X61655)

http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=53301
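For retrieving a record by program rather than through the viewer page, one option is the NCBI E-utilities efetch service. E-utilities is not mentioned on the slides; it is brought in here as an assumption, and the function names are hypothetical. The sketch only builds and prints the request URL; calling fetch_flat_file needs network access.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the E-utilities efetch endpoint, not the viewer URL on the slide.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def flat_file_url(accession: str) -> str:
    """Build an efetch URL that asks for a nucleotide record as a GenBank flat file."""
    params = {"db": "nuccore", "id": accession, "rettype": "gb", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)

def fetch_flat_file(accession: str) -> str:
    """Download the flat file text (requires network access)."""
    with urlopen(flat_file_url(accession)) as response:
        return response.read().decode("utf-8")

print(flat_file_url("X61655"))
```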

Record from GenBank

LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999

DEFINITION Saccharomyces cerevisiae TCP1-beta gene, partial cds, and

Axl2p (AXL2) and Rev7p (REV7) genes, complete cds.

ACCESSION U49845

VERSION U49845.1 GI:1293613

KEYWORDS .

SOURCE baker's yeast.

ORGANISM Saccharomyces cerevisiae

Eukaryota; Fungi; Ascomycota; Hemiascomycetes; Saccharomycetales;

Saccharomycetaceae; Saccharomyces.

Callouts for the header fields:

Locus name: SCU49845 (LOCUS line)

GenBank division: PLN (plant, fungal and algal)

Modification date: 21-JUN-1999

Coding region: "cds" in the DEFINITION line

ACCESSION: unique identifier (never changes)

VERSION: nucleotide sequence identifier in accession.version form (changes when there is a change in the sequence)

GI: GeneInfo identifier (changes whenever there is a change in the sequence)

KEYWORDS: word or phrase describing the sequence (not based on a controlled vocabulary). Not used in newer records.

SOURCE: common name for the organism

ORGANISM: formal scientific name for the source organism and its lineage, based on the NCBI Taxonomy Database
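The LOCUS and VERSION lines can be taken apart by splitting on whitespace. This sketch assumes the seven-field LOCUS layout of the sample record; other records may carry extra fields, which this code rejects. Function names are hypothetical.

```python
def parse_locus(line: str) -> dict:
    """Parse a LOCUS line laid out as: LOCUS name length bp molecule division date."""
    fields = line.split()
    if len(fields) != 7 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError("unexpected LOCUS line layout")
    return {
        "locus_name": fields[1],
        "length": int(fields[2]),
        "molecule": fields[4],
        "division": fields[5],
        "modified": fields[6],
    }

def parse_version(line: str) -> dict:
    """Parse a VERSION line laid out as: VERSION accession.version GI:number."""
    fields = line.split()
    accession, version = fields[1].split(".")
    return {
        "accession": accession,
        "version": int(version),
        "gi": fields[2].split(":", 1)[1],
    }

locus = parse_locus("LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999")
version = parse_version("VERSION U49845.1 GI:1293613")
print(locus["division"], version["accession"], version["version"])  # PLN U49845 1
```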

Record from GenBank (cont.1)

REFERENCE 1 (bases 1 to 5028)

AUTHORS Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.

TITLE Cloning and sequence of REV7, a gene whose function is required

for DNA damage-induced mutagenesis in Saccharomyces cerevisiae

JOURNAL Yeast 10 (11), 1503-1509 (1994)

MEDLINE 95176709


AUTHORS Roemer,T., Madden,K., Chang,J. and Snyder,M.

TITLE Selection of axial growth sites in yeast requires Axl2p, a

novel plasma membrane glycoprotein

JOURNAL Genes Dev. 10 (7), 777-793 (1996)

MEDLINE 96194260

MEDLINE: Medline unique identifier (UID)


AUTHORS Roemer,T.

TITLE Direct Submission

JOURNAL Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University,

New Haven, CT, USA

Submitter of sequence (always the last reference)


FEATURES Location/Qualifiers

source 1..5028

/organism="Saccharomyces cerevisiae"

/db_xref="taxon:4932"

/chromosome="IX"

/map="9"

CDS <1..206

/codon_start=3

/product="TCP1-beta"

/protein_id="AAA98665.1"

/db_xref="GI:1293614"

/translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA

AEVLLRVDNIIRARPRTANRQHM"

The 5’ end of the coding sequence begins upstream of the first nucleotide of the sequence. The 3’ end is complete.
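Because the 5’ end is missing, /codon_start=3 says that the first complete codon begins at base 3. A minimal sketch, assuming the standard genetic code and using the first 60 bases from the ORIGIN section of this record, reproduces the start of the /translation value. The function name is hypothetical, and ambiguous bases are not handled.

```python
from itertools import product

# Standard genetic code, with codons enumerated in t, c, a, g order.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna: str, codon_start: int = 1) -> str:
    """Translate DNA starting at the 1-based position given by codon_start."""
    coding = dna.lower()[codon_start - 1:]
    usable = len(coding) - len(coding) % 3  # drop a trailing partial codon
    return "".join(CODON_TABLE[coding[i:i + 3]] for i in range(0, usable, 3))

first_60 = "gatcctccatatacaacggtatctccacctcaggtttagatctcaacaacggaaccattg"
print(translate(first_60, codon_start=3))  # SSIYNGISTSGLDLNNGTI
```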

Each feature has three parts: a key (a keyword that indicates the functional group), a location (instructions for finding the feature), and qualifiers (auxiliary information about the feature).

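Callouts for the feature table:

Keys: source, CDS

Location: 1..5028, <1..206

Qualifiers: /organism, /codon_start, /product, and so on

Values: the text after "=". Descriptive free text must be in quotation marks.

/codon_start: start of the open reading frame

/db_xref: database cross-references

/protein_id: protein sequence ID number

<1..206: the "<" notes that only a partial sequence is present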
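The key, location, and qualifier structure can be read into a dictionary with a few lines of code. This is a sketch under simplifying assumptions: each qualifier sits on one line, repeated qualifiers overwrite each other, and the function name is hypothetical.

```python
def parse_feature(lines):
    """Parse one feature: the first line is 'key location', the rest are qualifiers."""
    key, location = lines[0].split(None, 1)
    qualifiers = {}
    for line in lines[1:]:
        name, _, value = line.strip().lstrip("/").partition("=")
        # Qualifiers without a value (e.g. /partial) are stored as True.
        qualifiers[name] = value.strip('"') if value else True
    return {"key": key, "location": location.strip(), "qualifiers": qualifiers}

feature = parse_feature([
    "CDS <1..206",
    "/codon_start=3",
    '/product="TCP1-beta"',
    '/protein_id="AAA98665.1"',
    '/db_xref="GI:1293614"',
])
print(feature["key"], feature["location"], feature["qualifiers"]["product"])
```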

Record from GenBank (cont.3)

gene 687..3158

/gene="AXL2"

CDS 687..3158

/gene="AXL2"

/note="plasma membrane glycoprotein"

/codon_start=1

/function="required for axial budding pattern of S. cerevisiae"

/product="Axl2p"

/protein_id="AAA98666.1"

/db_xref="GI:1293615"

/translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVN. . ."

gene complement(3300..4037)

/gene="REV7"

CDS complement(3300..4037)

/gene="REV7"

/codon_start=1

/product="Rev7p"

/protein_id="AAA98667.1"

/db_xref="GI:1293616"

/translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQ . . ."

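Cutoff: both /translation values are truncated here, as marked by ". . .".

Another location: each gene and CDS feature carries its own location, for example 687..3158 or complement(3300..4037). complement(...) indicates that the feature lies on the complementary strand.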
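A complement(...) location, as used for REV7, means the feature is read from the opposite strand, so the bases are taken from the given span and then reverse-complemented. A minimal sketch with a made-up toy sequence and hypothetical function names:

```python
import re

COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")

def reverse_complement(dna: str) -> str:
    """Complement each base, then reverse the order."""
    return dna.translate(COMPLEMENT)[::-1]

def extract_complement(sequence: str, location: str) -> str:
    """Return the feature sequence for a complement(start..end) location."""
    match = re.fullmatch(r"complement\((\d+)\.\.(\d+)\)", location)
    if match is None:
        raise ValueError(f"not a complement location: {location}")
    start, end = int(match.group(1)), int(match.group(2))
    return reverse_complement(sequence[start - 1:end])  # 1-based, inclusive

toy = "aaccggtt" + "atgc"  # made-up 12-base sequence
print(extract_complement(toy, "complement(9..12)"))  # gcat
```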


BASE COUNT 1510 a 1074 c 835 g 1609 t

ORIGIN

1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg

61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct . . .//
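The ORIGIN lines mix position numbers with blocks of ten bases. A short sketch of stripping the numbers and counting bases, using only the two lines shown here (so the counts cover 120 bases, not the full 5028 reported by BASE COUNT). The function name is hypothetical.

```python
from collections import Counter

def parse_origin(lines):
    """Join the base blocks from ORIGIN lines, skipping position numbers and markers."""
    bases = []
    for line in lines:
        for token in line.split():
            if token.isalpha():  # keeps "gatcctccat", skips "1", ".", and "//"
                bases.append(token.lower())
    return "".join(bases)

origin_lines = [
    "1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg",
    "61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct",
]
sequence = parse_origin(origin_lines)
counts = Counter(sequence)
print(len(sequence))  # 120
print({base: counts[base] for base in "acgt"})
```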

Primary databases vs. Secondary databases

A primary database comprises information submitted directly by the experimenter. It is called an archival database.

A secondary database comprises information derived from a primary database. It is a curated database.

Types of primary databases carrying biological information

GenBank/EMBL/DDBJ

PDB-Three-dimensional structure coordinates of biological molecules

PROSITE-database of protein domain/function relationships. http://www.expasy.org/prosite/


Types of secondary databases carrying biological information

dbSTS - Non-redundant db of sequence-tagged sites (useful for physical mapping)

Genome databases - there are over 20 genome databases that can be searched

EPD - eukaryotic promoter database http://www.epd.isb-sib.ch/

NR - non-redundant GenBank+EMBL+DDBJ+PDB. Entries with 100% sequence identity are merged as one.

ProDom http://protein.toulouse.inra.fr/prodom/current/html/home.php

PRINTS http://bioinf.man.ac.uk/dbbrowser/PRINTS/

BLOCKS http://bioinformatics.weizmann.ac.il/blocks/
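The NR idea of merging entries with 100% sequence identity can be sketched by grouping identifiers under their sequence. This is an illustration only, not how NR is actually built; the function name and the identifiers are made up.

```python
def merge_identical(entries):
    """Group identifiers by sequence so identical sequences become one entry.

    entries: iterable of (identifier, sequence) pairs.
    Returns a dict mapping each distinct sequence to its list of identifiers.
    """
    merged = {}
    for identifier, sequence in entries:
        merged.setdefault(sequence.upper(), []).append(identifier)
    return merged

# Made-up identifiers and sequences for illustration.
toy_entries = [("gb|A", "ATGC"), ("emb|B", "atgc"), ("dbj|C", "ATGG")]
nr = merge_identical(toy_entries)
print(len(nr))     # 2
print(nr["ATGC"])  # ['gb|A', 'emb|B']
```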




Secondary Databases

DNA databases derived from GenBank, containing data for a single gene
• Non-redundant (nr)
• dbGSS (genome survey sequences)
• dbHTGS (high throughput)
• dbSTS (sequence tagged site)
• LocusLink

RNA (cDNA) databases derived from GenBank, containing data for a single gene
• dbEST (expressed sequence tag)
• UniGene
• LocusLink

Protein databases derived from GenBank, containing data for a single gene
• Non-redundant (nr)
• Swissprot
• PIR (Int’l. protein sequence)
• LocusLink

References for understanding the NCBI sequence database model

Here is the website for NCBI developer tools: http://www.ncbi.nlm.nih.gov/IEB/ToolBox/SDKDOCS/INDEX.HTML

[Figure: DNA, RNA, protein, and RNA processing. Labels: Mature mRNA; RNA, but NOT mRNA.]
