Sequence Databases – 21 June 2007
Learning objectives:
Be able to describe how information is stored in GenBank.
Be able to read a GenBank flat file.
Be able to search GenBank for information.
Be able to explain the content difference between a header, features and sequence.
Be able to say what distinguishes a primary database from a secondary database.
Be able to access and navigate the ENTREZ platform for biological data analysis.
BIOSEQs – entry common to all sequence databases
BIOSEQ = Biological sequence
Central element in the NCBI database model.
Found in both the nucleotide and protein databases.
Comprises the sequence of a single continuous molecule of nucleic acid or protein.
Entry must have:
1) At least one sequence identifier (Seq-id)
2) Information on the physical type of molecule (DNA, RNA, or protein)
3) Descriptors, which describe the entire Bioseq
4) Annotations, which provide information regarding specific locations within the Bioseq
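The Bioseq elements listed above can be pictured as a small data structure. The following is only an illustrative sketch: the class and field names (Bioseq, Annotation, seq_ids, mol_type and so on) are assumptions made for this example and are not NCBI's actual schema or software.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Annotation:
    """Information about one specific location within a Bioseq (hypothetical type)."""
    start: int  # 1-based first position
    end: int    # 1-based last position, inclusive
    note: str


@dataclass
class Bioseq:
    """Toy model of a Bioseq; field names are illustrative, not NCBI's schema."""
    seq_ids: List[str]   # at least one sequence identifier (Seq-id)
    mol_type: str        # physical type of molecule: DNA, RNA, or protein
    sequence: str        # one continuous molecule
    descriptors: List[str] = field(default_factory=list)        # describe the entire Bioseq
    annotations: List[Annotation] = field(default_factory=list)  # describe specific locations

    def __post_init__(self):
        # Enforce the two requirements that can be checked mechanically.
        if not self.seq_ids:
            raise ValueError("a Bioseq needs at least one sequence identifier")
        if self.mol_type not in {"DNA", "RNA", "protein"}:
            raise ValueError("mol_type must be DNA, RNA, or protein")


# Made-up entry for illustration only.
toy = Bioseq(seq_ids=["TOY0001"], mol_type="DNA", sequence="ATGGCCTAA",
             descriptors=["made-up example entry"])
toy.annotations.append(Annotation(start=1, end=9, note="toy coding region"))
```

Descriptors apply to the whole entry, while each annotation carries its own start and end, which mirrors the distinction drawn in the list above.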
What is GenBank?
The NIH genetic sequence database, an annotated collection of all publicly available DNA sequences
Each record represents a single contiguous stretch of DNA or RNA
DNA stretches may have more than one coding region (gene).
RNA sequences are presented with T, not U
Records are generated from direct submissions to the DNA sequence databases from the investigators (authors).
GenBank is part of the International Nucleotide Sequence Database Collaboration.
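The convention noted above, that RNA sequences are presented with T rather than U, amounts to a simple letter substitution. A minimal sketch, assuming a plain string of nucleotide letters; the function name is hypothetical.

```python
def rna_to_database_letters(rna: str) -> str:
    """Rewrite an RNA sequence with T in place of U, preserving letter case."""
    return rna.translate(str.maketrans("Uu", "Tt"))


# Made-up RNA fragment for illustration.
converted = rna_to_database_letters("AUGGCCUAA")
```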
General Comments on GBFF
Three sections:
1) Header: information about the whole record
2) Features: description of annotations, each represented by a key
3) Nucleotide sequence: each ends with // on the last line of the record
A nucleic acid sequence (DNA, or RNA as cDNA) translated into an amino acid sequence is a "feature".
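The three-section layout described above can be sketched in a few lines of Python. This is a simplified illustration, not a full parser: it assumes the features section starts at the line beginning with FEATURES, the sequence section starts at the line beginning with ORIGIN, and the record ends at //. The record text is a made-up toy entry, and the function names are hypothetical.

```python
def split_genbank_record(text):
    """Split one flat-file record into header, features and sequence lines."""
    sections = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in text.splitlines():
        if line.startswith("//"):        # end-of-record marker
            break
        if line.startswith("FEATURES"):
            current = "features"
        elif line.startswith("ORIGIN"):
            current = "sequence"
        sections[current].append(line)
    return sections


def sequence_letters(sequence_lines):
    """Join the bases from the sequence section, dropping position numbers."""
    return "".join(ch for line in sequence_lines[1:] for ch in line if ch.isalpha())


# Made-up toy record for illustration only; not a real GenBank entry.
toy_record = """LOCUS       TOY0001   24 bp    DNA
DEFINITION  Made-up record for illustration only.
FEATURES             Location/Qualifiers
     CDS             1..24
                     /product="toy protein"
ORIGIN
        1 atggcctaag ctagctagct agct
//
"""

parts = split_genbank_record(toy_record)
bases = sequence_letters(parts["sequence"])
```

For real work, an established parser such as Biopython's SeqIO module is a safer choice than a hand-written splitter like this one.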
GenBank Flat File (MyoD1 as an example)
Feature Keys
Purpose:
1) Indicates biological nature of sequence
2) Supplies information about changes to sequences
Feature Key     Description
conflict        Separate determinations of the same sequence differ
rep_origin      Origin of replication
protein_bind    Protein binding site on DNA
CDS             Protein coding sequence
Feature Keys-Terminology
Feature Key     Location/Qualifiers
CDS             23..400
                /product="alcohol dehydro."
                /gene="adhI"
The feature CDS is a coding sequence beginning at base 23 and ending at base 400, has a product called “alcohol dehydrogenase” and corresponds to the gene called “adhI”.
The feature CDS is a partial coding sequence formed by joining the indicated elements to form one contiguous sequence encoding a product called T-cell receptor beta-chain.
The 5’ end of the coding sequence begins upstream of the first nucleotide of the sequence. The 3’ end is complete.
There are three parts to a feature: a key (a keyword indicating the functional group), a location (instructions for finding the feature), and qualifiers (auxiliary information about the feature).
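To make the location part concrete, here is a minimal sketch that reads a location string such as 23..400 and reports the spans it covers. It assumes only simple start..end spans, optionally wrapped in join(...) and optionally marked with < or > for a partial 5' or 3' end. Other location forms (complement, single bases) are not handled. The function name and the join coordinates below are hypothetical, chosen only for illustration.

```python
import re


def parse_location(location):
    """Return the spans, total length and partial-end flags of a simple location."""
    spans = [(int(start), int(end))
             for start, end in re.findall(r"<?(\d+)\.\.>?(\d+)", location)]
    return {
        "spans": spans,
        # Positions are 1-based and inclusive, so a span covers end - start + 1 bases.
        "length": sum(end - start + 1 for start, end in spans),
        "partial_5prime": "<" in location,
        "partial_3prime": ">" in location,
    }


simple = parse_location("23..400")                 # the alcohol dehydrogenase example
joined = parse_location("join(<1..120,300..450)")  # made-up coordinates
```

In the first case the coding sequence covers 400 - 23 + 1 = 378 bases. In the second, the < marks a 5' end that begins upstream of the first nucleotide in the record.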
Record from GenBank (cont. 3)
     gene            687..3158
                     /gene="AXL2"
     CDS             687..3158
                     /gene="AXL2"
                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of S. cerevisiae"
                     /product="Axl2p"
                     /protein_id="AAA98666.1"
                     /db_xref="GI:1293615"
Callouts on the record:
Keys: gene, CDS
Location: 687..3158
Qualifiers: the names after "/"; Values: the text after "="
Descriptive free text must be in quotations
/codon_start: start of open reading frame
/db_xref: database cross-refs
/protein_id: protein sequence ID #
Note: only a partial sequence
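The AXL2 record above can be read mechanically into keys, locations and qualifiers. The sketch below is a simplification under stated assumptions: each qualifier sits on a single line starting with "/", and every other non-empty line holds a key followed by its location. Real flat files can wrap long qualifier values across lines, which this does not handle. The function name is hypothetical.

```python
def parse_features(lines):
    """Turn feature-table lines into a list of key/location/qualifier records."""
    features = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if text.startswith("/"):
            # Qualifier line: /name=value, with free text given in quotes.
            name, _, value = text[1:].partition("=")
            features[-1]["qualifiers"][name] = value.strip('"')
        else:
            # Key line: feature key followed by its location.
            key, location = text.split(None, 1)
            features.append({"key": key, "location": location, "qualifiers": {}})
    return features


axl2_lines = '''
     gene            687..3158
                     /gene="AXL2"
     CDS             687..3158
                     /gene="AXL2"
                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of S. cerevisiae"
                     /product="Axl2p"
                     /protein_id="AAA98666.1"
                     /db_xref="GI:1293615"
'''.splitlines()

features = parse_features(axl2_lines)
cds = features[1]
```

Here cds["qualifiers"]["protein_id"] holds the protein sequence ID and cds["qualifiers"]["db_xref"] holds the database cross-reference.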
Types of secondary databases carrying biological information
dbSTS: Non-redundant db of sequence-tagged sites (useful for physical mapping)
Genome databases: there are over 20 genome databases that can be searched
EPD: eukaryotic promoter database http://www.epd.isb-sib.ch/
NR: non-redundant GenBank+EMBL+DDBJ+PDB. Entries with 100% sequence identity are merged as one.
ProDom http://protein.toulouse.inra.fr/prodom/current/html/home.php
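The NR rule above, that entries with 100% sequence identity are merged as one, can be illustrated by grouping identifiers under their exact sequence. This is a toy sketch and not how NR is actually built. The identifiers, sequences and function name are all made up for the example.

```python
from collections import defaultdict


def merge_identical(entries):
    """Group identifiers whose sequences are exactly identical (case-insensitive)."""
    merged = defaultdict(list)
    for identifier, sequence in entries:
        merged[sequence.upper()].append(identifier)
    return dict(merged)


# Hypothetical identifiers and sequences from different source databases.
entries = [
    ("genbank_toy_1", "MKTAYIAK"),
    ("embl_toy_7", "MKTAYIAK"),   # identical to the first, so it is merged
    ("ddbj_toy_3", "MKTAYIAR"),   # differs by one residue, so it stays separate
]
non_redundant = merge_identical(entries)
```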