Page 1: Biological databases Nicky Mulder: nicola.mulder@uct.ac.za.

Biological databases

Nicky Mulder: nicola.mulder@uct.ac.za

What is a database

• An organized body of related information (source: www.cogsci.princeton.edu/cgi-bin/webwn)

• Data collection that is:
– Structured (computer readable)
– Searchable
– Updatable
– Cross-linked
– Publicly available

Biological Databases

• Make data available to the public
• So much data is available that it needs ordering
• Turn data into computer-readable form
• Ability to retrieve data from various sources
• Can be primary (archival) or secondary (curated) databases

Most commonly used are sequence databases

Biological systems

Types of data relating to biological systems (diagram labels):

• Taxonomic data
• Literature
• Protein folding and 3D structure
• Small molecules
• Pathways and networks
• Protein families and domains
• Whole genome data
• Sequence data

Biological systems

Added to the data types above:

• Ontologies, e.g. GO (Gene Ontology)

Sequence databases

• Used for retrieving a known gene/protein sequence
• Useful for finding information on a gene/protein
• Can find out how many genes are available for a given organism
• Can compare your sequence to the others in the database
• Can submit your sequence to store with the rest
• Main databases: nucleotide and protein sequence DBs

Requirements for a good sequence database

• It must be complete with minimal redundancy

• It must contain as much up-to-date information (annotation) as possible on each sequence

• All the information items must be retrievable by computer programs in a consistent manner

• It must be highly interoperable with other databases

Nucleotide sequence databases

• EMBL, DDBJ, GenBank

• Data submitted by sequence owner

• Must provide certain information and CDS if applicable

• No additional annotation added

• Entries never merged, so some redundancy

[Diagram labels: promoter, exons, CDS (coding sequence)]

Example EMBL entry 1: general info

ID   AB083336 standard; genomic DNA; MAM; 6116 BP.
AC   AB083336;
XX
SV   AB083336.1
DT   06-JAN-2005 (Rel. 82, Created)
DT   06-JAN-2005 (Rel. 82, Last updated, Version 1)
DE   Sus scrofa p27Kip1 gene for p27Kip1, p27Kip1R, complete cds, alternative
DE   splicing.
OS   Sus scrofa (pig)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Cetartiodactyla; Suina; Suidae; Sus.
RN   [1]
RP   1-6116
RA   Hirano K., Shintani Y., Hirano M., Kanaide H.;
RT   ;
RL   Submitted (08-APR-2002) to the EMBL/GenBank/DDBJ databases.
RL   Katsuya Hirano, Graduate School of Medical Sciences, Kyushu University,
RL   Division of Molecular Cardiology, Research Institute of Angiocardiology;
RL   3-1-1 Maidashi, Higashi-ku, Fukuoka, Fukuoka, 812-8582, Japan
RL   (E-mail:[email protected], Tel:81-92-642-5550,
RL   Fax:81-92-642-5552)
RN   [2]
RA   Shintani Y., Hirano K., Hirano M., Nishimura J., Nakano H., Kanaide H.;
RT   "Cloning and Charaterization of full sequence of porcine p27Kip1 gene and
RT   expression of splice isoform p27Kip1R";
RL   Unpublished.

Callouts on the slide: accession number (AC line), description of gene (DE lines), references (RN to RL lines).
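Because every line of an EMBL entry starts with a two-letter line code, the entry can be read by a program in a consistent way. The following is a minimal sketch, not an official parser: the function name `parse_embl_lines` and the shortened `ENTRY` text are assumptions made for illustration, and only the line codes shown on the slide are exercised.

```python
from collections import defaultdict


def parse_embl_lines(text):
    """Group the lines of an EMBL flat-file entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        line = line.rstrip()
        # Skip blank lines and the XX spacer lines that separate blocks.
        if len(line) < 2 or line.startswith("XX"):
            continue
        code, _, value = line.partition(" ")
        fields[code].append(value.strip())
    return dict(fields)


# A shortened copy of the example entry (illustrative subset of its lines).
ENTRY = """\
ID   AB083336 standard; genomic DNA; MAM; 6116 BP.
AC   AB083336;
XX
SV   AB083336.1
DE   Sus scrofa p27Kip1 gene for p27Kip1, p27Kip1R, complete cds, alternative
DE   splicing.
OS   Sus scrofa (pig)
"""

fields = parse_embl_lines(ENTRY)
accession = fields["AC"][0].rstrip(";")   # accession number without the ';'
description = " ".join(fields["DE"])      # DE lines joined into one sentence
print(accession)
print(description)
```

Multi-line fields such as DE simply become lists with one item per line, so joining them restores the full description.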

Example EMBL entry 2: features on the sequence (CDS)

FH   Key             Location/Qualifiers
FT   source          1..6116
FT                   /db_xref="taxon:9823"
FT                   /mol_type="genomic DNA"
FT                   /organism="Sus scrofa"
FT                   /cell_type="liver"
FT                   /clone_lib="lambda Fix II porcine genomic DNA"
FT   exon            784..1714
FT                   /evidence=NOT_EXPERIMENTAL
FT                   /note="The residue 2591 corresponds to the transcription
FT                   initiation site determined in human gene"
FT   CDS             join(1240..1714,2261..2271,5104..5160)
FT                   /codon_start=1
FT                   /gene="p27Kip1"
FT                   /product="p27Kip1R"
FT                   /protein_id="BAD83612.1"
FT                   /translation="MSNVRVSNGSPSLERMDARQAEYPKPSACRNLFGPVNHEELTRDL
FT                   EKHCRDMEEASQRKWNFDFQNHKPLEGKYEWQEVEKGSLPEFYYRPPRPPKGACKVPAQ
FT                   EGQGVSGTRQAVPLIGSQANSEDTHLVDQKTDAPDSQTGLAEQCTGIRKRPATDDSSPP
FT                   SVSLKIGMYQLNYSSVW"

Callouts on the slide: feature type and location, feature name and information, corresponding protein sequence (/translation).
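The CDS location `join(1240..1714,2261..2271,5104..5160)` lists the segments that are spliced together to form the coding sequence. As a rough worked example, the sketch below extracts the segments and adds up their lengths. The helper names `parse_location` and `spliced_length` are hypothetical, and the parser only handles simple forward-strand locations like this one (no `complement(...)`, `<` or `>` markers).

```python
import re


def parse_location(location):
    """Return (start, end) pairs from a simple location such as join(1..5,9..12)."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def spliced_length(segments):
    """Total number of bases covered by the segments (coordinates are inclusive)."""
    return sum(end - start + 1 for start, end in segments)


segments = parse_location("join(1240..1714,2261..2271,5104..5160)")
cds_length = spliced_length(segments)

# Assumption: the CDS ends with a stop codon, which is not part of the protein.
protein_length = cds_length // 3 - 1

print(segments)
print(cds_length, protein_length)
```

For this entry the three segments are 475, 11 and 57 bases long, giving 543 bases or 181 codons. Dropping the stop codon leaves 180 amino acids, which agrees with the length of the /translation shown in the entry.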

Example EMBL entry 3: features on the sequence (introns and exons)

FT   intron          1715..2260
FT                   /cons_splice=(5'site:NO,3'site:NO)
FT   exon            2261..2390
FT                   /number=2
FT   intron          2391..4494
FT                   /cons_splice=(5'site:NO,3'site:NO)
FT   exon            4495..5824
FT                   /note="ending at a putative poly A site following a polyA
FT                   signal"
FT                   /number=3
FT   polyA_signal    5802..5807
XX
SQ   Sequence 6116 BP; 1583 A; 1392 C; 1438 G; 1703 T; 0 other;
     gcggccgcga gctcaattaa ccctcactaa agggagtcga ctcgatctcg aagccctttt        60
     cttgttttta ttgagggaga gcttgggttc agaatacatt acaaatgcag catctattcc       120
     agtctactta tagaaagacg tcctcctggg cttcccccct aagccccctg cctcccctag       180
     aacagcacag acttctaggt taagggtgag ctaaccactg ctcaccccca gctaaggcac       240
     ccaggctcag gggctccccg cctcccccgc tgagcgagcg gtgggggccc ccccgggaga       300
     gagcccagct gggggccgag cgcccagcgg cgagcccagc tgcccgcccc tacccgctcg       360
     gcgagcgagg ggaaaataag atcgccctcg gcgaggagag ggaggtcggg gctccggagc       420

Callout on the slide: DNA sequence.
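The SQ line summarises the sequence length and base composition, which gives a simple consistency check when reading an entry. The sketch below is illustrative only: `parse_sq_line` and `clean_sequence_line` are hypothetical helper names, and the regular expressions assume the SQ layout shown in this example.

```python
import re


def parse_sq_line(line):
    """Return (total_length, base_counts) from an EMBL SQ line."""
    total = int(re.search(r"Sequence\s+(\d+)\s+BP", line).group(1))
    counts = {
        base: int(number)
        for number, base in re.findall(r"(\d+)\s+([ACGT]|other)\b", line)
    }
    return total, counts


def clean_sequence_line(line):
    """Drop the spaces and the trailing position counter from a sequence line."""
    return "".join(part for part in line.split() if not part.isdigit())


total, counts = parse_sq_line(
    "SQ   Sequence 6116 BP; 1583 A; 1392 C; 1438 G; 1703 T; 0 other;"
)
first_line = clean_sequence_line(
    "gcggccgcga gctcaattaa ccctcactaa agggagtcga ctcgatctcg aagccctttt 60"
)

print(total, counts)
print(sum(counts.values()) == total)  # the base counts should add up to the length
print(len(first_line))                # each full sequence line holds 60 bases
```

Here the four base counts add up to the stated 6116 BP, so the header and the composition agree.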

Summary of information in EMBL entries

• Describes sequence type, e.g. genomic DNA, RNA, EST
• Provides taxonomy from which sequence came
• Provides information on submitters and references
• Describes features on a sequence that are important for function, replication, recombination, structure etc.
• Shows if the DNA encodes a protein (CDS) and provides protein sequence
• Provides actual nucleotide sequence

Protein sequences

[Diagram: DNA → RNA → Protein, with the labels SS and Ac on the protein]

• Protein cleavage
• Protein modification
• Transported to organelle or membrane
• Folded into secondary or tertiary structure
• Performs a specific function

All this info needs to be captured in a database

Protein Sequence Databases

• UniProt:
– Swiss-Prot: manually curated, distinguishes between experimental and computationally derived annotation
– TrEMBL: automatic translation of EMBL, no manual curation, some automatic annotation
• GenPept: GenBank translations
• RefSeq: non-redundant sequences for certain organisms
• IPI (International Protein Index): combination of many protein sequence databases

Example of a Swiss-Prot entry 1

References

General information

Example of a Swiss-Prot entry 2

Cross-references

Functional information

Example of a Swiss-Prot entry 3

Keywords

Features

Sequence

Swiss-Prot annotation mainly found in:

• Description (DE) lines
– Protein name/function

• Comment (CC) lines
– e.g. function, subcellular location, pathway, cofactor, disease, etc.

• Feature table (FT)
– Features on the sequence, e.g. domain, active site, modifications, variations, etc.

• Keyword (KW) lines
– Set of a few hundred controlled vocabulary terms

Other parts to UniProt

• UniParc: archive of all sequences

• UniProt: Swiss-Prot + TrEMBL

• UniProt NREF100 (sequences that are 100% identical merged)

• UniProt NREF90 (sequences with at least 90% identity merged)

• UniProt NREF50 (sequences with at least 50% identity merged)

Submitting sequences to EMBL or UniProt

WEB-IN: web-based submission tool for submitting DNA sequences to the EMBL database.

Protein sequences are submitted only when the peptides have been directly sequenced. Submit these through SPIN.

Sequence formats

• Not MSWord, but text!
• Most include an ID/name/annotation of some sort
• FASTA, e.g.

>xyz some other comment
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcgctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgccagatcaaggctcatgtagcctcactgg

• Others specific to programs, e.g. GCG, abi, clustal, etc.
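FASTA is simple enough to read with a few lines of code: a header line starting with `>` carries the ID and an optional comment, and the following lines carry the sequence. The following is a minimal sketch, assuming plain text input. The function name `read_fasta` is hypothetical, and the example sequence is only the first 20 bases of the one on the slide, split over two lines.

```python
def read_fasta(text):
    """Yield (identifier, description, sequence) for each record in FASTA text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield _make_record(header, chunks)


def _make_record(header, chunks):
    """Split the header into ID and comment, and join the sequence lines."""
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


EXAMPLE = """\
>xyz some other comment
ttcctctttc
tcgactccat
"""

records = list(read_fasta(EXAMPLE))
print(records)
```

Real projects would normally use an established library for this. The sketch is only meant to show how little structure the format has.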

Literature database: PubMed/Medline

• Source of medical-related and scientific literature
• PubMed has articles published after 1965
• Can search by many different means, e.g. author, title, date, journal etc., or keywords for each
• Can save queries and results
• Can usually retrieve abstracts and full papers
• PubMed has a list of tags to search specific fields, e.g. [AU], [TI], [DP] etc.

Search fields in PubMed

• Title Words [TI]
• MeSH Terms [MH]
• Title/Abstract Words [TIAB]
• Language [LA]
• Text Words [TW]
• Journal Title [TA]
• Substance Name [NM]
• Issue [IP]
• Subset [SB]
• Filter [FILTER]
• Secondary Source ID [SI]
• Entrez Date [EDAT]
• Subheadings [SH]
• EC/RN Number [RN]
• Publication Type [PT]
• Author Name [AU]
• Publication Date [DP]
• All Fields [ALL]
• Personal Name as Subject [PS]
• Affiliation [AD]
• Page Number [PG]
• Unique Identifiers [UID]
• MeSH Major Topic [MAJR]
• MeSH Date [MHDA]
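Field tags are appended to a search term in square brackets, and tagged terms can be combined with Boolean operators. The sketch below only assembles a query string to paste into the PubMed search box. The helper `build_pubmed_query` is hypothetical and the search values are made up for illustration.

```python
def build_pubmed_query(fields, operator="AND"):
    """Combine {tag: term} pairs into a PubMed query string with field tags."""
    parts = []
    for tag, term in fields.items():
        term = term.strip()
        if " " in term:
            term = f'"{term}"'  # quote multi-word terms so they stay together
        parts.append(f"{term}[{tag}]")
    return f" {operator} ".join(parts)


# Illustrative search: an author, a title word and a publication year.
query = build_pubmed_query({"AU": "Mulder N", "TI": "database", "DP": "2005"})
print(query)
```

Keeping the tags in a dictionary makes it easy to add or drop a field without retyping the whole query.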

Taxonomy Databases

• Most used is NCBI’s taxonomy database: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Taxonomy
• Provides entries for all known organisms
• Provides taxonomic lineage and translation table for organisms
• Sequence entries for organism
• UniProt-specific taxonomy database is Newt: http://www.ebi.ac.uk/newt

Example taxonomy entry

Where to find the databases

• Table of addresses for major databases and tools

• Nucleic Acids Research Database issue, published in January each year

• Nucleic Acids Research Software issue (new)

• Amos’s list of tools: http://www.expasy.ch/alinks.html