MCB, feb 2005 EMBnet
An introduction to biological databases
[email protected]

What is a database ?
• A collection of
– structured
– searchable (index) -> table of contents
– updated periodically (release) -> new edition
– cross-referenced (hyperlinks) -> links with other db
data
• Includes also associated tools (software) necessary for db access/query, db updating, db information insertion, db information deletion…
Why biological databases ?
• Exponential growth in biological data.
• Data (genomic sequences, 3D structures, 2D gel analysis, MS analysis, Microarrays….) are no longer published in a conventional manner, but directly submitted to databases.
Sequence Databases: some « technical » definitions
Data storage management:
– flat file: text file, human readable
– relational database (e.g., Oracle, Postgres)
– object oriented database

Sequence format (for BLAST, prediction tools…):
– Fasta, RAW
– GCG
– NBRF/PIR
– MSF…
– standardized format ?
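As an illustration of the simplest of these formats, here is a minimal sketch of a Fasta reader. It assumes the usual convention (a header line starting with ">" followed by one or more sequence lines); the function name and the two example records are hypothetical and not taken from any database.

```python
def parse_fasta(text):
    """Return {identifier: sequence} for Fasta-formatted text."""
    records = {}
    identifier = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if identifier is not None:
                records[identifier] = "".join(chunks)
            fields = line[1:].split()
            identifier = fields[0] if fields else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        records[identifier] = "".join(chunks)
    return records


# Hypothetical example records (made-up sequences).
EXAMPLE = """>seq1 hypothetical example protein
MKTAYIAK
QRQISFVK
>seq2
ACGT
"""

if __name__ == "__main__":
    for name, seq in parse_fasta(EXAMPLE).items():
        print(name, len(seq), seq)
```

Only the first word of the header is kept as the identifier here; real pipelines often need the full description line as well.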
Sequence database : format
ID   EPO_HUMAN STANDARD; PRT; 193 AA.
AC   P01588; Q9UHA0; Q9UEZ5; Q9UDZ0;
DT   21-JUL-1986 (Rel. 01, Created)
DT   21-JUL-1986 (Rel. 01, Last sequence update)
DT   20-AUG-2001 (Rel. 40, Last annotation update)
DE   Erythropoietin precursor.
GN   EPO.
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE=85137899; PubMed=3838366;
RA   Jacobs K., Shoemaker C., Rudersdorf R., Neill S.D., Kaufman R.J.,
RA   Mufson A., Seehra J., Jones S.S., Hewick R., Fritsch E.F.,
RA   Kawakita M., Shimizu T., Miyake T.;
RT   "Isolation and characterization of genomic and cDNA clones of human
RT   erythropoietin.";
RL   Nature 313:806-810(1985).
….
CC   -!- FUNCTION: ERYTHROPOIETIN IS THE PRINCIPAL HORMONE INVOLVED IN THE
CC       REGULATION OF ERYTHROCYTE DIFFERENTIATION AND THE MAINTENANCE OF A
CC       PHYSIOLOGICAL LEVEL OF CIRCULATING ERYTHROCYTE MASS.
CC   -!- SUBCELLULAR LOCATION: SECRETED.
CC   -!- TISSUE SPECIFICITY: PRODUCED BY KIDNEY OR LIVER OF ADULT MAMMALS
CC       AND BY LIVER OF FETAL OR NEONATAL MAMMALS.
CC   -!- PHARMACEUTICAL: Available under the names Epogen (Amgen) and
CC       Procrit (Ortho Biotech).
…
DR   EMBL; X02158; CAA26095.1; -.
DR   EMBL; X02157; CAA26094.1; -.
DR   EMBL; M11319; AAA52400.1; -.
DR   EMBL; AF053356; AAC78791.1; -.
DR   EMBL; AF202308; AAF23132.1; -.
DR   EMBL; AF202306; AAF23132.1; JOINED.
….
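Because every line of such a flat file starts with a two-letter line code, the entry can be split into fields with very little code. The following is a rough sketch, assuming one record per line and a two-letter uppercase code followed by the content; the function names are hypothetical, and the excerpt is a shortened copy of the EPO_HUMAN entry shown above.

```python
from collections import defaultdict


def parse_flat_file_entry(entry_text):
    """Group the lines of one flat-file entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        code = line[:2]
        # Skip elision markers and anything that is not a line code.
        if len(code) < 2 or not (code.isalpha() and code.isupper()):
            continue
        fields[code].append(line[2:].strip())
    return dict(fields)


def accession_numbers(fields):
    """Return all accession numbers listed on the AC line(s)."""
    return [ac.strip()
            for line in fields.get("AC", [])
            for ac in line.split(";")
            if ac.strip()]


# Shortened excerpt of the entry shown above.
EXCERPT = """ID   EPO_HUMAN STANDARD; PRT; 193 AA.
AC   P01588; Q9UHA0; Q9UEZ5; Q9UDZ0;
DE   Erythropoietin precursor.
OS   Homo sapiens (Human).
....
DR   EMBL; X02158; CAA26095.1; -.
DR   EMBL; X02157; CAA26094.1; -.
"""

if __name__ == "__main__":
    entry = parse_flat_file_entry(EXCERPT)
    print(accession_numbers(entry))
    print(entry["DR"])
```

This is only meant to show why flat files are "human readable" yet still easy to process; it does not handle the continuation rules of multi-line fields such as CC or FT.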
• The 3 main public nucleic acid sequence databases are EMBL (Europe) / GenBank (USA) / DDBJ (Japan). They exchange their data and therefore offer « different views of the same data set » within 2 to 3 days (since 1990)
• EMBL: since 1982
• Specialized databases for the different types of RNAs (e.g. tRNA, rRNA, tmRNA, uRNA, etc…)
Remark: Genome-centric databases usually give access to several genomes, but some are « specialized » in particular organisms, e.g. TIGR: bacteria and plants
Ensembl provides a bioinformatics framework to organise biology around the sequences of large genomes.
Available now are: human, mouse, rat, fugu, zebrafish, mosquito, Drosophila, C. elegans, C. briggsae, chicken…
http://www.ensembl.org/
Ensembl/martview: example of queries
- Retrieve all mouse homologues of human disease genes containing transmembrane domains located between 1p22 and 1q22
- Retrieve the sequences 5kb upstream of all human « known » genes from chromosome 6
….
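To make the second query more concrete, here is a small sketch of the coordinate arithmetic behind "5 kb upstream of a gene". It is not the martview interface; the function name, the 1-based inclusive coordinate convention and the example coordinates are assumptions chosen for illustration. The point is that "upstream" depends on the strand.

```python
def upstream_region(gene_start, gene_end, strand, length=5000, chrom_length=None):
    """Return (start, end), 1-based inclusive, of the region upstream of a gene.

    On the "+" strand, upstream lies before gene_start; on the "-" strand
    it lies after gene_end. Returns None if no upstream base exists.
    """
    if strand == "+":
        start, end = gene_start - length, gene_start - 1
    elif strand == "-":
        start, end = gene_end + 1, gene_end + length
    else:
        raise ValueError("strand must be '+' or '-'")
    start = max(start, 1)                 # clip at the chromosome start
    if chrom_length is not None:
        end = min(end, chrom_length)      # clip at the chromosome end
    if start > end:
        return None
    return start, end


if __name__ == "__main__":
    # Hypothetical gene coordinates.
    print(upstream_region(10001, 12000, "+"))
    print(upstream_region(10001, 12000, "-"))
```

For a gene on the "-" strand the extracted sequence would additionally have to be reverse-complemented, which is left out of this sketch.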
UCSC Genome Browser: http://genome.cse.ucsc.edu/
(human, chimpanzee, mouse, rat, chicken, Fugu, Drosophila, C. briggsae, yeast, and SARS genomes)
TIGR: http://www.tigr.org/tdb/
(bacteria… and plants)
Database 1b: protein sequences
• SWISS-PROT: created in 1986 (A. Bairoch) http://www.expasy.org/sprot/
• TrEMBL: created in 1996; complement to SWISS-PROT; derived from EMBL CDS translations (« proteomic » version of EMBL)
• (PIR-PSD: Protein Information Resource) http://pir.georgetown.edu/
• Genpept: « proteomic » version of GenBank (~TrEMBL)
• RefSeq (NP_)
• PRF
• Many specialized protein databases for specific families or groups of proteins.
A central resource for protein sequences and function…

Since December 15, 2003, Swiss-Prot and TrEMBL constitute the UniProt Knowledgebase (with integration of the PIR data; PIR = Protein Information Resource)
Real life of a protein sequence …
[Diagram]
Nucleic acids: cDNAs, ESTs, genomes, … -> EMBL (entries with or without annotated CDS) -> CoDing Sequences provided by submitters* -> TrEMBL
Amino acids: TrEMBL -> Swiss-Prot (manually annotated); other inputs: direct submission (< 1%), PIR data
Outside the flow: data not submitted to public databases, delayed or cancelled…
* ~ 1 in 10 EMBL entries is associated with an annotated CDS (ESTs are not..)
Swiss-Prot + TrEMBL -> give access to all known* protein sequences
* submitted to the public databases (EMBL, GenBank, DDBJ, Swiss-Prot)
[Diagram: EMBL -> CDS -> TrEMBL -> Swiss-Prot]
• Swiss-Prot: annotation of sequence differences (conflicts, variants, splicing…)
• Once in Swiss-Prot, no more in TrEMBL -> minimal redundancy
• Average of 4.2 independent sequence reports for each human protein
Up-to-date sources:
Swiss-Prot (since 1986) -> ExPASy (www.expasy.org);
TrEMBL (since 1996) -> EBI (European Bioinformatics Institute) (www.ebi.ac.uk/trembl/).
In a Swiss-Prot entry, you can expect to find:
• All the names of a given protein (and of its gene);
• Its biological origin with links to the taxonomic databases;
• A selection of references;
• A summary of what is known about the protein: function, alternative products, PTM, tissue expression, disease, etc.;
• Numerous cross-references;
• Selected keywords;
• A description of important sequence features: domains, PTMs, variations, etc.;
• A (often corrected) protein sequence and the description of various isoforms/variants.
View « by default » on the ExPASy server:
• References: RN, RP, RC, RX, RA, RL lines
• Comments: CC lines
• Features: FT lines
• Sequence: SQ lines
• Names and taxonomy: DE, GN, OC, OS, OG lines
• Cross-references: DR lines
• Keywords: KW lines
• Accession number: ID, AC, DT lines
Sequence quality
• Sequencing errors ?
• Polymorphisms ?
• Alternative splicing ?
• Alternative initiation ?
• Usage of an alternative promoter ?
• RNA editing ?
• Selenocysteine ?
• Fragment ?
• Same gene ?
-> 1 gene / 1 species = 1 Swiss-Prot entry
For human: ~ 4.2 different independent sequence reports / gene
-> Identification and annotation of all sequence differences
Annotation (Comment lines)
• Function(s) and role(s); for enzymes:
  a. Catalytic activity (if EC number)
  b. Cofactor
  c. Enzyme regulation
  d. Pathway
• Subunit (protein/protein interactions)
• Subcellular location
• Alternative products (alt. splicing, alt. initiation, RNA editing)
• Tissue specificity (Northern and Western results)
• Developmental stage
• Induction
• Domain
• Post-translational modifications (PTM)
• Mass spectrometry
• Polymorphisms
• Disease
• Pharmaceutical
• Miscellaneous
• Similarities
• Caution
• Database (specialized cross-references)
Annotation/Curation (Comment lines)
Information is derived from:
• Publications; currently Swiss-Prot cites 1'500 different journals. 106 journals are cited more than 100 times.
• Databases;
• Personal communication;
• Prediction;
• Brain storming…
Experimental qualifiers:
« - »: experimentally proved;
« By similarity »: experimentally proved in an ortholog or in another member of the family;
« Probable »: not proved, but realistic;
« Potential »: predicted (bioinformatic tools).
Examples shown: ICOL_HUMAN (O75144), BRH2_HUMAN (Q9NY43), AAA1_HUMAN (Q9NS82)
Cross-references
• Explicit links to about 50 databases;
• Implicit cross-references to 30 additional db added by the ExPASy servers on the WWW (such as GenBank, Ensembl, …)
=> links to more than 80 databases from the ExPASy servers
• Currently 1.5 x 10^6 cross-references in Swiss-Prot
-> Connected with practically all the databases indexed under SRS.
Gasteiger et al., Curr. Issues Mol. Biol. (2001), 3(3): 47-55
Swiss-Prot: a central hub for molecular biology information
• Domains, functional sites, protein families: PROSITE, InterPro, Pfam, PRINTS, ProDom, SMART
• 2D-gel protein dbs: SWISS-2DPAGE, ANU-2DPAGE, ECO2DBASE, HSC-2DPAGE, Aarhus and Ghent, MAIZE-2DPAGE, PHCI-2DPAGE, PMMA-2DPAGE, COMPLUYEAST-2DPAGE, Siena-2DPAGE
• PTM: CarbBank, GlycoSuiteDB
• Human diseases: MIM
Cross-references: example ICE8_HUMAN (Q14790)
[Screenshot; labels: DNA (index of low redundancy), 3D, genomic]
Examples of implicit links to GenBank and DDBJ added ‘on the fly’ by the ExPASy server
FT lines = Feature table = Sequence description
Data derived from:
• Publications;
• Databases;
• Personal communication;
• Prediction.
Examples shown (ICOL_HUMAN, O75144): general topology; PTM
Sequence description: examples
• Polymorphisms (BRC2_HUMAN, P51587)
• Differences between the sequence shown and other submitted sequences
• Polymorphisms and alternative splicing (ICOL_HUMAN, O75144)
All the alternatively spliced sequences are available for BLAST searches and proteomic tools at the ExPASy server
Swiss-Prot & TrEMBL introduce a new arithmetical concept !
170’000 + 1’600’000 ≈ 1’200’000
Redundancy in TrEMBL & redundancy between TrEMBL and Swiss-Prot
• In 2 years… more than 2’000’000 protein sequences
• But, in the future, redundancy is going to decrease: « new » genome sequencing -> « new » proteins (AB, sept 2002)
In the case of human proteins, the redundancy is still very high:
11’900 + 45’000 ≈ about 22’000*
* human gene number estimation: 25’000-35’000
Are missing:
• Sequences not submitted to EMBL/GenBank/DDBJ (and PIR)
• Genes not yet predicted or known (« no CDS provided by the submitters » or no DNA sequence)
• Confidential data (patent application sequences)
• Immunoglobulins, T-cell receptors (-> UniParc)
• …
MS proteomics has verified more than 10% of human gene products, but has not identified significant numbers of unpredicted proteins (Southan C, Proteomics, 2004)
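The two "sums" above can be turned into a redundancy estimate. A minimal sketch, assuming that redundancy is defined as the fraction of entries that do not correspond to a distinct protein (this definition and the function name are illustrative choices, not something stated on the slides):

```python
def redundancy(total_entries, distinct_proteins):
    """Fraction of entries that are redundant with another entry."""
    return 1 - distinct_proteins / total_entries


# Figures quoted above (Swiss-Prot + TrEMBL entries vs. distinct proteins).
all_species = redundancy(170_000 + 1_600_000, 1_200_000)
human = redundancy(11_900 + 45_000, 22_000)

if __name__ == "__main__":
    print(f"all species: {all_species:.0%} of entries are redundant")
    print(f"human:       {human:.0%} of entries are redundant")
```

With these figures roughly one third of all entries, and close to two thirds of the human entries, would be redundant, which is consistent with the remark that redundancy is still very high for human proteins.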
UniProt (Universal Protein Resource): the world's most comprehensive catalog of information on proteins.
www.uniprot.org
UniProt consortium: joining the information contained in Swiss-Prot, TrEMBL and PIR-PSD

UniProt Archives (UniParc)
Gives access to archived raw protein sequences found in publicly accessible databases: Swiss-Prot, TrEMBL, PIR, EMBL, Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, Patent Offices.
UniParc permits the tracking of a protein sequence and its integration into various databases.

UniProt knowledgebase

TrEMBL
Computer annotated protein sequences
Release 29.1 of 15-Feb-2005: 1’614’107 entries
Submitted CDS are automatically integrated into TrEMBL, and coupled to:
• Merge of 100% identical sequences derived from the same organism,
• Protein family and domain attribution (InterPro),
• Automated annotation.
Use with extreme caution: also contains pseudogenes, incorrect CDS predictions, etc… and highly redundant !

Swiss-Prot
Manually annotated protein sequences
Release 46.1 of 15-Feb-2005: 170’140 entries
TrEMBL sequences are manually integrated into Swiss-Prot:
• Merge of variants (polymorphisms, alternative splicing, RNA editing, etc.) -> low redundancy and high accuracy of the protein sequence;
• Integration of biological and medical data derived from high-performance bioinformatic tools, as well as publications, external expertise, etc. -> high-quality manual annotation;
• Central hub for biological data: more than 80 links to relevant databases.
Integration of PIR data
> 95 % of proteins identified by proteomic studies are in Swiss-Prot

UniRef (UniRef100, UniRef90, UniRef50)
Three collections of sequence clusters from UniProt, EnsEMBL, IPI, EMBL_WGS
• One UniRef100 entry -> all identical sequences (including fragments) - reduction of 12%
• One UniRef90 entry -> sequences that have at least 90% identity - reduction of 40%
• One UniRef50 entry -> sequences that are at least 50% identical - reduction of 70%
Independently of the species.
UniRef is useful for comprehensive BLAST sequence searches by providing sets of representative sequences.
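The idea behind the UniRef100/90/50 clusters can be illustrated with a toy example. The sketch below is not the procedure used to build UniRef; it assumes a simple greedy scheme (longest sequence first, each sequence joins the first representative it matches) and a naive ungapped identity measure, and the sequences are made up.

```python
def ungapped_identity(a, b):
    """Best fraction of identical positions when sliding the shorter
    sequence along the longer one without gaps."""
    short, long_ = sorted((a, b), key=len)
    if not short:
        return 0.0
    best = 0
    for offset in range(len(long_) - len(short) + 1):
        window = long_[offset:offset + len(short)]
        best = max(best, sum(x == y for x, y in zip(short, window)))
    return best / len(short)


def greedy_cluster(sequences, threshold):
    """Cluster sequences; returns a list of (representative, members)."""
    clusters = []
    for seq in sorted(sequences, key=len, reverse=True):
        for representative, members in clusters:
            if ungapped_identity(seq, representative) >= threshold:
                members.append(seq)
                break
        else:
            # No representative is similar enough: start a new cluster.
            clusters.append((seq, [seq]))
    return clusters


# Made-up sequences: a full-length protein, a fragment of it,
# a 90% identical variant, a 50% identical relative, an unrelated one.
TOY = ["MKTAYIAKQR", "TAYIAK", "MKTAYIAKQK", "MKTAYLLLLL", "WWWWWWWWWW"]

if __name__ == "__main__":
    for threshold in (1.0, 0.9, 0.5):
        print(threshold, len(greedy_cluster(TOY, threshold)), "clusters")
```

Lowering the threshold merges more sequences into each cluster, which is why the reduction quoted above grows from UniRef100 to UniRef50.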
Take home message
• Be aware of the differences between TrEMBL and Swiss-Prot.
“Sequences are rarely deposited in a “mature” state; as with all scientific research, DNA and protein annotation is a continual process of learning, revision and corrections.”
“Sequencing error rates: ~1 base in 10’000”
“Making people aware of errors is good and great; making people aware that they’re responsible also for correcting errors is even greater”
C. Hardley, EMBO reports, 4(9), 2003.
Protein sequence databases
The NCBI-nr pathway (Entrez protein)
Real life of a protein sequence …
[Diagram: cDNAs, ESTs, genomes, … -> EMBL, GenBank, DDBJ -> Genpept. Outside the pathway: data not submitted to public databases, delayed or cancelled…]
• Contain information on sequence variations linked or not to genetic diseases;
• Mainly human, but see also OMIA - Online Mendelian Inheritance in Animals
• General db:
– OMIM
– HGMD - Human Gene Mutation db
– SVD - Sequence variation db
– HGBASE - Human Genic Bi-Allelic Sequences db
– dbSNP - Human single nucleotide polymorphism (SNP) db
• Disease-specific db: most of these databases are linked either to a single gene or to a single disease;
– p53 mutation db
– ADB - Albinism db (mutations in human genes causing albinism)
– Asthma and Allergy gene db
– ….
For human (Amos’ links)
Mutation/polymorphism
• No single source for all SNPs (~100 SNP db) !
• Generally modest size; the lack of coordination and format standards in these databases makes it difficult to access the data.
• ! The numbering of the mutated amino acid depends on the db (aa no 1 is not necessarily the initiator Met !)
• There are initiatives to unify these databases (political/funding problems): Mutation Database Initiative (4th July 1996).
-> SVD - Sequence Variation Database project at EBI (HMutDB) http://www.ebi.ac.uk/mutations/central/
-> HUGO Mutation Database Initiative (MDI), Human Genome Variation Society http://www.genomic.unimelb.edu.au/mdi/dblist/dblist.html
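The numbering problem mentioned above is easy to underestimate. As a sketch, assume one database numbers residues from the initiator Met of the precursor while another numbers them from the first residue of the mature protein, after removal of an N-terminal segment (for example a signal peptide). The function names and the value of 27 removed residues are assumptions used only for illustration.

```python
def to_mature_numbering(precursor_position, n_removed):
    """Convert a precursor position (initiator Met = 1) to mature numbering."""
    if precursor_position <= n_removed:
        raise ValueError("this residue is not part of the mature protein")
    return precursor_position - n_removed


def to_precursor_numbering(mature_position, n_removed):
    """Convert a mature-protein position back to precursor numbering."""
    if mature_position < 1:
        raise ValueError("positions are 1-based")
    return mature_position + n_removed


if __name__ == "__main__":
    N_REMOVED = 27  # assumed length of the removed N-terminal segment
    # The same residue carries two different numbers:
    print(to_mature_numbering(130, N_REMOVED))    # numbering in the mature protein
    print(to_precursor_numbering(103, N_REMOVED)) # numbering in the precursor
```

Before comparing a variant between two databases, it is therefore worth checking which residue each of them calls "number 1".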
Categories of databases for Life Sciences
• Sequences (DNA, protein)
• Genomics
• Mutation/polymorphism
• Protein domain/family (----> tools)
• Proteomics (2D gel, Mass Spectrometry)
• 3D structure
• Metabolism
• Bibliography
• ‘Others’ (Microarrays, Protein protein interaction…)
Protein domain/family: some definitions
• Most proteins have « modular » structures
• Estimation: ~ 3 domains / protein
• Domains (conserved sequences or structures) are identified by multiple sequence alignments
• Domains can be defined by different methods:
– Pattern (regular expression); used for very conserved domains
– Profiles (weighted matrices): two-dimensional tables of position-specific match, gap and insertion scores, derived from aligned sequence families; used for less conserved domains
– Hidden Markov Model (HMM): probabilistic models; another method to generate profiles.
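To show what "pattern (regular expression)" means in practice, here is a sketch that translates a PROSITE-style pattern into a Python regular expression. It assumes the common pattern syntax (elements separated by "-", x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) for repetition, < and > for the sequence ends); the function name and the test sequence are hypothetical, and rarer syntax variants are not handled.

```python
import re

_ELEMENT = re.compile(r"(<?)([^()<>]+)(?:\((\d+(?:,\d+)?)\))?(>?)")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, repeat, end = match.groups()
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # any residue except these
        elif not (core.startswith("[") and core.endswith("]")) \
                and not (len(core) == 1 and core.isupper()):
            raise ValueError(f"unsupported pattern element: {element!r}")
        parts.append(("^" if start else "") + core
                     + ("{" + repeat + "}" if repeat else "")
                     + ("$" if end else ""))
    return "".join(parts)


if __name__ == "__main__":
    # N-glycosylation site pattern, written in PROSITE syntax.
    regex = prosite_to_regex("N-{P}-[ST]-{P}")
    print(regex)
    # Made-up sequence used only to demonstrate the search.
    for hit in re.finditer(regex, "AANGSAAANPSA"):
        print(hit.start() + 1, hit.group())
```

Such short patterns match by chance quite often, which is one reason why profiles and HMMs are preferred for less conserved domains.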
Protein domain/family databases
• Contain biologically significant « patterns / profiles / HMMs » formulated in such a way that, with appropriate computational tools, they can rapidly and reliably determine to which known family of proteins (if any) a new sequence belongs
• Used as a tool to identify the function of uncharacterized proteins translated from genomic or cDNA sequences (« functional diagnostic »)
• Either manually curated (e.g. PROSITE, PfamA, PRINTS, SMART, TIGRFAM, etc.) or automatically generated (e.g. PfamB, ProDom, DOMO)
Databases 5: proteomics
• Contain information obtained by 2D-PAGE: images of master gels and description of identified proteins
• Examples: SWISS-2DPAGE, ECO2DBASE, Maize-2DPAGE, Sub2D, Cyano2DBase, etc.
• Composed of image and text files
• There is currently no protein Mass Spectrometry (MS) database (not for long…)
[Screenshots of SWISS-2DPAGE; annotations shown:]
– This protein does not exist in the current release of SWISS-2DPAGE.
– Theoretically computed pI and MW
– Theoretically computed pI and MW with potential phosphorylation and acetylation sites
– Experimentally determined position
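A theoretical MW like the one mentioned above can be approximated directly from the sequence. The sketch below sums average residue masses and adds one water molecule; it is an assumption-laden illustration (standard average masses, no post-translational modifications, no pI calculation) and not the ExPASy implementation. The example peptide is made up.

```python
# Average residue masses in Da (amino acid mass minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01524  # added once for the free N- and C-termini


def theoretical_mw(sequence):
    """Approximate average molecular weight (Da) of an unmodified peptide."""
    sequence = sequence.upper()
    unknown = set(sequence) - set(RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unsupported residue codes: {sorted(unknown)}")
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


if __name__ == "__main__":
    # Made-up peptide used only as an example.
    print(f"{theoretical_mw('MKTAYIAKQR'):.2f} Da")
```

Phosphorylation or acetylation sites, as mentioned above, would shift both the mass and the pI, which is why the experimentally determined position on the gel can differ from the computed one.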
Databases 6: 3D structure
• Contain the spatial coordinates of macromolecules whose 3D structure has been obtained by X-ray or NMR studies
• Proteins represent more than 90% of available structures (others are DNA, RNA, sugars, viruses, protein/DNA complexes…)
• Only one: PDB (Protein Data Bank)
PDB: Protein Data Bank
www.rcsb.org/pdb/
• Managed by the Research Collaboratory for Structural Bioinformatics (RCSB) (USA).
• Contains structure data on proteins, nucleic acids, protein-nucleic acid complexes, and viruses.
• Associated specialized programs allow the visualization of the corresponding 3D structure (e.g., SwissPDB-viewer, Chime, Rasmol).
• Currently there are ~29’500 structures for about 8’000 different proteins, but far fewer protein families (highly redundant) !
PDB: example
HEADER LYASE(OXO-ACID) 01-OCT-91 12CA 12CA 2
COMPND CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II) 12CA 3
COMPND 2 (E.C.4.2.1.1) MUTANT WITH VAL 121 REPLACED BY ALA (/V121A) 12CA 4
SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN 12CA 5
AUTHOR S.K.NAIR,D.W.CHRISTIANSON 12CA 6
REVDAT 1 15-OCT-92 12CA 0 12CA 7
JRNL AUTH S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE 12CA 8
JRNL TITL ALTERING THE MOUTH OF A HYDROPHOBIC POCKET. 12CA 9
JRNL TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE 12CA 10
JRNL TITL 3 /II$ MUTANTS AT RESIDUE VAL-121 12CA 11
JRNL REF J.BIOL.CHEM. V. 266 17320 1991 12CA 12
JRNL REFN ASTM JBCHA3 US ISSN 0021-9258 071 12CA 13
REMARK 1 12CA 14
REMARK 2 12CA 15
REMARK 2 RESOLUTION. 2.4 ANGSTROMS. 12CA 16
REMARK 3 12CA 17
REMARK 3 REFINEMENT. 12CA 18
REMARK 3 PROGRAM PROLSQ 12CA 19
REMARK 3 AUTHORS HENDRICKSON,KONNERT 12CA 20
REMARK 3 R VALUE 0.170 12CA 21
REMARK 3 RMSD BOND DISTANCES 0.011 ANGSTROMS 12CA 22
REMARK 3 RMSD BOND ANGLES 1.3 DEGREES 12CA 23
REMARK 4 12CA 24
REMARK 4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL 12CA 25
REMARK 4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND, 12CA 26
REMARK 4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES. 12CA 27
………

PDB (cont.)
SHEET 3 S10 PHE 66 PHE 70 -1 O ASN 67 N LEU 60 12CA 68
SHEET 4 S10 TYR 88 TRP 97 -1 O PHE 93 N VAL 68 12CA 69
SHEET 5 S10 ALA 116 ASN 124 -1 O HIS 119 N HIS 94 12CA 70
SHEET 6 S10 LEU 141 VAL 150 -1 O LEU 144 N LEU 120 12CA 71
SHEET 7 S10 VAL 207 LEU 212 1 O ILE 210 N GLY 145 12CA 72
SHEET 8 S10 TYR 191 GLY 196 -1 O TRP 192 N VAL 211 12CA 73
SHEET 9 S10 LYS 257 ALA 258 -1 O LYS 257 N THR 193 12CA 74
SHEET 10 S10 LYS 39 TYR 40 1 O LYS 39 N ALA 258 12CA 75
TURN 1 T1 GLN 28 VAL 31 TYPE VIB (CIS-PRO 30) 12CA 76
TURN 2 T2 GLY 81 LEU 84 TYPE II(PRIME) (GLY 82) 12CA 77
TURN 3 T3 ALA 134 GLN 137 TYPE I (GLN 136) 12CA 78
TURN 4 T4 GLN 137 GLY 140 TYPE I (ASP 139) 12CA 79
TURN 5 T5 THR 200 LEU 203 TYPE VIA (CIS-PRO 202) 12CA 80
TURN 6 T6 GLY 233 GLU 236 TYPE II (GLY 235) 12CA 81
CRYST1 42.700 41.700 73.000 90.00 104.60 90.00 P 21 2 12CA 82
ORIGX1 1.000000 0.000000 0.000000 0.00000 12CA 83
ORIGX2 0.000000 1.000000 0.000000 0.00000 12CA 84
ORIGX3 0.000000 0.000000 1.000000 0.00000 12CA 85
SCALE1 0.023419 0.000000 0.006100 0.00000 12CA 86
SCALE2 0.000000 0.023981 0.000000 0.00000 12CA 87
SCALE3 0.000000 0.000000 0.014156 0.00000 12CA 88
ATOM 1 N TRP 5 8.519 -0.751 10.738 1.00 13.37 12CA 89
ATOM 2 CA TRP 5 7.743 -1.668 11.585 1.00 13.42 12CA 90
ATOM 3 C TRP 5 6.786 -2.502 10.667 1.00 13.47 12CA 91
ATOM 4 O TRP 5 6.422 -2.085 9.607 1.00 13.57 12CA 92
ATOM 5 CB TRP 5 6.997 -0.917 12.645 1.00 13.34 12CA 93
ATOM 6 CG TRP 5 5.784 -0.209 12.221 1.00 13.40 12CA 94
ATOM 7 CD1 TRP 5 5.681 1.084 11.797 1.00 13.29 12CA 95
ATOM 8 CD2 TRP 5 4.417 -0.667 12.221 1.00 13.34 12CA 96
ATOM 9 NE1 TRP 5 4.388 1.418 11.515 1.00 13.30 12CA 97
ATOM 10 CE2 TRP 5 3.588 0.375 11.797 1.00 13.35 12CA 98
ATOM 11 CE3 TRP 5 3.837 -1.877 12.645 1.00 13.39 12CA 99
ATOM 12 CZ2 TRP 5 2.216 0.208 11.656 1.00 13.39 12CA 100
ATOM 13 CZ3 TRP 5 2.465 -2.043 12.504 1.00 13.33 12CA 101
ATOM 14 CH2 TRP 5 1.654 -1.001 12.009 1.00 13.34 12CA 102
…….
Coordinates of each atom
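Since each ATOM record carries the x, y, z coordinates of one atom, distances can be computed directly from the file. The sketch below reads the whitespace-separated ATOM lines exactly as they appear in the 12CA example above (without a chain identifier). This is a simplifying assumption: actual PDB files are fixed-column, so a robust parser would slice by column rather than split on spaces. The function names are hypothetical.

```python
import math


def parse_atom_line(line):
    """Parse one whitespace-separated ATOM line as shown on the slide.

    Assumed field order: ATOM serial name residue resnum x y z occupancy B ...
    """
    fields = line.split()
    if len(fields) < 8 or fields[0] != "ATOM":
        raise ValueError("not an ATOM record")
    return {
        "serial": int(fields[1]),
        "name": fields[2],
        "residue": fields[3],
        "resnum": int(fields[4]),
        "xyz": tuple(float(v) for v in fields[5:8]),
    }


def distance(atom_a, atom_b):
    """Euclidean distance between two atoms, in Angstroms."""
    return math.dist(atom_a["xyz"], atom_b["xyz"])


# First two ATOM records of the 12CA example above.
N_LINE = "ATOM 1 N TRP 5 8.519 -0.751 10.738 1.00 13.37 12CA 89"
CA_LINE = "ATOM 2 CA TRP 5 7.743 -1.668 11.585 1.00 13.42 12CA 90"

if __name__ == "__main__":
    n_atom = parse_atom_line(N_LINE)
    ca_atom = parse_atom_line(CA_LINE)
    print(f"N-CA distance in TRP 5: {distance(n_atom, ca_atom):.2f} A")
```

The result is close to the length expected for an N-CA bond, a quick sanity check that the coordinates were read correctly.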
The same PDB entry “visualized” with Chime
Industry of databases around PDB
- HSSP: Homology-derived secondary structure of proteins. http://www.sander.ebi.ac.uk/hssp/
- Structure classification:
  - CATH
  - SCOP
  - …
- Homology-derived 3D structure db: Swiss-Model Repository (SMR): feb 2005: 555’900 models.
http://swissmodel.expasy.org/repository/
Annotated 3D comparative protein structure models generated by the fully automated homology-modelling pipeline SWISS-MODEL.
Precomputed 3D models of protein domains (~200 amino acids; biggest model: 1500 aa) that share about 40 % similarity with an experimentally determined 3D template.
Databases 7: metabolic
• Contain information that describes enzymes, biochemical reactions and metabolic pathways;
• ENZYME and BRENDA: nomenclature databases that store information on enzyme names and reactions;
• Metabolic databases: EcoCyc (specialized on Escherichia coli), KEGG, EMP/WIT;
Usually these databases are tightly coupled with query software that allows the user to visualise reaction schemes.
• There are about 3750 “EC numbers”:
– ~ 1900 are linked to a Swiss-Prot sequence
– ~ 200 are linked to a TrEMBL sequence
– ~ 1450 cannot be linked to any sequence !
BRENDA: useful to prepare lab experiments !
http://www.brenda.uni-koeln.de/
IntEnz = Enzyme + BRENDA + NC-IUBMB nomenclature
http://www.ebi.ac.uk/intenz/index.html
KEGG: http://www.genome.ad.jp/kegg
Databases 8: bibliographic
• Bibliographic reference databases contain citations and abstracts of published life science articles;
• Example: Medline
• Other more specialized databases also exist

Medline
• Comprehensive database of primary scientific literature in the biomedical area.
• More than 4,000 biomedical journals published in the United States and 70 other countries
• Contains over 15 million indexed citations from 1966 to the present
• Citations prior to the mid-1960s are located in OLDMEDLINE.
• Contains links to biological db
– Many papers not dealing with humans are not in Medline !
– Before 1970, keeps only the first 10 authors !
– Not all journals have citations since 1966 ! (they go back…)
• Maintained by the US National Library of Medicine.

PubMed
• Allows access to the citations from MEDLINE and additional life science journals.
• Includes links to many sites providing full text articles and other related resources.
• Also gives access to:
– In Process Citations
– Publisher supplied citations: citations directly submitted to PubMed ([Record as supplied by publisher]).
• PMID (PubMed ID), UI (Medline ID)
New:
DOIs (Digital Object Identifiers) are names (characters and/or digits) assigned to objects of intellectual property such as electronic journal articles, images, learning objects, ebooks, or any other kind of content.
Server: http://dx.doi.org
-> biggest advance to track documents on the web !
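In practice, a DOI is resolved by appending it to the address of the resolver server named above. A minimal sketch, assuming the usual "prefix/suffix" form with a prefix starting with "10."; the function name is hypothetical and the example DOI is only a placeholder.

```python
from urllib.parse import quote


def doi_to_url(doi, resolver="http://dx.doi.org"):
    """Build the resolver URL for a DOI such as '10.1000/182'."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    prefix, separator, suffix = doi.partition("/")
    if not (separator and prefix.startswith("10.") and suffix):
        raise ValueError(f"not a valid DOI: {doi!r}")
    # Percent-encode characters that are not safe in a URL path.
    return f"{resolver}/{quote(doi, safe='/')}"


if __name__ == "__main__":
    print(doi_to_url("doi:10.1000/182"))
```

Because the DOI stays the same when a publisher reorganises its web site, the resulting link keeps working where an ordinary URL would break.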
Databases 9: others
• There are many databases that cannot be classified in the categories listed previously;
• Examples: ReBase (restriction enzymes), TRANSFAC (transcription factors), CarbBank, GlycoSuiteDB (linked sugars), Protein-protein interactions db (Intact, BIND), Protease db (MEROPS), biotechnology patents db, etc.;
• As well as many other resources covering all kinds of aspects, including new ones, of macromolecules and molecular biology (e.g. microarrays).
Amos links: Microarrays
Interactome
- Protein/protein interaction: description from 1 to more than 20’000 interactions / publication
- Several databases: Intact, BIND, DIP.
- Proteomics standard initiative since 2005
http://www.ebi.ac.uk/intact/index.html
Gene Ontology (GO) database
The Gene Ontology (GO) project (http://www.geneontology.org/) provides structured, controlled vocabularies and classifications that cover several domains of molecular and cellular biology and are freely available for community use in the annotation of genes, gene products and sequences.
The three organizing principles of GO are molecular function (MF), biological process (BP) and cellular component (CC).
Proliferation of databases
• Which one contains the highest quality data ?
• Which is the most comprehensive ?
• Which is the most up-to-date ?
• Which is the least redundant ?
• Which is the best indexed (allows complex queries) ?
• Which Web server responds most quickly ?
• …….??????
To benefit from the data stored in a database, we need:
• easy access to the information
-> a method for extracting only that information needed to answer a specific biological question
Examples: Entrez (NCBI), SRS (Europe), tools such as BLAST, PeptIdent…
Some important practical remarks
• Databases: many errors (automated annotation) !
• Not all db are available on all servers
• The update frequency is not the same for all servers;
• Some servers automatically add cross-references to an entry (implicit links) in addition to already existing links (explicit links)… different looks…