Protein Database Bioinformatics Lab
Jan 16, 2016
Sequence Databases
• GenBank -- DNA sequences and derived protein sequences
• EMBL -- DNA sequences and derived protein sequences
• DDBJ -- DNA sequences and derived protein sequences
• SWISS-PROT -- Protein sequences
• PDB -- three-dimensional structures of proteins
• GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences.
• A new release is made every two months. GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI.
• These three organizations exchange data on a daily basis.
GenBank, EMBL & DDBJ
• GenBank Release 122.0, Feb. 15, 2001: 10,897,000 sequence records; 11,720,000,000 bases
• EMBL Release 66, Mar. 2, 2000: 11,169,673 sequence records; 11,916,112,872 bases
• DDBJ: run by the Center for operating DDBJ, National Institute of Genetics (NIG), Japan, established in April 1995.
Protein Databases
Protein databases come in many kinds: protein sequences, motifs, classification, structure, structure alignment, curation
• GenBank, EMBL and DDBJ (derived sequences, http://www.ncbi.nlm.nih.gov/gorf/gorf.html)
• SWISS-PROT, PIR (sequences)
• PROSITE, PRINTS (sequence motifs)
• HSSP, FSSP (classification, alignment)
• PDB (3-D structure)
SWISS-PROT/TrEMBL
• Annotated protein sequences
• Established in 1986
• Developed by the SWISS-PROT groups at SIB and at EBI.
• Maintained collaboratively, since 1987, by the Department of Medical Biochemistry of the University of Geneva and the EMBL Data Library (now the EMBL Outstation - The European Bioinformatics Institute (EBI)).
• Website: http://www.expasy.ch/
Different Features of SWISS-PROT
• Format follows that of EMBL as closely as possible
• Curated protein sequence database
• Three differences from other protein sequence databases:
1. Strives to provide a high level of annotation
2. Minimal level of redundancy
3. High level of integration with other databases
Three Distinct Criteria
1. Annotation
Core data: the sequence data; the citation information (bibliographical references); and the taxonomic data (description of the biological source of the protein).
Annotation: protein functions, post-translational modifications, domains and sites, secondary structure, quaternary structure, similarities to other proteins, diseases associated with deficiencies in the protein, sequence conflicts, variants, etc.
2. Minimal Redundancy
Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. SWISS-PROT tries as much as possible to merge all these data so as to minimize redundancy. If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry.
3. Integration With Other Databases
• SWISS-PROT and TrEMBL - Protein sequences
• PROSITE - Protein families and domains
• SWISS-2DPAGE - Two-dimensional polyacrylamide gel electrophoresis
• SWISS-3DIMAGE - 3D images of proteins and other biological macromolecules
• SWISS-MODEL Repository - Automatically generated protein models
• CD40Lbase - CD40 ligand defects
• ENZYME - Enzyme nomenclature
• SeqAnalRef - Sequence analysis bibliographic references
SWISS-PROT/TrEMBL
• TrEMBL is a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT
• SWISS-PROT Release 39.15 of 19-Mar-2001: 94,152 entries
• TrEMBL Release 16.2 of 23-Mar-2001: 436,924 entries
SWISS-PROT FORMAT
Line code  Content                        Occurrence in an entry
ID         Identification                 Once; starts the entry
AC         Accession number(s)            One or more
DT         Date                           Three times
DE         Description                    One or more
GN         Gene name(s)                   Optional
OS         Organism species               One or more
OG         Organelle                      Optional
OC         Organism classification        One or more
RN         Reference number               One or more
RP         Reference position             One or more
RC         Reference comment(s)           Optional
RX         Reference cross-reference(s)   Optional
RA         Reference authors              One or more
RT         Reference title                Optional
RL         Reference location             One or more
CC         Comments or notes              Optional
DR         Database cross-references      Optional
KW         Keywords                       Optional
FT         Feature table data             Optional
SQ         Sequence header                Once
(blanks)   Sequence data                  One or more
//         Termination line               Once; ends the entry
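The line codes above are enough to read an entry programmatically. The following is a minimal sketch, assuming the usual flat-file layout in which the two-letter code sits in columns 1-2 and the data starts at column 6. The function name `parse_swissprot` and the sample entry (ID, accession number, sequence) are made up for illustration and are not a real SWISS-PROT record.

```python
from collections import defaultdict


def parse_swissprot(text):
    """Split flat-file text into entries; each entry maps line code -> list of line contents."""
    entries = []
    current = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):            # termination line: ends the entry
            entries.append(dict(current))
            current = defaultdict(list)
        elif line.startswith("  "):          # blank line code: sequence data
            current["sequence"].append(line.replace(" ", ""))
        elif len(line) >= 2:
            code, content = line[:2], line[5:]   # code in columns 1-2, data from column 6
            current[code].append(content)
    return entries


# Hypothetical entry, invented for this example.
SAMPLE = """\
ID   TEST_HUMAN     STANDARD;      PRT;    12 AA.
AC   X00001;
DE   HYPOTHETICAL EXAMPLE PROTEIN.
OS   Homo sapiens (Human).
SQ   SEQUENCE   12 AA;
     MKTAYIAKQR QI
//
"""

entries = parse_swissprot(SAMPLE)
entry = entries[0]
sequence = "".join(entry["sequence"])
print(entry["ID"][0].split()[0], entry["AC"], sequence)
```

Keeping every line code as a list reflects the "One or more" occurrences in the table; codes that occur once simply yield a one-element list.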
Access to SWISS-PROT and TrEMBL
• SRS - Access to SWISS-PROT, TrEMBL and other databases using the Sequence Retrieval System
• Full text search in SWISS-PROT and TrEMBL
  • by accession number or ID (AC or ID line; SWISS-PROT and TrEMBL)
  • by description or identification (any word in the DE, OS, OG, GN and ID lines; SWISS-PROT and TrEMBL)
  • by author (RA line; SWISS-PROT and TrEMBL)
  • by citation (RL line; SWISS-PROT only)
• Retrieve a list of SWISS-PROT/TrEMBL entries
• Randomly retrieve a SWISS-PROT/TrEMBL entry
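Each search mode listed above restricts the match to particular line codes. The sketch below imitates that idea locally, over entries already held in memory as dictionaries (line code -> list of line contents). It is not the ExPASy or SRS service; the function `search_entries` and both sample entries are hypothetical, and it does a simple case-insensitive substring match rather than true word matching.

```python
def search_entries(entries, word, codes=("DE", "OS", "OG", "GN", "ID")):
    """Return the entries in which `word` occurs on any of the given line codes."""
    needle = word.lower()
    hits = []
    for entry in entries:
        # Join the contents of the selected line codes into one searchable string.
        text = " ".join(" ".join(entry.get(code, [])) for code in codes)
        if needle in text.lower():
            hits.append(entry)
    return hits


# Hypothetical, hand-made entries used only to exercise the function.
ENTRIES = [
    {"ID": ["EXA1_HUMAN"], "AC": ["X00001;"], "DE": ["EXAMPLE KINASE."],
     "OS": ["Homo sapiens (Human)."], "RA": ["SMITH J."]},
    {"ID": ["EXA2_MOUSE"], "AC": ["X00002;"], "DE": ["EXAMPLE PROTEASE."],
     "OS": ["Mus musculus (Mouse)."], "RA": ["JONES K."]},
]

by_description = search_entries(ENTRIES, "kinase")                    # DE, OS, OG, GN, ID lines
by_author = search_entries(ENTRIES, "jones", codes=("RA",))           # RA line
by_accession = search_entries(ENTRIES, "X00002", codes=("AC", "ID"))  # AC or ID line
```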
Protein Data Bank
• PDB is an archive of three-dimensional structures of proteins; some nucleic acids are included as well
• PDB is operated by RCSB (Research Collaboratory for Structural Bioinformatics), funded by NSF, DOE, and two units of NIH: NIGMS (National Institute of General Medical Sciences) and NLM (National Library of Medicine).
• Established at BNL (Brookhaven National Laboratory) in 1971 as an archive for biological macromolecular crystal structures
• In the 1980s, the number of deposited structures began to increase dramatically.
• In October 1998, the management of the PDB became the responsibility of RCSB.
• Website: http://www.rcsb.org
PDB Holdings List: 27-Mar-2001
Rows: experimental technique. Columns: molecule type.

Exp. Tech.                   Proteins, Peptides,  Protein/Nucleic  Nucleic  Carbohydrates  Total
                             and Viruses          Acid Complexes   Acids
X-ray Diffraction and other  11045                526              552      14             12137
NMR                          1832                 71               366      4              2273
Theoretical Modeling         281                  19               21       0              321
Total                        13158                616              939      18             14731

5032 Structure Factor Files
968 NMR Restraint Files
PDB Content Growth
PDB Growth in New Folds
PDB Data File Format
• There are two main formats: PDB and CIF
• PDB is a fixed-column format: each field occupies fixed column positions on the line
• CIF is a free format: fields are separated by whitespace
PDB Format
• HEADER : First line of the entry, contains PDB ID code, classification, and date of deposition.
• OBSLTE : Statement that the entry has been removed from distribution and list of the ID code(s) which replaced it.
• TITLE : Description of the experiment represented in the entry.
• CAVEAT : Severe error indicator. Entries with this record must be used with care.
• COMPND : Description of macromolecular contents of the entry.
• SOURCE : Biological source of macromolecules in the entry.
• KEYWDS : List of keywords describing the macromolecule.
• EXPDTA : Experimental technique used for the structure determination.
• AUTHOR : List of contributors.
• REVDAT : Revision date and related information.
• SPRSDE : List of entries withdrawn from release and replaced by current entry.
• JRNL : Literature citation that defines the coordinate set.
• REMARK : General remarks, some are structured and some are free form.
• DBREF : Reference to the entry in the sequence database(s).
• SEQADV : Identification of conflicts between PDB and the named sequence database.
• SEQRES : Primary sequence of backbone residues.
• MODRES : Identification of modifications to standard residues.
• HET : Identification of non-standard groups or residues (heterogens).
• HETNAM : Compound name of the heterogens.
• HETSYN : Synonymous compound names for heterogens.
• FORMUL : Chemical formula of non-standard groups.
• HELIX : Identification of helical substructures.
• SHEET : Identification of sheet substructures.
• TURN : Identification of turns.
• SSBOND : Identification of disulfide bonds.
• LINK : Identification of inter-residue bonds.
• HYDBND : Identification of hydrogen bonds.
• SLTBRG : Identification of salt bridges.
• CISPEP : Identification of peptide residues in cis conformation.
• SITE : Identification of groups comprising important sites.
• CRYST1 : Unit cell parameters, space group, and Z.
• ORIGXn : Transformation from orthogonal coordinates to the submitted coordinates (n = 1, 2, or 3).
• SCALEn : Transformation from orthogonal coordinates to fractional crystallographic coordinates (n = 1, 2, or 3).
• MTRIXn : Transformations expressing non-crystallographic symmetry (n = 1, 2, or 3). There may be multiple sets of these records.
• TVECT : Translation vector for infinite covalently connected structures.
• MODEL : Specification of model number for multiple structures in a single coordinate entry.
• ATOM : Atomic coordinate records for standard groups.
• SIGATM : Standard deviations of atomic parameters.
• ANISOU : Anisotropic temperature factors.
• SIGUIJ : Standard deviations of anisotropic temperature factors.
• TER : Chain terminator.
• HETATM : Atomic coordinate records for heterogens.
• ENDMDL : End-of-model record for multiple structures in a single coordinate entry.
• CONECT : Connectivity records.
• MASTER : Control record for bookkeeping.
• END : Last record in the file.
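Every record in the list above is identified by its record name, which occupies the first six columns of the line. As a rough sketch of how a reader program might dispatch on record type, the snippet below counts the records in a handful of lines. The helper names `record_name` and `count_records` are made up here, and the sample lines are shortened versions of 2MCG records.

```python
from collections import Counter


def record_name(line):
    """Record name: the first six columns of a PDB line, with padding removed."""
    return line[:6].strip()


def count_records(lines):
    """Count how many lines of each record type occur, skipping blank lines."""
    return Counter(record_name(line) for line in lines if line.strip())


PDB_LINES = [
    "HEADER    IMMUNOGLOBULIN",
    "COMPND    IMMUNOGLOBULIN LAMBDA LIGHT CHAIN DIMER",
    "ATOM      1  N   PCA 1   1      23.624 -24.231 101.873  1.00 17.85",
    "ATOM      2  CA  PCA 1   1      23.296 -22.902 102.481  1.00 17.38",
    "HETATM 3215  O   HOH     1      26.302 -28.430 111.973  1.00  4.66",
    "END",
]

counts = count_records(PDB_LINES)
print(counts["ATOM"], counts["HETATM"])
```

Slicing the first six columns is used instead of splitting on whitespace because a six-letter name such as HETATM can run directly into a five-digit serial number (`HETATM12345`), in which case a whitespace split would return the fused token.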
An Example of PDB
HEADER IMMUNOGLOBULIN 09-MAY-89 2MCG 2MCG 2
COMPND IMMUNOGLOBULIN LAMBDA LIGHT CHAIN DIMER (/MCG$) 2MCG 3
COMPND 2 (TRIGONAL FORM) 2MCG 4
SOURCE HUMAN (HOMO $SAPIENS) 2MCG 5
AUTHOR K.R.ELY,J.N.HERRON,A.B.EDMUNDSON 2MCG 6
REVDAT 2 15-JUL-92 2MCGA 1 SPRSDE 2MCGA 1
SPRSDE 15-OCT-90 2MCG 1MCG 2MCGA 2
JRNL AUTH K.R.ELY,J.N.HERRON,M.HARKER,A.B.EDMUNDSON 2MCG 9
JRNL TITL THREE-DIMENSIONAL STRUCTURE OF A LIGHT CHAIN 2MCG 10
REMARK 1 REFERENCE 1 2MCG 16
REMARK 1 AUTH A.B.EDMUNDSON,K.R.ELY,J.N.HERRON,B.D.CHESON 2MCG 17
SEQRES 1 1 216 PCA SER ALA LEU THR GLN PRO PRO SER ALA SER GLY SER 2MCG 183
FORMUL 3 HOH *318(H2 O1) 2MCG 217
SSBOND 1 CYS 1 22 CYS 1 90 2MCG 218
CRYST1 72.300 72.300 185.900 90.00 90.00 120.00 P 31 2 1 6 2MCG 223
ORIGX1 0.013831 0.007985 0.000000 0.00000 2MCG 224
ORIGX2 0.000000 0.015971 0.000000 0.00000 2MCG 225
ORIGX3 0.000000 0.000000 0.005379 0.00000 2MCG 226
SCALE1 0.013831 0.007985 0.000000 0.00000 2MCG 227
SCALE2 0.000000 0.015971 0.000000 0.00000 2MCG 228
SCALE3 0.000000 0.000000 0.005379 0.00000 2MCG 229
ATOM 1 N PCA 1 1 23.624 -24.231 101.873 1.00 17.85 2MCG 230
ATOM 2 CA PCA 1 1 23.296 -22.902 102.481 1.00 17.38 2MCG 231
ATOM 3 C PCA 1 1 24.304 -22.495 103.531 1.00 16.74 2MCG 232
ATOM 4 O PCA 1 1 23.962 -21.756 104.487 1.00 16.81 2MCG 233
ATOM 5 CB PCA 1 1 21.845 -23.057 103.035 1.00 18.02 2MCG 234
ATOM 6 CG PCA 1 1 21.816 -24.552 103.492 1.00 18.36 2MCG 235
ATOM 7 CD PCA 1 1 23.109 -25.217 102.974 1.00 18.57 2MCG 236
ATOM 8 OE PCA 1 1 23.354 -26.423 103.256 1.00 19.02 2MCG 237
TER 3214 SER 2 216 2MCG3443
HETATM 3215 O HOH 1 26.302 -28.430 111.973 1.00 4.66 2MCG3444
CONECT 145 144 660 2MCG3762
MASTER 170 0 0 0 0 0 0 6 3530 2 10 34 2MCGA 5
END 2MCG3773
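Because the PDB format is column-based, an ATOM record is read by slicing fixed column ranges rather than by splitting on spaces. The sketch below assumes the standard ATOM column layout (serial in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, x/y/z in 31-54, occupancy in 55-60, temperature factor in 61-66). The function name `parse_atom` is made up, and the sample line is rebuilt from the second 2MCG atom with a format string so that the columns are exact.

```python
def parse_atom(line):
    """Slice one ATOM/HETATM record by column position (0-based slices of 1-based columns)."""
    return {
        "serial": int(line[6:11]),         # columns 7-11
        "name": line[12:16].strip(),       # columns 13-16
        "res_name": line[17:20].strip(),   # columns 18-20
        "chain": line[21],                 # column 22
        "res_seq": int(line[22:26]),       # columns 23-26
        "x": float(line[30:38]),           # columns 31-38
        "y": float(line[38:46]),           # columns 39-46
        "z": float(line[46:54]),           # columns 47-54
        "occupancy": float(line[54:60]),   # columns 55-60
        "b_factor": float(line[60:66]),    # columns 61-66
    }


# Atom 2 of 2MCG, laid out in exact columns by the format string.
LINE = "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
    2, " CA", "PCA", "1", 1, 23.296, -22.902, 102.481, 1.00, 17.38)

atom = parse_atom(LINE)
print(atom["name"], atom["res_name"], atom["x"], atom["y"], atom["z"])
```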
Fragment of CIF example
##################### ATOM_SITE #####################
loop_
_atom_site.label_seq_id
_atom_site.group_PDB
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.auth_seq_id
_atom_site.label_alt_id
_atom_site.cartn_x
_atom_site.cartn_y
_atom_site.cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.footnote_id
_atom_site.label_entity_id
_atom_site.id
1 ATOM N N GLY A 1 . -8.863 16.944 14.289 1.00 21.88 1 1 1
1 ATOM C CA GLY A 1 . -9.929 17.026 13.244 1.00 22.85 1 1 2
1 ATOM C C GLY A 1 . -10.051 15.625 12.618 1.00 43.92 1 1 3
1 ATOM O O GLY A 1 . -9.782 14.728 13.407 1.00 25.22 1 1 4
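In contrast to the PDB format, a CIF `loop_` names its columns first and then lists whitespace-separated values, so a row may even wrap onto the next line. The following is a minimal sketch of reading such a loop; `parse_loop` is a made-up helper, the sample uses only a few of the `_atom_site` columns, and quoted or multi-line string values (which real CIF allows) are deliberately not handled.

```python
def parse_loop(text):
    """Parse one CIF loop_: returns a list of dicts, one per data row."""
    names, values = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line == "loop_":
            continue                       # skip blanks, comments and the loop_ keyword
        if line.startswith("_") and not values:
            names.append(line)             # column name, one per line
        else:
            values.extend(line.split())    # free format: whitespace separates tokens
    if not names:
        return []
    width = len(names)
    return [dict(zip(names, values[i:i + width]))
            for i in range(0, len(values), width)]


CIF = """
loop_
_atom_site.group_PDB
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.cartn_x
_atom_site.cartn_y
_atom_site.cartn_z
ATOM N  GLY -8.863 16.944 14.289
ATOM CA GLY -9.929 17.026
13.244
"""

rows = parse_loop(CIF)
print(rows[1]["_atom_site.label_atom_id"], rows[1]["_atom_site.cartn_z"])
```

Collecting all tokens first and cutting them into rows by the number of column names is what lets the second row span two lines.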
3-D Structure from PDB
• 20 Amino acids
http://www.clunet.edu/BioDev/omm/aa/aa.htm
http://www.nyu.edu/pages/mathmol/library/life/
http://inquiry.uiuc.edu/bioweb/tutorial/amino_acids.htm
Phenylalanine, Glycine, Histidine, Isoleucine, Lysine,
Leucine, Methionine, Asparagine, Proline, Glutamine,
Arginine, Serine, Threonine, Valine, Tryptophan,
Glutamic acid, Alanine, Cysteine, Aspartic acid, Tyrosine
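PDB files name residues with three-letter codes, while sequence databases use one-letter codes. A small lookup table, shown here as an illustrative sketch, converts between them. The standard codes for the 20 amino acids are used; the helper `to_one_letter` and the choice of "X" for non-standard residues such as PCA are assumptions made for this example.

```python
# Standard three-letter to one-letter codes for the 20 amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def to_one_letter(residues, unknown="X"):
    """Convert three-letter residue names to a one-letter string; unknown names become `unknown`."""
    return "".join(THREE_TO_ONE.get(name.upper(), unknown) for name in residues)


# Residues from the first SEQRES line of 2MCG.
seqres = "PCA SER ALA LEU THR GLN PRO PRO SER ALA SER GLY SER".split()
one_letter = to_one_letter(seqres)
print(one_letter)
```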
How to Construct 3-D Molecule
• Read coordinates from PDB (find the matching structure)
• Set up the data structure for molecules
• Form bonds among atoms and groups
• Calculate secondary structure
• Implement 3-D graphical algorithms
• Render the 3-D graph in various styles: wires, sticks, balls, ribbons, and the like.
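For the "set up the data structure" step, one possible arrangement is a molecule that holds residues, each of which holds its atoms. The sketch below is only an illustration of that idea, not the structure used by any particular viewer; the class and method names are invented, and the sample records are four atoms taken from 2MCG.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    serial: int
    name: str
    x: float
    y: float
    z: float


@dataclass
class Residue:
    name: str
    chain: str
    seq: int
    atoms: list = field(default_factory=list)


@dataclass
class Molecule:
    residues: list = field(default_factory=list)

    def add_atom(self, atom, res_name, chain, res_seq):
        """Append the atom to the current residue, opening a new residue when chain or number changes."""
        last = self.residues[-1] if self.residues else None
        if last is None or (last.chain, last.seq) != (chain, res_seq):
            last = Residue(res_name, chain, res_seq)
            self.residues.append(last)
        last.atoms.append(atom)


# (serial, atom name, residue name, chain, residue number, x, y, z) from 2MCG.
RECORDS = [
    (9, "N", "SER", "1", 2, 25.548, -22.930, 103.333),
    (10, "CA", "SER", "1", 2, 26.608, -22.758, 104.327),
    (15, "N", "ALA", "1", 3, 27.758, -24.228, 105.876),
    (20, "N", "LEU", "1", 4, 30.279, -25.716, 105.041),
]

molecule = Molecule()
for serial, name, res_name, chain, res_seq, x, y, z in RECORDS:
    molecule.add_atom(Atom(serial, name, x, y, z), res_name, chain, res_seq)

print([(r.name, len(r.atoms)) for r in molecule.residues])
```

Grouping atoms by residue at load time keeps the later steps simple: bonds inside a residue and bonds between neighbouring residues can each be found by walking this list.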
Bonds among atoms
ATOM 20 N LEU 1 4 30.279 -25.716 105.041 1.00 10.60 2MCG 249
ATOM 21 CA LEU 1 4 31.406 -26.518 104.496 1.00 9.39 2MCG 250
ATOM 22 C LEU 1 4 32.658 -25.786 105.165 1.00 8.90 2MCG 251
ATOM 23 O LEU 1 4 32.890 -24.586 104.967 1.00 8.74 2MCG 252
ATOM 24 CB LEU 1 4 31.615 -26.794 103.141 1.00 8.79 2MCG 253
ATOM 25 CG LEU 1 4 31.552 -27.440 101.860 1.00 8.37 2MCG 254
ATOM 26 CD1 LEU 1 4 32.732 -26.945 100.970 1.00 7.99 2MCG 255
ATOM 27 CD2 LEU 1 4 31.706 -28.963 102.016 1.00 8.09 2MCG 256
Leucine LEU L
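The ATOM records carry coordinates but no bond list, so bonds inside a residue have to be inferred. One common approach, sketched here under the assumption that any two C, N or O atoms closer than about 1.9 Å are covalently bonded, is a pairwise distance test. The cutoff value and the function name `find_bonds` are choices made for this example; hydrogens and sulfur would need their own cutoffs. The coordinates are the leucine atoms listed above.

```python
import math
from itertools import combinations

# Leucine 4 of 2MCG: atom name -> (x, y, z) in angstroms.
LEU = {
    "N":   (30.279, -25.716, 105.041),
    "CA":  (31.406, -26.518, 104.496),
    "C":   (32.658, -25.786, 105.165),
    "O":   (32.890, -24.586, 104.967),
    "CB":  (31.615, -26.794, 103.141),
    "CG":  (31.552, -27.440, 101.860),
    "CD1": (32.732, -26.945, 100.970),
    "CD2": (31.706, -28.963, 102.016),
}


def find_bonds(atoms, cutoff=1.9):
    """Return pairs of atom names whose distance is at most `cutoff` angstroms."""
    bonds = []
    for (name_a, pos_a), (name_b, pos_b) in combinations(atoms.items(), 2):
        if math.dist(pos_a, pos_b) <= cutoff:
            bonds.append((name_a, name_b))
    return bonds


bonds = find_bonds(LEU)
for pair in bonds:
    print(pair)
```

The test recovers the expected leucine skeleton: the backbone N-CA-C=O plus the side chain CA-CB-CG with its two CD branches.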
Bonds between groups
ATOM 9 N SER 1 2 25.548 -22.930 103.333 1.00 16.05 2MCG 238
ATOM 10 CA SER 1 2 26.608 -22.758 104.327 1.00 15.38 2MCG 239
ATOM 11 C SER 1 2 27.351 -24.076 104.604 1.00 14.81 2MCG 240
ATOM 12 O SER 1 2 27.530 -24.949 103.740 1.00 15.00 2MCG 241
ATOM 13 CB SER 1 2 25.887 -22.406 105.682 1.00 15.73 2MCG 242
ATOM 14 OG SER 1 2 25.193 -23.586 106.117 1.00 15.14 2MCG 243
ATOM 15 N ALA 1 3 27.758 -24.228 105.876 1.00 13.72 2MCG 244
ATOM 16 CA ALA 1 3 28.328 -25.397 106.456 1.00 12.33 2MCG 245
ATOM 17 C ALA 1 3 29.255 -26.303 105.686 1.00 11.58 2MCG 246
ATOM 18 O ALA 1 3 29.033 -27.552 105.641 1.00 11.28 2MCG 247
ATOM 19 CB ALA 1 3 27.101 -26.228 106.998 1.00 12.39 2MCG 248
ATOM 20 N LEU 1 4 30.279 -25.716 105.041 1.00 10.60 2MCG 249
ATOM 21 CA LEU 1 4 31.406 -26.518 104.496 1.00 9.39 2MCG 250
ATOM 22 C LEU 1 4 32.658 -25.786 105.165 1.00 8.90 2MCG 251
ATOM 23 O LEU 1 4 32.890 -24.586 104.967 1.00 8.74 2MCG 252
ATOM 24 CB LEU 1 4 31.615 -26.794 103.141 1.00 8.79 2MCG 253
ATOM 25 CG LEU 1 4 31.552 -27.440 101.860 1.00 8.37 2MCG 254
ATOM 26 CD1 LEU 1 4 32.732 -26.945 100.970 1.00 7.99 2MCG 255
ATOM 27 CD2 LEU 1 4 31.706 -28.963 102.016 1.00 8.09 2MCG 256
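Between groups, the bond to look for is the peptide bond from the carbonyl C of one residue to the backbone N of the next. A minimal sketch, using the C and N coordinates of SER 2, ALA 3 and LEU 4 from the records above: the 1.5 Å cutoff (a peptide C-N bond is typically about 1.33 Å) and the function name `peptide_bonds` are assumptions for illustration.

```python
import math

# (residue name, residue number, backbone atoms) for three consecutive residues of 2MCG.
BACKBONE = [
    ("SER", 2, {"N": (25.548, -22.930, 103.333), "C": (27.351, -24.076, 104.604)}),
    ("ALA", 3, {"N": (27.758, -24.228, 105.876), "C": (29.255, -26.303, 105.686)}),
    ("LEU", 4, {"N": (30.279, -25.716, 105.041), "C": (32.658, -25.786, 105.165)}),
]


def peptide_bonds(residues, cutoff=1.5):
    """Link residue i to residue i+1 when C(i) and N(i+1) are within `cutoff` angstroms."""
    links = []
    for (_, seq_a, atoms_a), (_, seq_b, atoms_b) in zip(residues, residues[1:]):
        distance = math.dist(atoms_a["C"], atoms_b["N"])
        if distance <= cutoff:
            links.append((seq_a, seq_b, distance))
    return links


links = peptide_bonds(BACKBONE)
for seq_a, seq_b, distance in links:
    print(f"C({seq_a}) - N({seq_b}): {distance:.2f} A")
```

Only consecutive residues are compared, so the cost grows linearly with chain length; a chain break shows up as a missing link where the C-N distance exceeds the cutoff.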
Nucleic Acid Database(NDB)
• The NDB Project is funded by the National Science Foundation and the Department of Energy
• The goal of the NDB Project is to assemble and distribute structural information about nucleic acids
• The NDB format is the same as the PDB format.
Molvie 1.0
• A visual and interactive environment to display, analyze, fold and compare molecular structures.
• Developed in Java AWT by us.
• Runs as a Java application or as an applet embedded in a web page (http://www.cs.ucsb.edu/~mli/Bioinf/software/index.html)
Some features
• Molvie 1.0 is programmed in Java, hence it is platform-independent.
• There is no limit on the number of molecules, atoms, residues or animation frames displayed, as long as there is enough computer memory.
• Molvie has many rendering styles.
• Molvie can display two molecules simultaneously and allows the user to align secondary structure by dragging the mouse.
• Molvie also allows the user to click on part of the 3-D structure of a protein and displays the corresponding primary amino acid sequence.
Molvie Application Screen
Molvie Applet Screen
Show Molvie