BioInformatics (1)
Apr 01, 2015
What is Life All About: Self-compiling & self-assembling
Complementary surfaces
Watson-Crick base pair (Nature April 25, 1953)
Life Science vs Computing
Where do parasites come from?
(computer & biological viral codes)
Over $12 billion/year on computer viruses
LoveBug
Set dirtemp = fso.GetSpecialFolder(2)
Set c = fso.GetFile(WScript.ScriptFullName)
c.Copy(dirsystem&"\MSKernel32.vbs")
c.Copy(dirwin&"\Win32DLL.vbs")
c.Copy(dirsystem&"\LOVE-LETTER-FOR-YOU.TXT.vbs")
regruns()
html()
spreadtoemail()
listadriv()
20 M dead (worse than black plague & 1918 Flu)
AIDS - HIV-1 Polymerase drug resistance mutations
M41L, D67N, T69D, L210W, T215Y, H208Y
PISPIETVPV KLKPGMDGPK VKQWPLTEEK
IKALIEICAE LEKDGKISKI GPVNPYDTPV
FAIKKKNSDK WRKLVDFREL NKRTQDFCEV
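The mutation names above follow the usual "wild-type residue, position, new residue" notation. As a rough sketch, the snippet below checks each listed mutation against the 90-residue fragment shown on the slide. It assumes that the fragment is numbered from position 1 at its first residue; the helper name `classify` is made up for this illustration.

```python
import re

# The 90 residues shown on the slide, assumed to start at position 1.
FRAGMENT = (
    "PISPIETVPV" "KLKPGMDGPK" "VKQWPLTEEK"
    "IKALIEICAE" "LEKDGKISKI" "GPVNPYDTPV"
    "FAIKKKNSDK" "WRKLVDFREL" "NKRTQDFCEV"
)

MUTATIONS = ["M41L", "D67N", "T69D", "L210W", "T215Y", "H208Y"]


def classify(mutation, sequence):
    """Compare one mutation such as 'M41L' with a sequence.

    Returns 'wild-type', 'mutant', 'other' or 'out of range'.
    """
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"unrecognised mutation name: {mutation!r}")
    wild, position, mutant = match.group(1), int(match.group(2)), match.group(3)
    if position > len(sequence):
        return "out of range"
    residue = sequence[position - 1]  # positions are 1-based
    if residue == wild:
        return "wild-type"
    if residue == mutant:
        return "mutant"
    return "other"


for name in MUTATIONS:
    print(name, classify(name, FRAGMENT))
```

Under that numbering assumption, the fragment carries the mutant residue at positions 41, 67 and 69, while positions 208, 210 and 215 lie beyond the part of the sequence shown.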
Concept         Computers          Organisms
Instructions    Program            Genome
Bits            0,1                a,c,g,t
Stable memory   ROM, Disk, tape    DNA
Active memory   RAM                RNA
Processing      CPU/Compiler       enzyme/Ribosome
Editing         Editor             tRNA
Environment     Sockets, people    Water, salts, heat
I/O             AD/DA              proteins
Monomer         Minerals           Nucleotide
Polymer         chip               DNA, RNA, protein
Replication     Cut/Paste          DNA replication
Sensor/In       scanner            Chem/photo receptor
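The "Bits" row of the analogy can be made concrete: four bases fit in two bits each. The sketch below packs a DNA string into an integer and unpacks it again. The particular bit assignment (a=00, c=01, g=10, t=11) and the function names are arbitrary choices for illustration, not a standard.

```python
# Hypothetical 2-bit code for the four bases (an arbitrary assignment).
ENCODE = {"a": 0b00, "c": 0b01, "g": 0b10, "t": 0b11}
DECODE = {bits: base for base, bits in ENCODE.items()}


def pack(sequence):
    """Pack a DNA string into (integer, length), two bits per base."""
    value = 0
    for base in sequence.lower():
        value = (value << 2) | ENCODE[base]
    return value, len(sequence)


def unpack(value, length):
    """Inverse of pack: recover the lowercase DNA string."""
    bases = []
    for shift in range(2 * (length - 1), -1, -2):
        bases.append(DECODE[(value >> shift) & 0b11])
    return "".join(bases)


packed, n = pack("gattaca")
print(packed, n, unpack(packed, n))
```

The length has to be stored alongside the integer because leading "a" bases encode as zero bits and would otherwise be lost.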
Exciting Life ??
Elements of RNA-based life: C, H, N, O, P
Useful for many species: Na, K, Fe, Cl, Ca, Mg, Mo, Mn, S, Se, Cu, Ni, Co, Si
The Four Nucleosides of DNA
dA dG dC dT
A nucleoside is a sugar, here deoxyribose, plus a base
dA = deoxyadenosine, etc.
BASES
Purines: Adenine, Guanine
Pyrimidines: Thymine, Cytosine, Uracil
Base Pairing
A nucleotide is a phosphate, a sugar, and a purine or a pyrimidine base.
The monomeric units of nucleic acids are called nucleotides.
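Watson-Crick base pairing (A with T, G with C) means that one strand of DNA determines the other. A minimal sketch of that rule, assuming plain A/C/G/T input and using a function name chosen for this example:

```python
# Translation table implementing the pairing rule A<->T, G<->C.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return sequence.translate(COMPLEMENT)[::-1]


print(reverse_complement("GATTACA"))  # TGTAATC
```

The reversal is needed because the two strands of a double helix run antiparallel.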
Chromosomes
Genome and gene
Entity   Definition                                        Molecular mechanisms
Genome   Unit of information transmission                  DNA replication
Gene     Unit of information expression (a special         Transcription to RNA
         sequence of nucleotide bases, whose sequences     Translation to protein
         carry the information required for
         constructing protein)
Nucleic acid and proteins
Macromolecule       Backbone              Repeating unit                      Length        Role
Nucleic acid: DNA   Phosphodiester bonds  Deoxyribonucleotides (A, C, G, T)   10^3-10^8     Genome
Nucleic acid: RNA   Phosphodiester bonds  Ribonucleotides (A, C, G, U)        10^3-10^5     Genome
                                                                              10^3-10^4     Messenger
                                                                              10^2-10^3     Gene product
Protein             Peptide bonds         Amino acids (A, C, D, E, F, G, H,   10^2-10^3     Gene product
                                          I, K, L, M, N, P, Q, R, S, T, V,
                                          W, Y)
Protein: structure components of cells/tissues/enzymes
Nucleotide codes
A Adenine W Weak (A or T)
G Guanine S Strong (G or C)
C Cytosine M Amino (A or C)
T Thymine K Keto (G or T)
U Uracil B Not A (G or C or T)
R Purine (A or G) H Not G (A or C or T)
Y Pyrimidine (C or T) D Not C (A or G or T)
N Any nucleotide V Not T (A or G or C)
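One practical use of these ambiguity codes is describing sequence motifs. The sketch below turns a pattern written with the codes from the table into a regular expression and searches a DNA sequence with it. Treating U like T is an assumption made here so that one table serves both; the function names are illustrative.

```python
import re

# Ambiguity codes from the table above; U is treated as T (assumption).
IUPAC = {
    "A": "A", "G": "G", "C": "C", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "N": "ACGT",
    "W": "AT", "S": "GC", "M": "AC", "K": "GT",
    "B": "GCT", "H": "ACT", "D": "AGT", "V": "AGC",
}


def to_regex(pattern):
    """Expand each ambiguity code into a regex character class."""
    return "".join(f"[{IUPAC[code]}]" for code in pattern.upper())


def find_sites(pattern, sequence):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    regex = re.compile(f"(?=({to_regex(pattern)}))")
    return [match.start() for match in regex.finditer(sequence.upper())]


print(to_regex("RY"))                       # [AG][CT]
print(find_sites("GANTC", "ttgagtcgattc"))  # [2, 7]
```

The lookahead `(?=...)` is used so that overlapping occurrences are reported as well.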
Amino acid codes
Ala  A  Alanine
Arg  R  Arginine
Asn  N  Asparagine
Asp  D  Aspartic acid
Cys  C  Cysteine
Gln  Q  Glutamine
Glu  E  Glutamic acid
Gly  G  Glycine
His  H  Histidine
Ile  I  Isoleucine
Leu  L  Leucine
Lys  K  Lysine
Met  M  Methionine
Phe  F  Phenylalanine
Pro  P  Proline
Ser  S  Serine
Thr  T  Threonine
Trp  W  Tryptophan
Tyr  Y  Tyrosine
Val  V  Valine
Asx  B  Asn or Asp
Glx  Z  Gln or Glu
Sec  U  Selenocysteine
Unk  X  Unknown
Standard Genetic Code
Schematic illustration of a plant cell (Home for DNA)
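As an illustration of how the standard genetic code is applied, the sketch below builds the 64-entry codon table from the conventional compact string (bases ordered T, C, A, G) and translates a short coding sequence. The input is assumed to be the coding strand, in frame, so transcription is modelled simply as replacing T with U. The example sequence is invented.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Coding-strand DNA to mRNA: same sequence with U in place of T."""
    return dna.upper().replace("T", "U")


def translate(sequence):
    """Translate DNA or RNA, codon by codon, up to the first stop codon."""
    dna = sequence.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


gene = "ATGGCCTGGTAA"  # made-up example: Met, Ala, Trp, stop
print(transcribe(gene))
print(translate(gene))  # MAW
```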
History of structure determination for nucleic acids and proteins
(Technology development and structure determination)
1949 Edman degradation
1951 α-helix model
1953 DNA double helix model
1953 Insulin primary structure
1954 Isomorphous replacement
1960 Myoglobin tertiary structure
1962 Restriction enzyme
1965 tRNA(Ala) primary structure
1972 DNA cloning
1973 tRNA(Phe) tertiary structure
1975 DNA sequencing
1977 φX174 complete genome
1979 Z-DNA by single crystal diffraction
1984 Pulse field gel electrophoresis
1985 Polymerase chain reaction
1986 Protein structure by 2D NMR
1987 YAC vector
1988 Human Genome Project
1993 DNA chip
1995 H. influenzae complete genome
Human chromosomes: idiograms
X-linked recessive disorder. The inheritance pattern is shown for a recessive gene on the chromosome X, designated in bold.
[Figure: pedigree. Parents: Male XY (normal) and Female XX (normal). Children: Female XX (normal), Female XX (normal), Male XY (normal), Male XY (affected).]
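The pedigree above can be reproduced by enumerating the four equally likely combinations of parental chromosomes. In this sketch the recessive allele is written as lowercase "x", and the mother is assumed to be an unaffected carrier (one normal X, one x), which is consistent with the outcome described on the slide. The function name is illustrative.

```python
from collections import Counter
from itertools import product


def offspring(mother, father):
    """Return {(sex, genotype, status): probability} for one child.

    'X' is the normal allele, 'x' the recessive disease allele,
    'Y' the father's Y chromosome.
    """
    counts = Counter()
    for from_mother, from_father in product(mother, father):
        if from_father == "Y":
            sex, genotype = "male", from_mother + "Y"
            affected = from_mother == "x"  # a single x is enough in males
        else:
            sex = "female"
            genotype = "".join(sorted(from_mother + from_father))
            affected = genotype == "xx"    # females need two copies
        counts[(sex, genotype, "affected" if affected else "normal")] += 1
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


# Carrier mother (assumed) and unaffected father.
for outcome, probability in offspring(("X", "x"), ("X", "Y")).items():
    print(outcome, probability)
```

With these parents each of the four outcomes has probability 0.25: two unaffected daughters (one of them a carrier), one unaffected son and one affected son.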
Reductionistic and synthetic approaches in biology
Biological System (Organism) to Building Blocks (Genes/Molecules): Reductionistic Approach (Experiments)
Building Blocks (Genes/Molecules) to Biological System (Organism): Synthetic Approach (Bioinformatics)
Basic principles in physics, chemistry and biology.
Field       Entity     Building blocks        Principles known?
Physics     Matter     Elementary particles   Yes
Chemistry   Compound   Elements               Yes
Biology     Organism   Genes                  No
[Figure: growth over time. X axis: Year (1965-2000). Y axis: Amount (x1000), logarithmic scale from 0.001 to 100 000. Series: transistors/chip, DNA sequences, mapped human genes, 3-D structures, MEDLINE records (MEDLINE G5 MeSH).]
The addresses for the major databases
Database     Organization                                            Address
MEDLINE      National Library of Medicine                            www.nlm.nih.gov
GenBank      National Center for Biotechnology Information           www.ncbi.nlm.nih.gov
EMBL         European Bioinformatics Institute                       www.ebi.ac.uk
DDBJ         National Institute of Genetics, Japan                   www.ddbj.nig.ac.jp
SWISS-PROT   Swiss Institute of Bioinformatics                       www.expasy.ch
PIR          National Biomedical Research Foundation                 www-nbrf.georgetown.edu
PRF          Protein Research Foundation, Japan                      www.prf.or.jp
PDB          Research Collaboratory for Structural Bioinformatics    www.rcsb.org
CSD          Cambridge Crystallographic Data Centre                  www.ccdc.cam.ac.uk
New generation of molecular biology databases
Information                            Database        Address
Compounds and reactions                LIGAND          www.genome.ad.jp/dbget/ligand.html
                                       AAindex         www.genome.ad.jp/dbget/aaindex.html
Protein families and sequence motifs   PROSITE         www.expasy.ch/sprot/prosite.html
                                       Blocks          www.blocks.fhcrc.org/
                                       PRINTS          www.biochem.ucl.ac.uk/bsm/dbbrowser/PRINTS/
                                       Pfam            www.sanger.ac.uk/Pfam/, pfam.wustl.edu/
                                       ProDom          protein.toulouse.inra.fr/prodom.html
3D fold classifications                SCOP            scop.mrc-lmb.cam.ac.uk/scop/
                                       CATH            www.biochem.ucl.ac.uk/bsm/cath/
Orthologous genes                      COG             www.ncbi.nlm.nih.gov/COG/
                                       KEGG            www.genome.ad.jp/kegg/
Biochemical pathways                   KEGG            www.genome.ad.jp/kegg/
                                       WIT             www.mcs.anl.gov/WIT2/
                                       EcoCyc          ecocyc.PangeaSystems.com/ecocyc/
                                       UM-BBD          www.labmed.umn.edu/umbbd/
Genome diversity                       NCBI Taxonomy   www.ncbi.nlm.nih.gov/Taxonomy/
                                       OMIM            www.ncbi.nlm.nih.gov/Omim/
Example of sequence database entry for GenBank
LOCUS       DRODPPC      4001 bp            INV       15-MAR-1990
DEFINITION  D.melanogaster decapentaplegic gene complex (DPP-C), complete cds.
ACCESSION   M30116
KEYWORDS    .
SOURCE      D.melanogaster, cDNA to mRNA.
  ORGANISM  Drosophila melanogaster
            Eukaryota; mitochondrial eukaryotes; Metazoa; Arthropoda;
            Tracheata; Insecta; Pterygota; Diptera; Brachycera; Muscomorpha;
            Ephydroidea; Drosophilidae; Drosophila.
REFERENCE   1  (bases 1 to 4001)
  AUTHORS   Padgett, R.W., St Johnston, R.D. and Gelbart, W.M.
  TITLE     A transcript from a Drosophila pattern gene predicts a protein
            homologous to the transforming growth factor-beta family
  JOURNAL   Nature 325, 81-84 (1987)
  MEDLINE   87090408
COMMENT     The initiation codon could be at either 1188-1190 or 1587-1589
FEATURES             Location/Qualifiers
     source          1..4001
                     /organism="Drosophila melanogaster"
                     /db_xref="taxon:7227"
     mRNA            <1..3918
                     /gene="dpp"
                     /note="decapentaplegic protein mRNA"
                     /db_xref="FlyBase:FBgn0000490"
     gene            1..4001
                     /note="decapentaplegic"
                     /gene="dpp"
                     /allele=""
                     /db_xref="FlyBase:FBgn0000490"
     CDS             1188..2954
                     /gene="dpp"
                     /note="decapentaplegic protein (1188 could be 1587)"
                     /codon_start=1
                     /db_xref="FlyBase:FBgn0000490"
                     /db_xref="PID:g157292"
                     /translation="MRAWLLLLAVLATFQTIVRVASTEDISQRFIAAIAPVAAHIPLASA
                     SGSGSGRSGSRSVGASTSTALAKAFNPFSEPASFSDSDKSHRSKTNKKPSKSDANR
                     ……………………
                     LGYDAYYCHGKCPFPLADHFNSTNAVVQTLVNNMNPGKVPKACCVPTQLDSVAMLYL
                     NDQSTBVVLKNYQEMTBBGCGCR"
BASE COUNT     1170 a   1078 c    956 g    797 t
ORIGIN
        1 gtcgttcaac agcgctgatc gagtttaaat ctataccgaa atgagcggcg gaaagtgagc
       61 cacttggcgt gaacccaaag ctttcgagga aaattctcgg acccccatat acaaatatcg
      121 gaaaaagtat cgaacagttt cgcgacgcga agcgttaaga tcgcccaaag atctccgtgc
      181 ggaaacaaag aaattgaggc actattaaga gattgttgtt gtgcgcgagt gtgtgtcttc
      241 agctgggtgt gtggaatgtc aactgacggg ttgtaaaggg aaaccctgaa atccgaacgg
      301 ccagccaaag caaataaagc tgtgaatacg aattaagtac aacaaacagt tactgaaaca
      361 gatacagatt cggattcgaa tagagaaaca gatactggag atgcccccag aaacaattca
      421 attgcaaata tagtgcgttg cgcgagtgcc agtggaaaaa tatgtggatt acctgcgaac
      481 cgtccgccca aggagccgcc gggtgacagg tgtatccccc aggataccaa cccgagccca
      541 gaccgagatc cacatccaga tcccgaccgc agggtgccag tgtgtcatgt gccgcggcat
      601 accgaccgca gccacatcta ccgaccaggt gcgcctcgaa tgcggcaaca caattttcaa
………………………….
     3841 aactgtataa acaaaacgta tgccctataa atatatgaat aactatctac atcgttatgc
     3901 gttctaagct aagctcgaat aaatccgtac acgttaatta atctagaatc gtaagaccta
     3961 acgcgtaagc tcagcatgtt ggataaatta atagaaacga g
//
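A few fields of the GenBank entry above are enough for simple consistency checks. The sketch below is not a full GenBank parser: it copies four lines from the entry into a string and pulls values out with regular expressions. The names `EXCERPT` and `summarize` are made up for this example.

```python
import re

# Four lines copied from the entry above (spacing simplified).
EXCERPT = """\
LOCUS       DRODPPC      4001 bp    INV       15-MAR-1990
ACCESSION   M30116
     CDS             1188..2954
BASE COUNT     1170 a   1078 c    956 g    797 t
"""


def summarize(text):
    """Extract name, length, accession, CDS span and base counts."""
    locus = re.search(r"^LOCUS\s+(\S+)\s+(\d+) bp", text, re.M)
    accession = re.search(r"^ACCESSION\s+(\S+)", text, re.M)
    cds = re.search(r"^ +CDS +(\d+)\.\.(\d+)", text, re.M)
    base_line = re.search(r"^BASE COUNT(.*)$", text, re.M).group(1)
    counts = {base: int(n)
              for n, base in re.findall(r"(\d+)\s+([acgt])", base_line)}
    start, end = int(cds.group(1)), int(cds.group(2))
    return {
        "name": locus.group(1),
        "length": int(locus.group(2)),
        "accession": accession.group(1),
        "cds_length": end - start + 1,
        "base_counts": counts,
    }


info = summarize(EXCERPT)
print(info)
# The base counts should add up to the sequence length.
print(sum(info["base_counts"].values()) == info["length"])
# Codons in the CDS, minus one stop codon, give the protein length.
print(info["cds_length"] // 3 - 1)
```

For this entry the base counts sum to 4001, and the 1767-base CDS corresponds to 588 amino acids plus a stop codon, which agrees with the length given in the SWISS-PROT entry for the same protein.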
Example of sequence database entry for SWISS-PROT
ID   DECA_DROME     STANDARD;      PRT;   588 AA.
AC   P07713;
DT   01-APR-1988 (REL. 07, CREATED)
DT   01-APR-1988 (REL. 07, LAST SEQUENCE UPDATE)
DT   01-FEB-1995 (REL. 31, LAST ANNOTATION UPDATE)
DE   DECAPENTAPLEGIC PROTEIN PRECURSOR (DPP-C PROTEIN).
GN   DPP.
OS   DROSOPHILA MELANOGASTER (FRUIT FLY).
OC   EUKARYOTA; METAZOA; ARTHROPODA; INSECTA; DIPTERA.
RN   [1]
RP   SEQUENCE FROM N.A.
RM   87090408
RA   PADGETT R.W., ST JOHNSTON R.D., GELBART W.M.;
RL   NATURE 325:81-84 (1987)
RN   [2]
RP   CHARACTERIZATION, AND SEQUENCE OF 457-476.
RM   90258853
RA   PANGANIBAN G.E.F., RASHKA K.E., NEITZEL M.D., HOFFMANN F.M.;
RL   MOL. CELL. BIOL. 10:2669-2677(1990).
CC   -!- FUNCTION: DPP IS REQUIRED FOR THE PROPER DEVELOPMENT OF THE
CC       EMBRYONIC DORSAL HYPODERM, FOR VIABILITY OF LARVAE AND FOR CELL
CC       VIABILITY OF THE EPITHELIAL CELLS IN THE IMAGINAL DISKS.
CC   -!- SUBUNIT: HOMODIMER, DISULFIDE-LINKED.
CC   -!- SIMILARITY: TO OTHER GROWTH FACTORS OF THE TGF-BETA FAMILY.
DR   EMBL; M30116; DMDPPC.
DR   PIR; A26158; A26158.
DR   HSSP; P08112; 1TFG.
DR   FLYBASE; FBGN0000490; DPP.
DR   PROSITE; PS00250; TGF_BETA.
KW   GROWTH FACTOR; DIFFERENTIATION; SIGNAL.
FT   SIGNAL        1      ?       POTENTIAL.
FT   PROPEP        ?    456
FT   CHAIN       457    588       DECAPENTAPLEGIC PROTEIN.
FT   DISULFID    487    553       BY SIMILARITY.
FT   DISULFID    516    585       BY SIMILARITY.
FT   DISULFID    520    587       BY SIMILARITY.
FT   DISULFID    552    552       INTERCHAIN (BY SIMILARITY).
FT   CARBOHYD    120    120       POTENTIAL.
FT   CARBOHYD    342    342       POTENTIAL.
FT   CARBOHYD    377    377       POTENTIAL.
FT   CARBOHYD    529    529       POTENTIAL.
SQ   SEQUENCE   588 AA;  65850 MW;  1768420 CN;
     MRAWLLLLAV LATFQTIVRV ASTEDISQRF IAAIAPVAAH IPLASASGSG SGRSGSRSVG
     ASTSTAGAKA FNRFSEPASF SDSDKSHRSK TNKKPSKSDA NRQFNEVHKP RTDQLENSKN
     KSKQLVNKPN HNKMAVKEQR SHHKKSHHHR SHQPKQASAS TESHQSSSIE SIFVEEPTLV
     LDREVASINV PANAKAIIAE QGPSTYSKEA LIKDKLKPDP STYLVEIKSL LSLFNMKRPP
     KIDRSKIIIP EPMKKLYAEI MGHELDSVNI PKPGLLTKSA NTVRSFTHKD SKIDDRFPHH
     HRFRLHFDVK SIPADEKLKA AELQLTRDAL SQQVVASRSS ANRTRYQBLV YDITRVGVRG
     QREPSYLLLD TKTBRLNSTD TVSLDVQPAV DRWLASPQRN YGLLVEVRTV RSLKPAPHHH
     VRLRRSADEA HERWQHKQPL LFTYTDDGRH DARSIRDVSG GEGGGKGGRN KRHARRPTRR
     KNHDDTCRRH SLYVDFSDVG WDDWIVAPLG YDAYYCHGKC PFPLADHRNS TNHAVVQTLV
     NNMNPGKBPK ACCBPTQLDS VAMLYLNDQS TVVLKNYQEM TVVGCGCR
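The feature (FT) lines of the SWISS-PROT entry above can be cross-checked against its sequence. The sketch below copies the CHAIN and DISULFID lines and the last 108 residues (positions 481 to 588) from the entry, then confirms that every annotated disulfide position holds a cysteine. The variable and function names are illustrative, and the whitespace-splitting approach is a simplification of the real fixed-column format.

```python
FT_LINES = """\
FT   CHAIN       457    588       DECAPENTAPLEGIC PROTEIN.
FT   DISULFID    487    553       BY SIMILARITY.
FT   DISULFID    516    585       BY SIMILARITY.
FT   DISULFID    520    587       BY SIMILARITY.
FT   DISULFID    552    552       INTERCHAIN (BY SIMILARITY).
"""

# Residues 481-588, copied from the last two sequence lines of the entry.
TAIL_START = 481
SEQ_TAIL = (
    "KNHDDTCRRH" "SLYVDFSDVG" "WDDWIVAPLG" "YDAYYCHGKC" "PFPLADHRNS"
    "TNHAVVQTLV" "NNMNPGKBPK" "ACCBPTQLDS" "VAMLYLNDQS" "TVVLKNYQEM"
    "TVVGCGCR"
)


def parse_features(text):
    """Return (key, start, end, note) tuples; positions kept as integers."""
    features = []
    for line in text.splitlines():
        if not line.startswith("FT"):
            continue
        fields = line.split(None, 4)
        key, start, end = fields[1], int(fields[2]), int(fields[3])
        note = fields[4] if len(fields) > 4 else ""
        features.append((key, start, end, note))
    return features


def residue(position):
    """1-based lookup into the copied tail of the sequence."""
    return SEQ_TAIL[position - TAIL_START]


features = parse_features(FT_LINES)
chain = next(f for f in features if f[0] == "CHAIN")
print("mature chain length:", chain[2] - chain[1] + 1)
for key, start, end, note in features:
    if key == "DISULFID":
        print(start, residue(start), end, residue(end), note)
```

Lines with an unknown position ("?"), such as the SIGNAL and PROPEP features, are left out of the excerpt because this simplified parser expects numeric positions.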
Functional classification of E. coli genes according to Monica Riley
I. Intermediary metabolism
   A. Degradation
   B. Central intermediary metabolism
   C. Respiration (aerobic and anaerobic)
   D. Fermentation
   E. ATP-proton motive force interconversions
   F. Broad regulatory functions
II. Biosynthesis of small molecules
   A. Amino acids
   B. Nucleotides
   C. Sugars and sugar molecules
   D. Cofactors, prosthetic groups, electron carriers
   E. Fatty acids and lipids
   F. Polyamines
III. Macromolecule metabolism
   A. Synthesis and modification
   B. Degradation of macromolecules
IV. Cell structure
   A. Membrane components
   B. Murein sacculus
   C. Surface polysaccharides and antigens
   D. Surface structures
V. Cellular processes
   A. Transport/binding proteins
   B. Cell division
   C. Chemotaxis and mobility
   D. Protein secretion
   E. Osmotic adaptation
VI. Other functions
   A. Cryptic genes
   B. Phage-related functions and prophages
   C. Colicin-related functions
   D. Plasmid-related functions
   E. Drug/analog sensitivity
   F. Radiation sensitivity
   G. DNA sites
   H. Adaptations to atypical conditions
The Protein Folding Problem
Protein Folding Problem (Sequence → 3D Structure)
1. Protein folding is thermodynamically determined (Anfinsen’s thermodynamic principle)
Protein + Environment
2. Protein folding is a reaction involving other interacting molecules (Principle of molecular interactions)
Protein + Chaperonins +….
Central Paradigm
Bioinformatics: A Long Journey (How far are we away from knowing the God ??)
Sequence to exon: 80% [Laub 98]
Exons to gene (without cDNA or homolog): ~30% [Laub 98]
Gene to regulation: ~10% [Hughes 00]
Regulated gene to protein sequence: 98% [Gesteland ]
Sequence to secondary-structure (α, β, c): 77% [CASP]
Secondary-structure to 3D structure: 25% [CASP]
3D structure to ligand specificity: ~10% [Johnson 99]
Expected accuracy overall ~= 0.8 * 0.3 * 0.1 * 0.98 * 0.77 * 0.25 * 0.1 = 0.0005 ?
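The overall figure is simply the product of the per-step accuracies, which treats the steps as independent (an assumption implicit in the slide's arithmetic). A short sketch to reproduce it:

```python
from math import prod

# Per-step accuracies as quoted on the slide.
STEPS = [
    ("Sequence to exon", 0.80),
    ("Exons to gene", 0.30),
    ("Gene to regulation", 0.10),
    ("Regulated gene to protein sequence", 0.98),
    ("Sequence to secondary structure", 0.77),
    ("Secondary structure to 3D structure", 0.25),
    ("3D structure to ligand specificity", 0.10),
]

overall = prod(accuracy for _, accuracy in STEPS)
print(f"{overall:.6f}")  # 0.000453, i.e. about 0.0005
```

Because the result is a product, it is dominated by the weakest steps: the three steps at 10% and 25% alone contribute a factor of 0.0025.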
Our Focus in Bioinformatics
Perturbation: Environment, Medication, Genetic Engineering
Dynamic Response: Gene Expression, Protein Expression
BioChip
DataBase: Genotype/Phenotype
Symbolic Algorithms/Computing
Analysis
Biology: Molecular Biology, Biochemistry, Genetics
Virtual Cell
Genome Sequencing