BioInformatics (1). What is Life All About : Self-compiling & self-assembling Complementary surfaces Watson-Crick base pair (Nature April 25, 1953)

BioInformatics (1)

What is Life All About: Self-compiling & self-assembling

Complementary surfaces

Watson-Crick base pair (Nature, April 25, 1953)

Life Science vs Computing

Where do parasites come from?

(computer & biological viral codes)

Over $12 billion/year on computer viruses

LoveBug

Set dirtemp = fso.GetSpecialFolder(2)
Set c = fso.GetFile(WScript.ScriptFullName)
c.Copy(dirsystem&"\MSKernel32.vbs")
c.Copy(dirwin&"\Win32DLL.vbs")
c.Copy(dirsystem&"\LOVE-LETTER-FOR-YOU.TXT.vbs")
regruns()
html()
spreadtoemail()
listadriv()

20 M dead (worse than the Black Plague & the 1918 flu)

AIDS - HIV-1 Polymerase drug resistance mutations

M41L, D67N, T69D, L210W, T215Y, H208Y

PISPIETVPV KLKPGMDGPK VKQWPLTEEK
IKALIEICAE LEKDGKISKI GPVNPYDTPV
FAIKKKNSDK WRKLVDFREL NKRTQDFCEV
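The mutation codes above read as wild-type residue, position, mutant residue. A minimal Python sketch of that notation follows. It assumes (this is not stated on the slide) that the 90-residue fragment shown is numbered from residue 1 of the polymerase; the function names are illustrative only.

```python
import re

# The 90 residues shown on the slide, with spaces removed.
# Assumption: the fragment starts at residue 1 of the polymerase.
FRAGMENT = (
    "PISPIETVPV" "KLKPGMDGPK" "VKQWPLTEEK"
    "IKALIEICAE" "LEKDGKISKI" "GPVNPYDTPV"
    "FAIKKKNSDK" "WRKLVDFREL" "NKRTQDFCEV"
)

MUTATIONS = ["M41L", "D67N", "T69D", "L210W", "T215Y", "H208Y"]


def parse_mutation(code):
    """Split a code such as 'M41L' into (wild type, 1-based position, mutant)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", code)
    if match is None:
        raise ValueError(f"not a substitution code: {code!r}")
    wild, position, mutant = match.groups()
    return wild, int(position), mutant


def classify(code, sequence):
    """Report which residue the sequence carries at the mutation site."""
    wild, position, mutant = parse_mutation(code)
    if position > len(sequence):
        return "outside fragment"
    residue = sequence[position - 1]
    if residue == mutant:
        return "mutant"
    if residue == wild:
        return "wild type"
    return f"other ({residue})"


for code in MUTATIONS:
    print(code, classify(code, FRAGMENT))
```

Under that numbering assumption, the fragment carries the mutant residue at positions 41, 67 and 69; the remaining three positions lie beyond the 90 residues shown.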

Concept          Computers          Organisms

Instructions     Program            Genome
Bits             0,1                a,c,g,t
Stable memory    ROM, Disk, tape    DNA
Active memory    RAM                RNA
Processing       CPU/Compiler       enzyme/Ribosome
Editing          Editor             tRNA
Environment      Sockets, people    Water, salts, heat
I/O              AD/DA              proteins
Monomer          Minerals           Nucleotide
Polymer          chip               DNA, RNA, protein
Replication      Cut/Paste          DNA replication
Sensor/In        scanner            Chem/photo receptor

Exciting Life ??

Elements of RNA-based life: C, H, N, O, P

Useful for many species: Na, K, Fe, Cl, Ca, Mg, Mo, Mn, S, Se, Cu, Ni, Co, Si

The Four Nucleosides of DNA

dA dG dC dT

A nucleoside is a sugar, here deoxyribose, plus a base

dA = deoxyadenosine, etc.

BASES

Purines: Adenine, Guanine

Pyrimidines: Thymine, Cytosine, Uracil

Base Pairing

A nucleotide is a phosphate, a sugar, and a purine or a pyrimidine base.

The monomeric units of nucleic acids are called nucleotides.
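Watson-Crick base pairing (A with T, G with C) is what makes one DNA strand determine the other. A small Python sketch of the rule; the function names are illustrative, and the example strand is made up for the demonstration.

```python
# Watson-Crick partners for the four DNA bases.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Return the base-by-base partner strand, in the same left-to-right order."""
    return "".join(PAIR[base] for base in strand.upper())


def reverse_complement(strand):
    """Return the partner strand read in its own 5'-to-3' direction."""
    return complement(strand)[::-1]


print(complement("GATTACA"))          # CTAATGT
print(reverse_complement("GATTACA"))  # TGTAATC
```

The reversal reflects the fact that the two strands of a double helix run antiparallel.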

Chromosomes

Genome and gene

Entity    Definition                                                 Molecular Mechanisms

Genome    Unit of information transmission                           DNA replication

Gene      Unit of information expression (a special sequence of      Transcription to RNA
          nucleotide bases, whose sequences carry the information    Translation to protein
          required for constructing protein)

Nucleic acid and proteins

Macromolecule        Backbone               Repeating unit                        Length       Role

Nucleic acid
  DNA                Phosphodiester bonds   Deoxyribonucleotides (A, C, G, T)     10^3-10^8    Genome
  RNA                Phosphodiester bonds   Ribonucleotides (A, C, G, U)          10^3-10^5    Genome
                                                                                  10^3-10^4    Messenger
                                                                                  10^2-10^3    Gene product
Protein              Peptide bonds          Amino acids                           10^2-10^3    Gene product
                                            (A, C, D, E, F, G, H, I, K, L,
                                            M, N, P, Q, R, S, T, V, W, Y)

Protein: structure components of cells/tissues/enzymes

Nucleotide codes

A Adenine W Weak (A or T)

G Guanine S Strong (G or C)

C Cytosine M Amino (A or C)

T Thymine K Keto (G or T)

U Uracil B Not A (G or C or T)

R Purine (A or G) H Not G (A or C or T)

Y Pyrimidine (C or T) D Not C (A or G or T)

N Any nucleotide V Not T (A or G or C)
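The ambiguity codes in this table map naturally onto character classes, so a motif written with them can be searched as a regular expression. A minimal Python sketch, assuming DNA input; the motif and sequence in the example are made up for illustration.

```python
import re

# One-letter nucleotide codes from the table, expanded to the bases they allow.
IUPAC = {
    "A": "A", "G": "G", "C": "C", "T": "T", "U": "U",
    "W": "AT", "S": "GC", "M": "AC", "K": "GT",
    "R": "AG", "Y": "CT", "N": "ACGT",
    "B": "GCT", "H": "ACT", "D": "AGT", "V": "AGC",
}


def to_regex(pattern):
    """Translate a motif with ambiguity codes into a regular expression."""
    parts = []
    for symbol in pattern.upper():
        bases = IUPAC[symbol]
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)


def find_sites(pattern, sequence):
    """Return 0-based start positions of non-overlapping matches."""
    regex = re.compile(to_regex(pattern))
    return [match.start() for match in regex.finditer(sequence.upper())]


print(to_regex("GRY"))                       # G[AG][CT]
print(find_sites("TATAWAW", "ggTATAAATcc"))  # [2]
```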

Amino acid codes

Ala   A   Alanine
Arg   R   Arginine
Asn   N   Asparagine
Asp   D   Aspartic acid
Cys   C   Cysteine
Gln   Q   Glutamine
Glu   E   Glutamic acid
Gly   G   Glycine
His   H   Histidine
Ile   I   Isoleucine
Leu   L   Leucine
Lys   K   Lysine
Met   M   Methionine
Phe   F   Phenylalanine
Pro   P   Proline
Ser   S   Serine
Thr   T   Threonine
Trp   W   Tryptophan
Tyr   Y   Tyrosine
Val   V   Valine
Asx   B   Asn or Asp
Glx   Z   Gln or Glu
Sec   U   Selenocysteine
Unk   X   Unknown
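The two code columns can be zipped into a lookup table for converting between notations. A short Python sketch; the hyphen-separated input format is an assumption made for the example.

```python
# Three-letter and one-letter codes, in the order listed in the table.
THREE_LETTER = (
    "Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys "
    "Met Phe Pro Ser Thr Trp Tyr Val Asx Glx Sec Unk"
).split()
ONE_LETTER = "ARNDCQEGHILKMFPSTWYVBZUX"

THREE_TO_ONE = dict(zip(THREE_LETTER, ONE_LETTER))


def to_one_letter(peptide):
    """Convert a peptide such as 'Met-Arg-Ala' to one-letter codes."""
    return "".join(
        THREE_TO_ONE[residue.capitalize()] for residue in peptide.split("-")
    )


print(to_one_letter("Met-Arg-Ala-Trp"))  # MRAW
```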

Standard Genetic Code
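The standard genetic code can be written compactly as 64 amino acid letters ordered by codon, with bases taken in the order T, C, A, G. The Python sketch below builds the table that way and translates a reading frame until the first stop codon. The input sequence is made up for the example, and reading from position 0 in frame is an assumption of the sketch.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for TTT, TTC, TTA, TTG, TCT, ... GGG; '*' marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence):
    """Translate DNA (or RNA) from position 0 until the first stop codon."""
    dna = sequence.upper().replace("U", "T")
    protein = []
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTGGTAA"))  # MAW
```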

Schematic illustration of a plant cell (Home for DNA)

History of structure determination for nucleic acids and proteins (technology development and structure determination)

1950
49 Edman degradation
54 Isomorphous replacement
α-helix model
53 DNA double helix model; Insulin primary structure

1960
62 Restriction enzyme
60 Myoglobin tertiary structure
65 tRNA(Ala) primary structure

1970
72 DNA cloning
75 DNA sequencing
73 tRNA(Phe) tertiary structure
77 φX174 complete genome
79 Z-DNA by single crystal diffraction

1980
84 Pulsed-field gel electrophoresis
85 Polymerase chain reaction
87 YAC vector
86 Protein structure by 2D NMR
88 Human Genome Project

1990
93 DNA chip
95 H. influenzae complete genome

2000

Human chromosomes: idiograms

X-linked recessive disorder. The inheritance pattern is shown for a recessive gene on the chromosome X, designated in bold.

Parents: Male XY (normal), Female XX (normal)

Offspring: Female XX (normal), Female XX (normal), Male XY (normal), Male XY (affected)
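The inheritance pattern can be enumerated as a Punnett square. The Python sketch below assumes the mother is an unaffected carrier (one normal X and one X with the recessive allele, written as lowercase x) and the father is unaffected; that genotype assignment is an interpretation of the figure, not something the slide text states.

```python
from collections import Counter
from itertools import product


def phenotype(genotype):
    """Classify a pair of sex chromosomes; lowercase 'x' carries the recessive allele."""
    sex = "Male" if "Y" in genotype else "Female"
    # A single normal X is enough to mask the recessive allele.
    status = "normal" if "X" in genotype else "affected"
    return f"{sex} ({status})"


def offspring(mother, father):
    """Count phenotypes over all equally likely egg/sperm combinations."""
    return Counter(phenotype(pair) for pair in product(mother, father))


carrier_mother = ("X", "x")   # assumption: unaffected carrier
normal_father = ("X", "Y")

for label, count in offspring(carrier_mother, normal_father).items():
    print(label, count)
```

With these assumed parents the four combinations give two normal daughters, one normal son and one affected son, which matches the four offspring listed on the slide.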

Reductionistic and synthetic approaches in biology

Biological System (Organism) → Building Blocks (Genes/Molecules): Reductionistic Approach (Experiments)

Building Blocks (Genes/Molecules) → Biological System (Organism): Synthetic Approach (Bioinformatics)

Basic principles in physics, chemistry and biology.

                                                Principles Known?

Physics      Matter      Elementary Particles   Yes
Chemistry    Compound    Elements               Yes
Biology      Organism    Genes                  No

[Figure: growth over time of transistors/chip, DNA sequences, mapped human genes, 3-D structures, and MEDLINE records (MEDLINE G5 MeSH). Horizontal axis: Year, 1965 to 2000. Vertical axis: Amount (x1000), logarithmic, 0.001 to 100 000.]

The addresses for the major databases

Database      Organization                                           Address

MEDLINE       National Library of Medicine                           www.nlm.nih.gov
GenBank       National Center for Biotechnology Information          www.ncbi.nlm.nih.gov
EMBL          European Bioinformatics Institute                      www.ebi.ac.uk
DDBJ          National Institute of Genetics, Japan                  www.ddbj.nig.ac.jp
SWISS-PROT    Swiss Institute of Bioinformatics                      www.expasy.ch
PIR           National Biomedical Research Foundation                www-nbrf.georgetown.edu
PRF           Protein Research Foundation, Japan                     www.prf.or.jp
PDB           Research Collaboratory for Structural Bioinformatics   www.rcsb.org
CSD           Cambridge Crystallographic Data Centre                 www.ccdc.cam.ac.uk

New generation of molecular biology databases

Information                 Database        Address

Compounds and reactions     LIGAND          www.genome.ad.jp/dbget/ligand.html
                            AAindex         www.genome.ad.jp/dbget/aaindex.html

Protein families and        PROSITE         www.expasy.ch/sprot/prosite.html
sequence motifs             Blocks          www.blocks.fhcrc.org/
                            PRINTS          www.biochem.ucl.ac.uk.bsm.dbbrowser/PRINTS/
                            Pfam            www.sanger.ac.uk/Pfam/, pfam.wustl.edu/
                            ProDom          protein.toulouse.inra.fr/prodom.html

3D fold classifications     SCOP            scop.mrc-lmb.cam.ac.uk/scop/
                            CATH            www.biochem.ucl.ac.uk/bsm/cath/

Orthologous genes           COG             www.ncbi.nlm.nih.gov/COG/
                            KEGG            www.genome.ad.jp/kegg/

Biochemical pathways        KEGG            www.genome.ad.jp/kegg/
                            WIT             www.mcs.anl.gov/WIT2/
                            EcoCyc          ecocyc.PangeaSystems.com/ecocyc/
                            UM-BBD          www.labmed.umn.edu/umbbd/

Genome diversity            NCBI Taxonomy   www.ncbi.nlm.nih.gov/Taxonomy/
                            OMIM            www.ncbi.nlm.nih.gov/Omim/

Example of sequence database entry for GenBank

LOCUS       DRODPPC      4001 bp            INV       15-MAR-1990
DEFINITION  D.melanogaster decapentaplegic gene complex (DPP-C), complete cds.
ACCESSION   M30116
KEYWORDS    .
SOURCE      D.melanogaster, cDNA to mRNA.
  ORGANISM  Drosophila melanogaster
            Eukaryote; mitochondrial eukaryotes; Metazoa; Arthropoda;
            Tracheata; Insecta; Pterygota; Diptera; Brachycera; Muscomorpha;
            Ephydroidea; Drosophilidae; Drosophila.
REFERENCE   1  (bases 1 to 4001)
  AUTHORS   Padgett, R.W., St Johnston, R.D. and Gelbart, W.M.
  TITLE     A transcript from a Drosophila pattern gene predicts a protein
            homologous to the transforming growth factor-beta family
  JOURNAL   Nature 325, 81-84 (1987)
  MEDLINE   87090408
COMMENT     The initiation codon could be at either 1188-1190 or 1587-1589
FEATURES             Location/Qualifiers
     source          1..4001
                     /organism="Drosophila melanogaster"
                     /db_xref="taxon:7227"
     mRNA            <1..3918
                     /gene="dpp"
                     /note="decapentaplegic protein mRNA"
                     /db_xref="FlyBase:FBgn0000490"
     gene            1..4001
                     /note="decapentaplegic"
                     /gene="dpp"
                     /allele=""
                     /db_xref="FlyBase:FBgn0000490"
     CDS             1188..2954
                     /gene="dpp"
                     /note="decapentaplegic protein (1188 could be 1587)"
                     /codon_start=1
                     /db_xref="FlyBase:FBgn0000490"
                     /db_xref="PID:g157292"
                     /translation="MRAWLLLLAVLATFQTIVRVASTEDISQRFIAAIAPVAAHIPLASASGSGSGRSGSRSVGASTSTALAKAFNPFSEPASFSDSDKSHRSKTNKKPSKSDANR……………………LGYDAYYCHGKCPFPLADHFNSTNAVVQTLVNNMNPGKVPKACCVPTQLDSVAMLYLNDQSTBVVLKNYQEMTBBGCGCR"
BASE COUNT     1170 a   1078 c    956 g    797 t
ORIGIN
        1 gtcgttcaac agcgctgatc gagtttaaat ctataccgaa atgagcggcg gaaagtgagc
       61 cacttggcgt gaacccaaag ctttcgagga aaattctcgg acccccatat acaaatatcg
      121 gaaaaagtat cgaacagttt cgcgacgcga agcgttaaga tcgcccaaag atctccgtgc
      181 ggaaacaaag aaattgaggc actattaaga gattgttgtt gtgcgcgagt gtgtgtcttc
      241 agctgggtgt gtggaatgtc aactgacggg ttgtaaaggg aaaccctgaa atccgaacgg
      301 ccagccaaag caaataaagc tgtgaatacg aattaagtac aacaaacagt tactgaaaca
      361 gatacagatt cggattcgaa tagagaaaca gatactggag atgcccccag aaacaattca
      421 attgcaaata tagtgcgttg cgcgagtgcc agtggaaaaa tatgtggatt acctgcgaac
      481 cgtccgccca aggagccgcc gggtgacagg tgtatccccc aggataccaa cccgagccca
      541 gaccgagatc cacatccaga tcccgaccgc agggtgccag tgtgtcatgt gccgcggcat
      601 accgaccgca gccacatcta ccgaccaggt gcgcctcgaa tgcggcaaca caattttcaa
          ………………………….
     3841 aactgtataa acaaaacgta tgccctataa atatatgaat aactatctac atcgttatgc
     3901 gttctaagct aagctcgaat aaatccgtac acgttaatta atctagaatc gtaagaccta
     3961 acgcgtaagc tcagcatgtt ggataaatta atagaaacga g
//
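Several fields of the entry can be cross-checked against each other. The Python sketch below pulls the sequence length, the CDS span and the base counts out of three lines copied from the entry and verifies that they agree. The regular expressions are tailored to these lines and are not a general GenBank parser; the sketch also assumes the CDS range includes the stop codon.

```python
import re

# Three lines copied from the entry above (whitespace simplified).
LOCUS_LINE = "LOCUS       DRODPPC      4001 bp    INV    15-MAR-1990"
CDS_LINE = "     CDS             1188..2954"
BASE_COUNT_LINE = "BASE COUNT     1170 a   1078 c    956 g    797 t"


def locus_length(line):
    """Sequence length in base pairs from the LOCUS line."""
    return int(re.search(r"(\d+) bp", line).group(1))


def feature_span(line):
    """1-based inclusive (start, end) of a simple 'start..end' location."""
    start, end = re.search(r"(\d+)\.\.(\d+)", line).groups()
    return int(start), int(end)


def base_counts(line):
    """Map each base letter to its count from the BASE COUNT line."""
    return {base: int(count) for count, base in re.findall(r"(\d+) ([acgt])\b", line)}


length = locus_length(LOCUS_LINE)
start, end = feature_span(CDS_LINE)
counts = base_counts(BASE_COUNT_LINE)

cds_length = end - start + 1
codons = cds_length // 3

print("base counts sum to locus length:", sum(counts.values()) == length)
print("CDS length:", cds_length, "nt =", codons, "codons")
print("residues if the last codon is a stop:", codons - 1)
```

The base counts sum to 4001, and the CDS covers 1767 nucleotides, or 589 codons. Assuming the last codon is a stop, that leaves 588 residues, the same length given in the SWISS-PROT entry for this protein.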

Example of sequence database entry for SWISS-PROT

ID   DECA_DROME     STANDARD;      PRT;   588 AA.
AC   P07713;
DT   01-APR-1988 (REL. 07, CREATED)
DT   01-APR-1988 (REL. 07, LAST SEQUENCE UPDATE)
DT   01-FEB-1995 (REL. 31, LAST ANNOTATION UPDATE)
DE   DECAPENTAPLEGIC PROTEIN PRECURSOR (DPP-C PROTEIN).
GN   DPP.
OS   DROSOPHILA MELANOGASTER (FRUIT FLY).
OC   EUKARYOTA; METAZOA; ARTHROPODA; INSECTA; DIPTERA.
RN   [1]
RP   SEQUENCE FROM N.A.
RM   87090408
RA   PADGETT R.W., ST JOHNSTON R.D., GELBART W.M.;
RL   NATURE 325:81-84 (1987)
RN   [2]
RP   CHARACTERIZATION, AND SEQUENCE OF 457-476.
RM   90258853
RA   PANGANIBAN G.E.F., RASHKA K.E., NEITZEL M.D., HOFFMANN F.M.;
RL   MOL. CELL. BIOL. 10:2669-2677(1990).
CC   -!- FUNCTION: DPP IS REQUIRED FOR THE PROPER DEVELOPMENT OF THE
CC       EMBRYONIC DORSAL HYPODERM, FOR VIABILITY OF LARVAE AND FOR CELL
CC       VIABILITY OF THE EPITHELIAL CELLS IN THE IMAGINAL DISKS.
CC   -!- SUBUNIT: HOMODIMER, DISULFIDE-LINKED.
CC   -!- SIMILARITY: TO OTHER GROWTH FACTORS OF THE TGF-BETA FAMILY.
DR   EMBL; M30116; DMDPPC.
DR   PIR; A26158; A26158.
DR   HSSP; P08112; 1TFG.
DR   FLYBASE; FBGN0000490; DPP.
DR   PROSITE; PS00250; TGF_BETA.
KW   GROWTH FACTOR; DIFFERENTIATION; SIGNAL.
FT   SIGNAL        1      ?       POTENTIAL.
FT   PROPEP        ?    456
FT   CHAIN       457    588       DECAPENTAPLEGIC PROTEIN.
FT   DISULFID    487    553       BY SIMILARITY.
FT   DISULFID    516    585       BY SIMILARITY.
FT   DISULFID    520    587       BY SIMILARITY.
FT   DISULFID    552    552       INTERCHAIN (BY SIMILARITY).
FT   CARBOHYD    120    120       POTENTIAL.
FT   CARBOHYD    342    342       POTENTIAL.
FT   CARBOHYD    377    377       POTENTIAL.
FT   CARBOHYD    529    529       POTENTIAL.
SQ   SEQUENCE   588 AA;  65850 MW;  1768420 CN;
     MRAWLLLLAV LATFQTIVRV ASTEDISQRF IAAIAPVAAH IPLASASGSG SGRSGSRSVG
     ASTSTAGAKA FNRFSEPASF SDSDKSHRSK TNKKPSKSDA NRQFNEVHKP RTDQLENSKN
     KSKQLVNKPN HNKMAVKEQR SHHKKSHHHR SHQPKQASAS TESHQSSSIE SIFVEEPTLV
     LDREVASINV PANAKAIIAE QGPSTYSKEA LIKDKLKPDP STYLVEIKSL LSLFNMKRPP
     KIDRSKIIIP EPMKKLYAEI MGHELDSVNI PKPGLLTKSA NTVRSFTHKD SKIDDRFPHH
     HRFRLHFDVK SIPADEKLKA AELQLTRDAL SQQVVASRSS ANRTRYQBLV YDITRVGVRG
     QREPSYLLLD TKTBRLNSTD TVSLDVQPAV DRWLASPQRN YGLLVEVRTV RSLKPAPHHH
     VRLRRSADEA HERWQHKQPL LFTYTDDGRH DARSIRDVSG GEGGGKGGRN KRHARRPTRR
     KNHDDTCRRH SLYVDFSDVG WDDWIVAPLG YDAYYCHGKC PFPLADHRNS TNHAVVQTLV
     NNMNPGKBPK ACCBPTQLDS VAMLYLNDQS TVVLKNYQEM TVVGCGCR
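Each SWISS-PROT line begins with a two-letter line code, which makes the format easy to group by field. The Python sketch below collects lines by code and reads the DR (database cross-reference) lines, which is how this protein entry points back to GenBank/EMBL accession M30116. Only a few lines of the entry are reproduced, and the helper names are illustrative rather than part of any library.

```python
from collections import defaultdict

# A few lines copied from the entry above.
RECORD = """\
ID   DECA_DROME     STANDARD;      PRT;   588 AA.
AC   P07713;
DR   EMBL; M30116; DMDPPC.
DR   PIR; A26158; A26158.
DR   PROSITE; PS00250; TGF_BETA.
"""


def group_by_line_code(text):
    """Collect the content of each line under its two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, _, value = line.partition(" ")
        fields[code].append(value.strip())
    return fields


def cross_references(fields):
    """Map each cross-referenced database to its primary identifier."""
    references = {}
    for value in fields["DR"]:
        database, primary, _secondary = [
            part.strip() for part in value.rstrip(".").split(";")
        ]
        references[database] = primary
    return references


fields = group_by_line_code(RECORD)
print(fields["AC"])
print(cross_references(fields))
```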

Functional classification of E. coli genes according to Monica Riley

I. Intermediary metabolism
   A. Degradation
   B. Central intermediary metabolism
   C. Respiration (aerobic and anaerobic)
   D. Fermentation
   E. ATP-proton motive force interconversions
   F. Broad regulatory functions

II. Biosynthesis of small molecules
   A. Amino acids
   B. Nucleotides
   C. Sugars and sugar molecules
   D. Cofactors, prosthetic groups, electron carriers
   E. Fatty acids and lipids
   F. Polyamines

III. Macromolecule metabolism
   A. Synthesis and modification
   B. Degradation of macromolecules

IV. Cell structure
   A. Membrane components
   B. Murein sacculus
   C. Surface polysaccharides and antigens
   D. Surface structures

V. Cellular processes
   A. Transport/binding proteins
   B. Cell division
   C. Chemotaxis and mobility
   D. Protein secretion
   E. Osmotic adaptation

VI. Other functions
   A. Cryptic genes
   B. Phage-related functions and prophages
   C. Colicin-related functions
   D. Plasmid-related functions
   E. Drug/analog sensitivity
   F. Radiation sensitivity
   G. DNA sites
   H. Adaptations to atypical conditions

The Protein Folding Problem

Protein Folding Problem (Sequence → 3D Structure)

1. Protein folding is thermodynamically determined (Anfinsen's thermodynamic principle)

Protein + Environment

2. Protein folding is a reaction involving other interacting molecules (Principle of molecular interactions)

Protein + Chaperonins +….

Central Paradigm

Bioinformatics: A Long Journey (How far are we away from knowing the God ??)

Sequence to exon                              80%    [Laub 98]
Exons to gene (without cDNA or homolog)       ~30%   [Laub 98]
Gene to regulation                            ~10%   [Hughes 00]
Regulated gene to protein sequence            98%    [Gesteland ]
Sequence to secondary-structure (α, β, c)     77%    [CASP]
Secondary-structure to 3D structure           25%    [CASP]
3D structure to ligand specificity            ~10%   [Johnson 99]

Expected accuracy overall ≈ 0.8 × 0.3 × 0.1 × 0.98 × 0.77 × 0.25 × 0.1 ≈ 0.0005 ?
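The overall figure is simply the product of the per-step accuracies, which treats the steps as independent (an assumption implicit in the slide's arithmetic). A short Python check, using the values listed above:

```python
from math import prod

# Per-step accuracies as listed on the slide ("~" values taken at face value).
STEPS = [
    ("Sequence to exon", 0.80),
    ("Exons to gene", 0.30),
    ("Gene to regulation", 0.10),
    ("Regulated gene to protein sequence", 0.98),
    ("Sequence to secondary structure", 0.77),
    ("Secondary structure to 3D structure", 0.25),
    ("3D structure to ligand specificity", 0.10),
]

# Multiplying assumes the errors of the individual steps are independent.
overall = prod(accuracy for _, accuracy in STEPS)

print(f"overall accuracy: {overall:.6f}")
print(f"rounded:          {overall:.4f}")
```

The product comes to about 0.00045, which rounds to the 0.0005 quoted on the slide.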

Our Focus in Bioinformatics

Perturbation: Environment, Medication, Genetic Engineering

Dynamic Response: Gene Expression, Protein Expression

BioChip

DataBase: Genotype/Phenotype

Symbolic Algorithms/Computing

Analysis

Biology: Molecular Biology, Biochemistry, Genetics

Virtual Cell

Genome Sequencing
