CAP 5510 Lecture 1 Introduction Su-Shing Chen, Bioinformatics CISE

Jul 22, 2020


CAP 5510 Lecture 1: Introduction

Su-Shing Chen, Bioinformatics, CISE


Bioinformatics: What is it? How is it interdisciplinary?

Molecular Biology

Biochemistry Genetics

Computer Science


[Figure: cumulative number of articles in MEDLINE by year, 1965-2000; vertical axis from 0 to 12,000,000. Source: National Library of Medicine.]

Scientific literature continues to accumulate at a rapid rate

• Now over 11 million articles in MEDLINE®

• 400,000 new articles added each year


The Entrez search and retrieval system at NCBI


Three Aspects of Bioinformatics

• DNA and protein sequences: databases and BLAST searches.

• Systems biology: simulation and modeling of cells and pathways.

• Applications: drug discovery, agriculture, and medicine.


DNA and Protein Sequences

• Genomics

• Proteomics

• Comparative Genomics

• Genotypes - Phenotypes


Completely Sequenced Genomes

Genome                                Size (Mbp)  Est. Genes  Completed  Relevance

Archaea
Aeropyrum pernix K1                   1.67        2694        1999       Enzymes
Archaeoglobus fulgidus                2.18        2407        1997       Enzymes
Methanobacterium thermoautotrophicum  1.75        1869        1997       Enzymes
Pyrococcus abyssi                     1.77        1765        1999       Enzymes
Pyrococcus horikoshii                 1.74        2064        1998       Enzymes

Bacteria
Aquifex aeolicus                      1.55        1522        1997       Enzymes
Bacillus subtilis                     4.21        4100        1997       Sporulating Gram-positive bacteria
Campylobacter jejuni                  1.64        1654        2000       Food-borne pathogen
Chlamydia trachomatis                 1.04        894         1998       Human pathogen
Chlamydia pneumoniae                  1.23        1052        1998       Human pathogen
Escherichia coli                      4.64        4289        1998       Model organism, human pathogen
Haemophilus influenzae                1.83        1709        1995       Human pathogen
Helicobacter pylori                   1.67        1553        1997       Stomach ulcers
Helicobacter pylori J99               1.64        1491        1999       Another strain
Mycobacterium tuberculosis            4.41        3918        1998       Tuberculosis
Mycoplasma genitalium                 0.58        480         1995       Small
Mycoplasma pneumoniae                 0.82        677         1996       Walking pneumonia
Rickettsia prowazekii                 1.11        834         1998       Epidemic typhus
Synechocystis PCC6803                 3.57        3169        1996       Photosynthesis
Treponema pallidum                    1.14        1031        1998       Venereal syphilis
Thermotoga maritima                   1.86        1846        1999       Enzymes
Ureaplasma urealyticum                0.75        611         2000       Sexually transmitted pathogen

Eukaryota
Caenorhabditis elegans                97          19000       1998       Worm - model organism
Saccharomyces cerevisiae              12.07       5885        1996       Yeast - model organism
Human chromosome 22                   33.46       600         1999       First fully sequenced


Some Estimated Genome Sizes

Species               Chromosomes  MBP
Mouse                 20           3000
Human                 23           3000
Corn                  10           2500
Rice                  12           450
Loblolly pine         12           20000
Arabidopsis thaliana  5            100


Some Key Public Databases

• NCBI Entrez - http://www.ncbi.nlm.nih.gov/entrez/

• MEDLINE and PubMed - MEDLINE is a subdatabase of PubMed.

• SWISS-PROT - http://www.ebi.ac.uk/ebi_docs/swissprot_db/documentation.html

• DDBJ (DNA Data Bank of Japan) - http://www.ddbj.nig.ac.jp/

• PDB (Protein Data Bank) - http://www.pdb.org, www.rcsb.org/pdb


AF010325: Drosophila melanogaster CHIP (Chip)

LOCUS       AF010325     3291 bp    DNA             INV       02-AUG-1999
DEFINITION  Drosophila melanogaster CHIP (Chip) gene, complete cds.
ACCESSION   AF010325
VERSION     AF010325.1  GI:2245686
KEYWORDS    .
SOURCE      fruit fly.
  ORGANISM  Drosophila melanogaster
            Eukaryota; Metazoa; Arthropoda; Tracheata; Hexapoda; Insecta;
            Pterygota; Neoptera; Endopterygota; Diptera; Brachycera;
            Muscomorpha; Ephydroidea; Drosophilidae; Drosophila.
REFERENCE   1  (bases 1 to 3291)
  AUTHORS   Morcillo,P., Rosen,C., Baylies,M.K. and Dorsett,D.
  TITLE     Chip, a widely expressed chromosomal protein required for
            segmentation and activity of a remote wing margin enhancer in
            Drosophila
  JOURNAL   Genes Dev. 11 (20), 2729-2740 (1997)
  MEDLINE   97477378
   PUBMED   9334334
REFERENCE   2  (bases 1 to 3291)
  AUTHORS   Morcillo,P., Rosen,C., Baylies,M.K. and Dorsett,D.
  TITLE     Direct Submission
  JOURNAL   Submitted (19-JUN-1997) Molecular Biology Program, Memorial
            Sloan-Kettering Cancer Center, 1275 York Avenue, New York, NY
            10021, USA
FEATURES             Location/Qualifiers
     source          1..3291
                     /organism="Drosophila melanogaster"
                     /db_xref="taxon:7227"
                     /chromosome="2"
                     /map="2-106.8 cM; 60B1-2"
                     /clone="P1 Phage DS00543"
     mRNA            join(604..677,928..2964)
                     /gene="Chip"
                     /product="CHIP"
     gene            604..2964
                     /gene="Chip"
                     /allele="wild type"
     intron          678..927
                     /gene="Chip"
                     /note="P-lacW insertion occurs at position 904 in the
                     l(2)k04405 allele"
     exon            928..2964
                     /gene="Chip"
     CDS             963..2696
                     /gene="Chip"
                     /note="GenBank Accession Numbers AF010326, AF010327 and
                     AF010328 encode short forms of the CHIP protein"
                     /codon_start=1
                     /product="CHIP"
                     /protein_id="AAB62574.1"

CDS: Coding Region

NCBI Model


/db_xref="GI:2245687" /translation="MNRRGLNAGNTMTSQANIDDGSWKAVSEGGSMLPASNSAVLNPD GSNQSGFAQGGLPYNSAGNPYPPAGQSSPAGNQSIVFQNSNQPGSNTPQYTSSPAPSG SSTPGPVGAQNIPGNYPQSATAGNFNGPVGGPFGSPSSGLGQFSRPASSGTPFNSGQA GHFSSPTVFSVGGQFNPMPPASPFGHGGNHPMMGGPQQMERIDQGFRRHNSYFSHTEH RVFELNKRLQQRNEESDNCWWDSFTTEFFEDDARLTILFCLEDGPKRYTIGRTLIPRF FRSIYEGGVSDLYFQLKHAKESFHNTSITLDCDQCTVITQHGKPFFTKVCADARLILE FMYDDYMRIKSWHMTIKGHRELIPRSVIGTSLPPDPMLLDQITKNITRAGITNSTLNY LRLCVILEPMQELMSRHKAYALSPRDCLKTTLFQKWQRMVAPPGKKDPQRPPNKRRKR KGSNSGGGNNSNTPPVTNQKRSPSGPSFSLSSQDVMVVGEPTLMGGEFGEEDERLITR LENTQYDGTNAVEHDNHTGFGHADSPISGSNPWSIDRAGAIPASPGNGAAPQNNANIS DIDKKSPIVSQ" polyA_site 2964 /gene="Chip" BASE COUNT 898 a 870 c 773 g 749 t 1 others ORIGIN 1 aaaatatgtt taccattcaa cgacactnga agatgtgcga aattaatgca gtttataaat 61 aacataccta aacacgtact ctagtatata aaaatcgaac actttcgcat cacaccacgg 121 gcttgcagtg tcacaggcca aaattacata aatataaaat aaactaccga cgttaaagca 181 acagtcttac aaccatattt tgattacaaa tttttacggc acggcacata ctaaattatt 241 aagtcaaggg tgcggcgtgg gacaaagtga aacggaatgg aaatggaatc tgcttgagtc 301 ggtcgatagg tttttcttta caattaatca attatttaga gggtatatgt ggcaattgat 361 gtacaatacg ttaagggtgg tatgagattt cgaagaaact aaaaacaaac tatgagaaac 421 agtcttgaat cgttagtttt atttaacgtc atacaaaaaa tttgttcaag aacatgaaaa 481 ccctccaggc gataggtcgt ttttccgttc attccaaacg cgagacaaac ccacgcttgc 541 gttctcacct atcgaaatct gctatttgga aagcgatggc aaaactatcg gtggaggcca 601 atcagttttc ttaaattgta gatacattta tatgtgtttg tgctttaaga attaaaatat 661 tgtatgcctt ttgcaaggtg cgtatacggg agcagtctgc ctgcactgaa atccggccga 721 tccggccctt cgcagcgttt tccatttcga acgcgaacgg gccgtgaaaa gtgtgtgtgt 781 gtacacacat attttgtaaa tgcagtgcga catacagaca agtgcataca catacgcatg 841 ggcaccataa aaatataccg aacccgaaac ccccaaaacg ccgaaataat aatgggttca 901 aactaatgga ttcctctcca ttcacaggcc gacgcgaatg cccgatagaa ccgaccacag 961 gcatgaatcg taggggtttg aacgctggca atacgatgac ttctcaagcg aacattgacg 1021 acggcagctg gaaggcggtc tccgagggcg gatcaatgct gcccgcgtcc aattcggcgg 1081 tcctcaatcc ggacgggagc 
aaccagagcg gcttcgcgca gggcggcctg ccgtacaact 1141 cggccgggaa tccgtacccg cccgccggcc aatcctcgcc agccggcaac cagtccatcg 1201 tgttccagaa ctccaaccag cctggctcga acacacctca gtatacctcc tcacctgctc 1261 cctcgggctc atcgacaccc ggacctgttg gtgcgcagaa cattcccggc aactacccgc 1321 agtcggcgac ggcgggcaat ttcaatggtc cagttggcgg gcccttcggc tcgccatcct 1381 cgggactcgg ccagttcagc cggccggcca gctcgggcac tccgtttaac agtggccagg 1441 ccgggcactt ctcatcgccc acggtattca gcgttggcgg gcaattcaat ccgatgccgc 1501 cggcgtctcc attcggccac ggaggcaacc acccgatgat gggcgggccg cagcagatgg 1561 aacgcatcga ccagggcttc aggcggcaca attcctactt tagccacacg gaacaccgcg 1621 tctttgagct aaacaagcgg ctgcagcagc gcaacgagga gagcgacaac tgctggtggg 1681 actcgtttac cacggagttc ttcgaggatg atgcccggct gaccattctg ttctgcctgg 1741 aggacggacc gaagcggtac accatcgggc gcacgctcat cccgcgcttc ttccgcagca 1801 tatacgaggg cggcgtttcg gacctgtact ttcagctgaa gcatgccaag gagtcgttcc 1861 acaacacgtc catcacgttg gactgcgacc agtgcacggt gatcacgcag cacggcaagc 1921 ccttcttcac gaaggtctgc gccgacgcaa gactgatctt ggagttcatg tacgacgact 1981 acatgcgcat caagtcatgg cacatgacca tcaagggaca ccgcgagctc attccaaggt 2041 ccgtgatcgg caccagtctg ccgcctgacc cgatgctgct agatcaaata accaagaaca 2101 tcacgcgcgc tggcatcacc aactccaccc tcaactatct gcgcctctgc gtcatcctcg 2161 aaccgatgca ggagctgatg tcgcggcaca aggcgtacgc actgagtccg cgcgactgcc 2221 tgaagacaac gttgttccag aagtggcagc gcatggtggc tccgcccggc aagaaggatc

DNA-centered


2281 cgcaacgacc acccaacaag aggcgcaaac ggaagggctc caactcgggt ggtggcaaca 2341 actcgaacac tccgcctgtg acgaaccaga agcgttcccc ctccggtccc agtttctccc 2401 tctcctccca ggacgtgatg gtggtgggcg agccgacgct gatgggcggc gagtttggtg 2461 aggaggacga gcggctgatc acccggctag agaacacgca atacgacgga accaatgccg 2521 tggagcacga taatcacacc ggctttggac acgccgactc gcccatatcc ggctcgaatc 2581 cgtggagcat cgaccgggcg ggagccatcc cggccagccc tgggaacgga gctgccccgc 2641 agaacaacgc gaatatatct gacatagata aaaagagccc cattgtatcg caatagaact 2701 taataaaaca taatatttcg ctatattgtc aaatgtattt tcatacttgc atgtaaaaat 2761 atttaaataa agctttcaag ttttaggaaa tgtatataag acatacatac atattaaatc 2821 tatataattg taccttaagt cgtctactct acacactatc tattataatt aattaaaccg 2881 taggatactt ctgttcttgt cttgactgcc ttgcaaagta cttaatgtgt tgttttagta 2941 aaacataact tttaagcaca acgcaaaaac attgaaagct ctttatttac cttcttatcg 3001 catggaacag caggtgctga cttactttct gggctgccca ttgcaaattg ggtacaaaaa 3061 tgctttgaac atatctcaga tgcttacaac tgaatagaat accaagatag ccaatttgaa 3121 ggaatgctat catatacgaa agattgccat gtatatagag tacgtggcat ctcatcagat 3181 attcggattg taatcacagt gcttagcaat cattaccctt tccgtgattt tgcgaagtcc 3241 agctcctttg cgcatgcacg aatatctgcg ggatacagat acacagtttc g//
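The feature table of this record gives coordinates as location strings such as join(604..677,928..2964). As a minimal sketch, the Python below checks that those coordinates are mutually consistent. The helper names parse_location and span_length are hypothetical, and only the plain a..b and join(...) forms that appear in this record are handled.

```python
import re

def parse_location(location):
    """Return (start, end) pairs, 1-based inclusive, from a simple GenBank
    location string such as '963..2696' or 'join(604..677,928..2964)'.
    Assumption: only these two forms are supported in this sketch."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]

def span_length(location):
    """Total number of bases covered by a location string."""
    return sum(end - start + 1 for start, end in parse_location(location))

# Coordinates copied from the AF010325 feature table.
mrna = "join(604..677,928..2964)"
intron = "678..927"
cds = "963..2696"

cds_len = span_length(cds)   # coding region, stop codon included
codons = cds_len // 3        # number of complete codons in the CDS

print("mRNA length:", span_length(mrna))
print("intron length:", span_length(intron))
print("CDS length:", cds_len, "bases =", codons, "codons")
```

The intron sits exactly in the gap between the two joined mRNA segments, and the CDS length is a multiple of three, as expected for a complete coding sequence.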


NCBI Format: ASN.1

• Abstract Syntax Notation 1 (ASN.1) - ISO (International Organization for Standardization)

• A common format for computers and communications

• GenBank flatfile - human readable format

• Publications - annotation of function and context

• Identifiers - Accession Number


The NCBI Data Elements

• Bibliographic citations

• DNA sequences

• Protein sequences

• 3-D structures

• Seq-Ids - DDBJ/EMBL/GenBank, RefSeq, General Seq-id, Local Seq-id

• BIOSEQs, BIOSEQ-SETs - Genome maps

• SEQ-ANNOT - Alignments

• SEQ-DESCR - Taxonomy

(See Textbook Page 31)


GenBank growth (number of base pairs)

August 2000: 9,545,724,824 (~10 billion)

October 2000: 10,335,692,655

[Figure: GenBank size in base pairs (0 to 1.2E+10) plotted every four months from December 1992 to August 2000.]
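The two GenBank totals quoted for August and October 2000 allow a rough growth estimate. The sketch below assumes, purely for illustration, that growth over those two months was exponential and would continue at the same rate; the slide itself makes no such claim.

```python
import math

# Base-pair totals quoted on the slide.
aug_2000 = 9_545_724_824
oct_2000 = 10_335_692_655
months = 2  # August to October

growth = oct_2000 / aug_2000                # growth factor over the interval
monthly_rate = growth ** (1 / months) - 1   # compound monthly growth rate
# Assumption: the same rate continues, so size doubles after this many months.
doubling_months = math.log(2) / math.log(1 + monthly_rate)

print(f"added: {oct_2000 - aug_2000:,} bp")
print(f"monthly growth: {monthly_rate:.1%}")
print(f"doubling time: {doubling_months:.1f} months")
```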


Data Type: One-dimensional Strings

DNA (DeoxyriboNucleic Acid): {A, C, G, T}

RNA (RiboNucleic Acid): {A, C, G, U}

Protein (Cellular Macromolecules; enzymes, hormones, antibodies) : {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}


Ala: Alanine
Cys: Cysteine
Asp: Aspartic acid
Glu: Glutamic acid
Phe: Phenylalanine
Gly: Glycine
His: Histidine
Ile: Isoleucine
Lys: Lysine
Leu: Leucine
Met: Methionine
Asn: Asparagine
Pro: Proline
Gln: Glutamine
Arg: Arginine
Ser: Serine
Thr: Threonine
Val: Valine
Trp: Tryptophan
Tyr: Tyrosine

A = adenine (adenosine)
G = guanine (guanosine)
C = cytosine (cytidine)
T = thymine
U = uracil

A-T, G-C (complementary principle)

{A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y} = {Ala, Cys, Asp, Glu, Phe, Gly, His, Ile, Lys, Leu, Met, Asn, Pro, Gln, Arg, Ser, Thr, Val, Trp, Tyr}
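Treating sequences as one-dimensional strings over these alphabets makes basic operations one-liners. A minimal Python sketch follows; the function names are hypothetical, and ambiguity codes such as N are deliberately not handled.

```python
DNA = set("ACGT")
RNA = set("ACGU")
PROTEIN = set("ACDEFGHIKLMNPQRSTVWY")

# One-letter to three-letter amino acid codes, in the order listed on the slide.
THREE_LETTER = dict(zip(
    "ACDEFGHIKLMNPQRSTVWY",
    "Ala Cys Asp Glu Phe Gly His Ile Lys Leu Met Asn Pro Gln Arg Ser Thr Val Trp Tyr".split(),
))

# Complementary principle: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def is_valid(seq, alphabet):
    """True if every character of seq belongs to the given alphabet."""
    return set(seq.upper()) <= alphabet

def reverse_complement(dna):
    """Complement each base, then reverse to read the opposite strand 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]

def to_three_letter(protein):
    """Spell out a one-letter protein string with three-letter codes."""
    return "-".join(THREE_LETTER[aa] for aa in protein.upper())

print(reverse_complement("ATGC"))
print(to_three_letter("MKI"))
print(is_valid("ACGU", DNA), is_valid("ACGU", RNA))
```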


THE HOLY GRAIL


Most of bioinformatics is about databasing!

Let us look at some databases!


PHENOTYPES


Standard Metadata Attributes - Global Species Database

Accepted Scientific Name: Full name|Species2000Status|Status|Nomenclatural reference|List of acceptance status references
(Full name: Genus|Epithet|Authorstring; reference: Authorstring|Year|Title|Source)

Synonyms: Full name|Status|Nomenclatural reference|List of synonymic status references

Common names: Common name|Country|Language|List of source references (Red oak = Quercus rubra; human being = Homo sapiens)

Taxonomic scrutiny: Personstring|Date

Source database: Database short name|Version number|Date|URL

Family: Family name of this species

Optional/comment field: Organism|Family|Habit/Life form


Systems Biology: Simulation and Modeling

Functional Genomics, Proteomics, Signal Pathways, Cells, Organisms


System-Level Issues

• Components: gene sequences, protein structures, gene and protein functions, signal pathways.

• Interactions: gene and protein, genotype-phenotype, and functions.

• Knowledge representation: simulation and modeling.


The Objective: building functional gene & protein networks?

• Traditional approach: (1) similarity of DNA & amino acid sequences of known functions, (2) genetic and biochemical analysis

• New approach: associative inference of functionally-related properties in 3D structural complex, metabolic pathway, signal pathway, biological process, and physiological function


A Nonlinear Model of Functional Genomics

Indexing of Linear Gene and Protein Sequences

Relations of Nonlinear Plant & Animal Traits

Transgenic Manipulation


Central Dogma of Biology

How are genes and proteins related? What are coding regions?


Human Genome Project Progress

Genome Watch, Jun 25, 2000

Draft: 65.7%

Finished: 21.1%

Total: 86.8%

Finished, high-quality sequence: 21.1% (752,050,000 bases) as of 27 June 2000. Goal: 100% by 2003.


Locus Information

• A locus is often a gene, characterized by a mutant phenotype or by a DNA sequence, that has been either genetically mapped or localized (by DNA sequence comparison or hybridization) to a particular spot in a genome.


ORF (Open Reading Frame)

• An ORF corresponds to a stretch of DNA that can be translated into a polypeptide. It begins with an ATG start codon and terminates with one of the three stop codons (TAA, TAG, or TGA).

• In practice, a stretch of DNA is usually counted as an ORF only if it could code for a protein of about 100 amino acids or more.

• An ORF is not considered equivalent to a gene or locus until a phenotype has been associated with a mutation in the ORF and/or an mRNA transcript or gene product has been detected.
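The ORF definition above can be sketched as a scan of the three reading frames for an ATG followed in frame by a stop codon. This is a minimal illustration, not a gene finder: it looks at the given strand only, ignores introns, and the function name and the default 100-codon threshold are assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=100):
    """Yield (start, end, frame) for ORFs on the given strand.
    start/end are 0-based with end exclusive; the stop codon is included.
    min_codons counts the start codon but not the stop codon."""
    dna = dna.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # first ATG opens the frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    yield (start, i + 3, frame)
                start = None                    # look for the next ATG

# Hypothetical toy sequence: ATG AAA TTT GGG TAA embedded in frame 2.
toy = "CCATGAAATTTGGGTAACC"
print(list(find_orfs(toy, min_codons=1)))
```

To cover both strands, the same function could be run again on the reverse complement of the sequence.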


Genetics overview I

Central Dogma of Biology

DNA -> RNA -> PROTEIN (DNA to RNA: transcription; RNA to protein: translation)

Only some DNA is transcribed (~70% of the human genome is extragenic; ~92% is not mRNA).

Only some mRNA is translated (from a 4-letter alphabet to a 20-letter alphabet): about 70%, but the variance is large.
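The two steps of the central dogma map onto two small string functions. The sketch below uses the standard genetic code, built compactly from the conventional TCAG codon ordering; the function names and the 21-base input are illustrative, and the code ignores splicing and alternative start sites.

```python
from itertools import product

# Standard genetic code: codons enumerated with bases in TCAG order,
# last position varying fastest; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def transcribe(dna):
    """Coding-strand DNA to mRNA: same sequence with T replaced by U."""
    return dna.upper().replace("T", "U")

def translate(mrna):
    """Translate mRNA from its first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")   # table is keyed by DNA codons
        aa = CODON_TABLE[codon]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

dna = "ATGAAAATCGTATACTGGTAA"   # illustrative: six codons plus a stop
mrna = transcribe(dna)
print(mrna)
print(translate(mrna))
```

The 64 codons map onto 20 amino acids plus the stop signal, which is the 4-letter to 20-letter change of alphabet mentioned above.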


Genetics overview II

A gene is the functional and physical unit of heredity passed from parent to offspring. Genes are pieces of DNA, and most genes contain the information for making a specific protein.

Gene expression is a highly specific process in which a gene is switched on at a certain time and begins production of its protein.

The entire gene is transcribed (~60 kb), then spliced, then exported (~2 kb).


Genetics overview III

Translation of mRNA into protein is done by the ribosome; each tRNA within the ribosome reads a single 3-base codon, adds a single amino acid, and leaves the ribosome to make room for the next tRNA.

Protein structure determination is done via crystallography or NMR. Protein sequencing is slow and expensive compared with DNA sequencing.


The Big Problem: Networks of Transcription Factors


DNA

PROMOTER - TATA-Box: TATAAAA

DPE - DOWNSTREAM PROMOTER ELEMENT (TATA-less)

START TRANSCRIPTION SITE

General Transcription Factors (GTFs)

RNA Polymerase II

TRANSCRIPTIONAL CONTROL: Cell Types, GTFs, DNA Structures, Chromosomal Structure


Some viruses are different!

Retroviruses such as HIV have RNA genomes and use reverse transcription!


The HIV Virus


Reverse Transcription: Converting viral RNA into DNA

• An enzyme (protein) that is part of the human immunodeficiency virus reads the sequence of the viral RNA that has entered the host cell and transcribes it into a complementary DNA sequence. That enzyme is called "reverse transcriptase". Without reverse transcriptase, the viral genome couldn't become incorporated into the host cell's genome, and the virus couldn't reproduce.


A Major Problem: Gene Annotation (Our Midterm Project)

• Genes are regions made up of exons (Craig Venter)

• The human genome (~3 billion bp) needs annotation

• Gene finding: exon and intron regions

• Other functional regions


Finding Gene Functions? Functional Genomics

• DNA Metabolism Function

• Molecular Function

• Cell Function

• www.geneontology.org

• www.fruitfly.org

• genome-www.stanford.edu/Saccharomyces/


Computational Biology

It is somewhat different from Bioinformatics. It is more computational, while bioinformatics is more informational.


> DNA sequence
AATTCATGAAAATCGTATACTGGTCTGGTACCGGCAACACTGAGAAAATGGCAGAGCTCATCGCTAAAGGTATCATCGAATCTGGTAAAGACGTCAACACCATCAACGTGTCTGACGTTAACATCGATGAACTGCTGAACGAAGATATCCTGATCCTGGGTTGCTCTGCCATGGGCGATGAAGTTCTCGAGGAAAGCGAATTTGAACCGTTCATCGAAGAGATCTCTACCAAAATCTCTGGTAAGAAGGTTGCGCTGTTCGGTTCTTACGGTTGGGGCGACGGTAAGTGGATGCGTGACTTCGAAGAACGTATGAACGGCTACGGTTGCGTTGTTGTTGAGACCCCGCTGATCGTTCAGAACGAGCCGGACGAAGCTGAGCAGGACTGCATCGAATTTGGTAAGAAGATCGCGAACATCTAGTAGA

Gene

> Protein sequence
MKIVYWSGTGNTEKMAELIAKGIIESGKDVNTINVSDVNIDELLNEDILILGCSAMGDEVLEESEFEPFIEEISTKISGKKVALFGSYGWGDGKWMRDFEERMNGYGCVVVETPLIVQNEPDEAEQDCIEFGKKIANI

Biological structure & function

The power of computing on the data

Computational Biology: Performing biological experiments in silico


Similarity to bacterial and yeast genes sheds new light on human disease process

Human  638 RHACVEVQDEIAFIPNDVYFEKDKQMFHIITGPNMGGKSTYIRQTGVIVLMAQIGCFVPC 697
Yeast  657 RHPVLEMQDDISFISNDVTLESGKGDFLIITGPNMGGKSTYIRQVGVISLMAQIGCFVPC 716
E.coli 584 RHPVVEQVLNEPFIANPLNLSPQRR-MLIITGPNMGGKSTYMRQTALIALMAYIGSYVPA 642

Portion of DNA mismatch repair protein sequence
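One simple way to quantify the similarity in this alignment is percent identity over aligned columns. The sketch below uses only the last 32 columns of the three rows, which contain no gaps; the function name is hypothetical, and real comparisons would normally use a substitution matrix and a proper alignment tool such as BLAST rather than raw identity.

```python
def percent_identity(a, b):
    """Percent of aligned columns with identical residues.
    Assumes a and b are already aligned and of equal length;
    columns with a gap ('-') in either sequence are skipped."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)

# Last 32 aligned columns of each row in the alignment above.
human = "IITGPNMGGKSTYIRQTGVIVLMAQIGCFVPC"
yeast = "IITGPNMGGKSTYIRQVGVISLMAQIGCFVPC"
ecoli = "IITGPNMGGKSTYMRQTALIALMAYIGSYVPA"

for name, other in (("yeast", yeast), ("E. coli", ecoli)):
    print(f"human vs {name}: {percent_identity(human, other):.1f}% identical")
```

Over this segment the human protein is closer to the yeast protein than to the E. coli one, while the GPNMGGKST stretch is identical in all three.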

Comparative Analysis of Genes

Cell, Vol. 75, 1027-1038, December 3, 1993, Copyright © 1993 by Cell Press

The Human Mutator Gene Homolog MSH2 and its Association with Hereditary Nonpolyposis Colon Cancer

Richard Fishel,* Mary Kay Lescoe,* M. R. S. Rao, Neal G. Copeland,† Nancy Jenkins,† Judy Garber,‡ Michael Kane, and Richard Kolodner

*Department of Microbiology and Molecular Genetics, Markey Center for Molecular Genetics, University of Vermont Medical School

[Figure: first page of the article; the body text is cut off in the slide.]


[Figure: signaling network diagram. Labels: DNA; mRNAs; receptor proteins; substrate proteins; protein kinases & phosphatases; synthesis enzymes & peptide hormones; intracellular signals; intercellular signals; proximal network. Credit: Roger Smorgyi]


Microarray is Popular, but not Magic

It is hard to do data mining. We still need careful lab work in evolution and biochemistry.


Microarray analysis

Microarrays are DNA probes (cDNAs or oligonucleotides) spotted at high density onto glass chips; labeled samples derived from mRNA are hybridized to them. Expression of thousands of genes over hundreds of cell states is measured.

Identifying coregulated genes is not so simple. Many physicists work in this field; one recent publication: Super-paramagnetic clustering of data, Eytan Domany, Physica A 263, 158 (1999)
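A common first step toward finding coregulated genes is to compare expression profiles pairwise. The sketch below uses Pearson correlation with a fixed threshold. It is not the super-paramagnetic clustering method cited above, and the gene names, expression values, and the 0.9 threshold are all hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical log-ratio expression profiles over five cell states.
profiles = {
    "geneA": [0.1, 1.2, 2.0, 1.1, 0.2],
    "geneB": [0.2, 1.0, 2.2, 0.9, 0.1],
    "geneC": [2.0, 1.0, 0.1, 1.1, 1.9],
}

def coexpressed_pairs(profiles, threshold=0.9):
    """Gene pairs whose profiles correlate at or above the threshold."""
    names = sorted(profiles)
    return [(g, h)
            for i, g in enumerate(names)
            for h in names[i + 1:]
            if pearson(profiles[g], profiles[h]) >= threshold]

print(coexpressed_pairs(profiles))
```

High correlation only suggests coregulation; with thousands of genes, noise and multiple testing make many such pairs spurious, which is part of why the problem is not so simple.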


Functional Gene Networks

• Large amount of gene expression data

• Gene network inference

• Global biochemical models - functions and regulations

• Boolean and neural networks

• Functional genomics: Gene expression mappings
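In a Boolean network model, each gene is either on or off, and its next state is a logical function of the current states of its regulators. The following is a minimal sketch with a hypothetical three-gene network; the genes, rules, and synchronous update scheme are assumptions made for illustration.

```python
# Hypothetical 3-gene network: each rule maps the current state
# (a dict of gene -> bool) to that gene's next value.
RULES = {
    "A": lambda s: not s["C"],          # C represses A
    "B": lambda s: s["A"],              # A activates B
    "C": lambda s: s["A"] and s["B"],   # A and B together activate C
}

def step(state):
    """Synchronous update: every gene is computed from the same current state."""
    return {gene: bool(rule(state)) for gene, rule in RULES.items()}

def trajectory(state, max_steps=20):
    """Iterate until a state repeats (or max_steps is reached).
    Returns the visited states in order and the next state after them."""
    seen = []
    while state not in seen and len(seen) < max_steps:
        seen.append(state)
        state = step(state)
    return seen, state

seen, nxt = trajectory({"A": True, "B": False, "C": False})
for s in seen:
    print("".join("1" if s[g] else "0" for g in "ABC"))
```

Starting from A on and B, C off, this toy network never settles; it cycles through a fixed sequence of states, which is the Boolean analogue of an oscillating feedback circuit.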


From expression data to gene network


Functional Proteomics: Functional Protein Networks

• Experimental interactions

• Linking of metabolic pathway neighbors

• Calculation of correlated evolution

• Calculation of correlated mRNA expressions

• Calculation of domain fusion

• Calculation of protein profiles


Computed prediction of leptin’s 3D structure

The protein sequence of leptin is compatible with the protein structure of interleukin-2 (IL-2), suggesting that the two may have similar mechanisms of action.

[Figure panels: IL-2 structure with IL-2 sequence; IL-2 structure with leptin sequence]


Metabolic and Signal Transduction Pathways

• Time and space snapshots of gene and protein expressions

• Regulatory mechanisms - kinetic logic and feedback circuits

• Signal pathway simulators (GEPASI)

• Knowledge pathway builders (POOLS)


Metabolic pathways: Enzyme-catalyzed systems (energy and nutrients -> building blocks: macromolecules, proteins, nucleic acids)

A -> B -> C -> ...     Metabolites
  E1   E2   E3         Enzymes
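A linear pathway A -> B -> C can be simulated by integrating rate equations over time. The sketch below makes strong simplifying assumptions: each enzyme-catalyzed step is treated as first-order with a made-up rate constant (real enzymes usually follow saturating kinetics such as Michaelis-Menten), and integration uses the simple forward-Euler method.

```python
def simulate(a0=1.0, k1=1.0, k2=0.5, dt=0.001, t_end=10.0):
    """Forward-Euler integration of A -> B -> C with first-order rates.
    k1 and k2 stand in for the activities of enzymes E1 and E2
    (hypothetical values chosen only for illustration)."""
    a, b, c = a0, 0.0, 0.0
    for _ in range(int(round(t_end / dt))):
        v1 = k1 * a            # flux through E1: A -> B
        v2 = k2 * b            # flux through E2: B -> C
        a += -v1 * dt
        b += (v1 - v2) * dt
        c += v2 * dt
    return a, b, c

a, b, c = simulate()
print(f"A={a:.4f}  B={b:.4f}  C={c:.4f}  total={a + b + c:.4f}")
```

Total metabolite is conserved at every step, and with these constants nearly all of A has been converted to C by the end of the run; dedicated simulators such as GEPASI handle far richer kinetics than this sketch.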


phenylalanine cinnamate 4-coumarate A

B

C

A 4-coumaroyl-CoA + 3 malonyl-CoA chalcone flavanone

flavonol flavonol glycoside

leucoanthocyanidins anthocyanidins anthocyanins glutathione-conjugated anthocyanins

B dihydroflavonol

flavan-4-ol phlobaphenes, 3-deoxyanthocyanins

C-glycosyl flavanone C-glycosyl flavones(e.g., maysin)

C


Agri-ecosystem

Biosphere

Resistant maize hybrids

Maize earworm

Crop damage

Maysin C-glycosyl

Flavonoid pathway

Genotypes

Functional Protein Network


KEGG Database


E-Cell Simulation


Conclusions

• Bioinformatics plays a key role in 21st-century science and technology.

• Computational paradigms are non-traditional.

• Bioinformatics must be practical for biotechnology.