
LBNL-44460

ERNEST ORLANDO LAWRENCE BERKELEY NATIONAL LABORATORY

Computational Biology and High Performance Computing

Manfred Zorn, Teresa Head-Gordon, Adam Arkin, Brian Shoichet, and Horst D. Simon

National Energy Research Scientific Computing Division

October 1999. To be presented at Supercomputing 1999, Portland, OR, November 14-19, 1999, and to be published in the Proceedings.


DISCLAIMER

This document was prepared as an account of work sponsored by the United States Government. While this document is believed to contain correct information, neither the United States Government nor any agency thereof, nor The Regents of the University of California, nor any of their employees, makes any warranty, express or implied, or assumes any legal responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by its trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof, or The Regents of the University of California. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof, or The Regents of the University of California.

Ernest Orlando Lawrence Berkeley National Laboratory is an equal opportunity employer.



Manfred Zorn, Teresa Head-Gordon, Adam Arkin, Brian Shoichet, and Horst D. Simon

National Energy Research Scientific Computing Division Ernest Orlando Lawrence Berkeley National Laboratory

University of California Berkeley, California 94720

October 1999

This work was supported by the Director, Office of Science, Office of Advanced Scientific Computing Research, Division of Mathematical, Information, and Computational Sciences, of the U.S. Department of Energy under Contract No. DE-AC03-76SF00098.



Presenters:

Manfred Zorn - Co-Head, Center of Bioinformatics and Computational Genomics, NERSC

Teresa Head-Gordon - Scientist, Physical Biosciences Division, LBNL


Adam Arkin - Scientist, Physical Biosciences Division, LBNL

Brian Shoichet, Northwestern University

Organizer: Horst D. Simon - NERSC Director

November 15, 1999

Supercomputing 99-Portland

Abstract

• The pace of extraordinary advances in molecular biology has accelerated in the past decade, due in large part to discoveries coming from genome projects on human and model organisms. The advances in the genome project so far, happening well ahead of schedule and under budget, have exceeded any dreams of its protagonists, let alone formal expectations. Biologists expect the next phase of the genome project to be even more startling in terms of dramatic breakthroughs in our understanding of human biology, the biology of health and of disease. Only today can biologists begin to envision the experimental, computational and theoretical steps necessary to exploit genome sequence information for its medical impact, its contribution to biotechnology and economic competitiveness, and its ultimate contribution to environmental quality.


Abstract (cont.)

• High performance computing has become one of the critical enabling technologies that will help to translate this vision of future advances in biology into reality. Biologists are increasingly becoming aware of the potential of high performance computing. The goal of this tutorial is to introduce the exciting new developments in computational biology and genomics to the high performance computing community.


Tutorial Outline

• 1:30 - 2:00 p.m. Overview of Computational Biology -- Teresa Head-Gordon

• 2:00 - 3:00 p.m. Bioinformatics -- Manfred Zorn

• 3:00 - 3:30 p.m. Break

• 3:30 - 4:00 p.m. Protein Structure Prediction and Folding -- Teresa Head-Gordon

• 4:00 - 4:30 p.m. Docking/Molecular Recognition -- Brian Shoichet

• 4:30 - 5:00 p.m. Cellular Networks -- Adam Arkin


Computational Challenges in Structural and Functional Genomics


Teresa Head-Gordon Physical Biosciences and Life Sciences Divisions

Lawrence Berkeley National Laboratory

November 15, 1999


(1) Why computational biology?

(2) Community effort to define problems with genuine computational complexity

Genome analysis, gene modeling, sequence-based annotation

Low resolution fold prediction: Single Molecule

High resolution structure prediction and protein folding: Single Molecule

Molecular recognition or Docking: Multi-molecule complexes

Cellular Decision modeling

(3) Putting it all together:

Deinococcus radiodurans

Center for Integrative Physiome Analysis (CIPhA)


Revolutionary Experimental Efforts in Biology

Sequence

• Genome projects: Microbial organisms, C. elegans, Human

Structure

• Structural Genomics Initiative

• High throughput effort underway: NIH, new beamlines; LBNL: ALS

Function

• Functional Annotation Initiatives

• Gene deletion projects

• Yeast two-hybrid screening

• Gene expression micro-arrays

• In vivo GFP protein (kinetics)

Computational Biology White Paper

http://cbcg.lbl.gov/ssi-csb

A technical document to define areas of biology exhibiting computational problems of scale

Organization:

• Introduction to biological complexity and needs for advanced computing (1)

• Scientific areas (2-6)

• Computing hardware, software, CSET issues (7)

• Appendices

For each scientific chapter:

• illustrate with state of the art application (current generation HPC platform)

• define algorithmic kernels

• deficiencies of methodologies

• define what can be accomplished with 100 teraflop computing

→ Community document

→ More organized CB community in government labs, universities

→ Support for CB by the broader biological community

High-Throughput Genome Sequence Assembly, Modeling, and Annotation

The Genome Channel Browser to access and visualize current data flow, analysis and modeling. (Manfred Zorn, NERSC)

→ Genome sequencing and annotation → Bioinformatics

100,000 human genes; genes from other organisms

Structure/functional annotation at the sequence level

Computation to determine regions of a genome that might yield new folds

Experimental Structural Genomics Initiative

Functional annotation at the structure level by experiment


Characterize the Link Between Protein Sequence and Fold Topology

Sequence Assignments to Protein Fold Topology (David Eisenberg, UCLA)

→ Experimental Structural Genomics Initiative

Define basis set of folds: ~10^3 structures to be determined

Predict Fold Topology from Computation (~10^5 folds)

Functional annotation at the structural level by computation

Low Resolution Fold Topologies to High Resolution Structure

One microsecond simulation of a fragment of the protein, Villin. (Duan & Kollman, Science 1998)

Influenza virus poised above a model of a lipid membrane will involve a 100,000 atom MD simulation over long timescales to understand this step in the mechanism of viral infection. (Tobias, UCI)

Low Resolution Structures from Predicted Fold Topology

Fold class gives some idea of biological function, but ...

Higher Resolution Structures with Biochemical Relevance

Drug design, bioremediation, diseases of new pathogens

Simulating Molecular Recognition/Docking

Changes in the structure of DNA that can be induced by proteins. Through such mechanisms proteins regulate genes, repair DNA, and carry out other cellular functions.

Improvements in Methodology and Algorithms of Higher Resolution Structure

→ Breaking down size, time, length scale bottlenecks (IP, algorithms, teraflop computing)

Protein, DNA recognition, binding affinity, mechanism with which drugs bind to proteins

Simulating two-hybrid yeast experiments

Protein-protein and Protein-nucleic acid docking


Modeling the Cellular Program

elements (i.e. they cross-talk). From the Signaling PAthway Database (SPAD) (http://www.grt.kyushu-u.ac.jp/spad/)

→ Integrating Computational/Experimental Data at all levels

Sequence, structural, functional annotation (Virtually all biological initiatives)

Simulating biochemical/genetic networks to model cellular decisions

Modeling of network connectivity (sets of reactions: proteins, small molecules, DNA)

Functional analysis of that network (kinetics of the interactions)


Implicit Collaborations Across the DOE Mission Sciences

Computer Hardware & Portability

• Applications described running on various platforms: T3D, T3E, IBM SP's, ASCI Red, Blue

Information Technologies and Database Management

• Integrating biological databases; CORBA and Java

• Data Warehousing

• ultra-high-speed networks

Ensuring Scalability on Parallel Architectures

• implicit algorithmic scaling paradigm/software library support tools for effective parallelization strategies: 100 teraflop

Meta Problem Solving Environments

• geographically distributed software paradigm: "plug and play" paradigm

Visualization

• Querying data which is "information dense"



Feedback from Biotech Industry Meeting

Jim Cavalcoli, Ph.D. Bioinformatics Manager, PDLMG Parke-Davis, Warner-Lambert

Pete Smietana, Ph.D. Senior Staff Software Engineer, Bioinformatics Ciphergen

Julie Rice Computational Chemist IBM-Almaden

LBNL 2/25/99

Patrick O'Hara VP, BioMolecular Informatics ZymoGenetics, Inc., Seattle, WA

Peter Karp, Ph.D. Scientific Fellow Pangea Systems

Eric Martin Sr. Scientist, Small Molecule Discovery Chiron

Herve Recipon Asst. dir. bioinformatics diaDexus (Incyte)

Rick Bott X-ray crystallographer Genencor

LBNL: Gilbert, Head-Gordon, Holbrook, Mian, Rokhsar, Simon, Spengler, Zorn

We want to listen to Biotech industry perspective on Computational Biology white paper

Is there strong objection to any of the content? NO, very supportive

Are there other areas to be included, stronger emphasis placed? Will be a new chapter on databases: integrating, querying, visualization

Technical input: contribute a "vignette" on an important Comp. Bio. application: Parke-Davis, Chiron, ZymoGenetics, Pangea

Center for Integrative Physiome Analysis (CIPhA)

NCRR submitted 2/1/99 P.I.: Adam Arkin

Cell cycle, asymmetric division and differentiation in Caulobacter crescentus

Analysis of developmental pathways in C. elegans

Analysis of databases of two-hybrid interactions

The role of cytomechanical and nuclear structure in mammary gland transformation

Interrelationships among the various tools and databases used and developed by the Center. Blue rectangles are databases built by the Center (with the exception of Interact 1.0, which is provided courtesy of Roger Brent, Molecular Sciences Institute). Green boxes are off-site databases. Hexagons are tools to be developed by this Center.

Adam Arkin, Mina Bissell, Roger Brent, Silvia Crivelli, Tarek Elaydi, Teresa Head-Gordon, Stephen Holbrook, Stuart Kim, Casimir Kulikowski, Harley McAdams, Saira Mian, Ilya Muchnik, Lucy Shapiro, NERSC


Deinococcus Radiodurans (DR: Strange Berry That Withstands Radiation)

Bacteria isolated from tins of spoiled meat given "sterilizing" doses of γ radiation.

3×10^6 base pairs, or ~3000 protein products, fully sequenced by TIGR under DOE/OBER sponsorship

Three components to DR's successful DNA repair strategy:

• specifics of the DNA repair mechanism

• the fact that it is multi-genomic

• coupling of repair, replication, export of damaged DNA from intracellular medium

Propose to construct molecular models of key components of the DNA repair system:

• Damaged DNA

• Multigenomic repair intermediates such as Holliday junctions

• Proteins known or yet to be discovered to be involved in DNA repair

• Protein-protein or protein-nucleic acid complexes that couple repair, replication, transport

Developing better fold recognition, comparative modeling, and ab initio prediction methods, and docking methods to describe macromolecular complexes.

Application of methodologies will be to fully and completely annotate the DR genome

Learn underlying components of highly-honed strategies for DNA repair in DR.

Involves significant portions of community white paper on high end computing needs


The Need for Advanced Computing for Computational Biology

Computational Complexity arises from inherent factors:

100,000 gene products just from human; genes from many other organisms

Experimental data is accumulating rapidly

N^2, N^3, N^4, etc. interactions between gene products

Combinatorial libraries of potential drugs/ligands

New materials that elaborate on native gene products from many organisms

Algorithmic Issues to make it tractable

Objective Functions

Optimization

Treatment of Long-ranged Interactions

Overcoming Size and Time scale bottlenecks

Statistics


Acknowledgements for Community White Paper in Computational Biology

The First Step Beyond the Genome Project: High-Throughput Genome Assembly, Modeling, and Annotation

P. LoCascio, R. Mural, J. Snoddy, E. Uberbacher, ORNL; S. Mian, F. Olken, S. Spengler, M. Zorn, LBNL; David States, Washington University

From Genome Annotation to Protein Folds: Comparative Modeling and Fold Assignment

D. Eisenberg, UCLA; A. Lapedes, LANL; A. Sali, Rockefeller University; B. Honig, Columbia University

Low Resolution Folds to High Resolution Protein Structure and Dynamics

C. Brooks, Scripps Research Institute; P. Kollman & Y. Duan, UCSF; A. McCammon & V. Helms, UCSD; G. Martyna, Indiana University; D. Tobias, UCI; T. Head-Gordon, LBNL

Biotechnology Advances from Computational Structural Genomics: In Silico Drug Design and Mechanistic Enzymology

R. Abagyan, NYU, Skirball Institute; P. Bash, ANL; J. Blaney, Metaphorics, Inc.; F. Cohen, UCSF; M. Colvin, LLNL; I. Kuntz, UCSF

Linking Structural Genomics to Systems Modeling: Modeling the Cellular Program

A. Arkin & D. Wolf, LBNL; P. Karp, Pangea; S. Subramaniam, U Illinois Urbana

Implicit Collaborations Across the DOE Mission Sciences

M. Colvin & C. Musick, LLNL; T. Gaasterland, ANL (now Rockefeller); S. Crivelli & T. Head-Gordon, LBNL; G. Martyna, Indiana University

Bioinformatics

Manfred D. Zorn

November 15, 1999


Overview

• 30 seconds of Biology

• DNA Sequencing: View from 10,000 feet

• Genome Analysis

• Genome Projects

• Identify a possible gene

• Characterize a gene

• Large-scale Genome Annotation

• What's supercomputing got to do with it?

• Challenges


Biology is Special

Life is characterized by

• Individuality

• Historicity

• Contingency

• high (digital) information content


Basic Biology

[Figure: cell diagram with labels: Nucleus, Rough endoplasmic reticulum, Smooth endoplasmic reticulum, Golgi apparatus, Mitochondrion. Caption: DNA packs tightly into metaphase chromosomes.]

Fundamental Dogma

DNA → RNA → Proteins → Circuits → Phenotypes → Populations


Dodson, 1998

DNA Codes


DNA Sequencing


Read base code from storage medium!

• Read length: About 600 bases at once

• Reader capacity: 100 lanes in parallel in about 2-5 hours


Sequencing: "bird's eye view"

• Prepare DNA

  • about a trillion DNA molecules

• Do the sequencing reactions

  • synthesize a new strand with terminators

• Separate fragments

  • by time, length = constant

• Sequence determination

  • automatic reading with laser detection systems


Sequence Traces


Human Genome Project - Goals

• Construction of a high-resolution genetic map

• Production of a variety of physical maps of all human chromosomes and of selected model organisms

• Determination of the complete sequence of human DNA and DNA of selected model organisms

• Development of capabilities for collecting, storing, distributing, and analyzing the data produced

• Creation of appropriate technologies necessary to achieve these objectives



Genome Projects

• Model organisms sequenced

  • E. coli 4.5 Mb

  • S. cerevisiae

  • C. elegans 100 Mb

  • Dozens of bacteria 1-6 Mb

  • D. melanogaster 140 Mb

• Human

  • 408 Mb

  • ~14% of the genome

Base Pairs in GenBank

DNA Analysis

Disassemble the base code!

• Find the genes

  • Heuristic signals

  • Inherent features

  • Intelligent methods

• Characterize each gene

  • Compare with other genes

  • Find functional components

  • Predict features



What is a Gene?

University of Pennsylvania Computational Biology and Informatics Laboratory

Heuristic Signals

DNA contains various recognition sites for internal machinery

• Promoter signals

• Transcription start signals

• Start Codon

• Exon, Intron boundaries

• Transcription termination signals

Inherent Features

DNA exhibits certain biases that can be exploited to locate coding regions

• Uneven distribution of bases

• Codon bias

• CpG islands

• In-phase words

• Encoded amino acid sequence

• Imperfect periodicity

• Other global patterns
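As a rough illustration of two of the biases listed above (uneven base distribution and CpG islands), the sketch below computes GC content and the observed/expected CpG ratio over sliding windows. The function names, window parameters, and the observed/expected formula are assumptions chosen for this example, not something taken from the slides, and real gene finders combine many more signals.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: count(CG) * N / (count(C) * count(G))."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def windows(seq: str, size: int, step: int):
    """Yield (start, window) pairs for a sliding window over seq."""
    for start in range(0, max(len(seq) - size, 0) + 1, step):
        yield start, seq[start:start + size]


if __name__ == "__main__":
    # Hypothetical toy sequence: a CG-rich stretch flanked by AT-rich DNA.
    dna = "ATATATTAAT" + "CGCGGCGCCG" + "TTATAATATA"
    for start, win in windows(dna, size=10, step=10):
        print(start, round(gc_content(win), 2), round(cpg_obs_exp(win), 2))
```

Windows whose GC content and CpG ratio are both high relative to the rest of the sequence would be candidate CpG islands under this simplified view.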

[Figure: Inherent Features (Solovyev, 1994)]

Intelligent Methods

Pattern recognition methods weigh inputs and predict gene location

• Neural Networks

• Hidden Markov Models

• Stochastic Context-Free Grammars


Neural networks

[Figure: neural network diagram with labels: 6-mer vocabulary, 6-mer in-frame, Markov, Isochore GC Composition, Exon GC Composition, Size prob. profile, Length, Donor, Acceptor, Intron Vocabulary 1, Intron Vocabulary 2]

Hidden Markov Models

Silent states

Production states


Characterize a Gene

Collect clues for potential function

• Comparison with other known genes, proteins

• Predict secondary structure

• Fold classification

• Gene Expression

• Gene Regulatory Networks

• Phylogenetic comparisons

• Metabolic pathways

Large-scale Genome Annotation

• Multi-laboratory Project

• Standard Annotation of Genomes

  • Genome Channel

  • Genome Catalog

• Comprehensive integration of

  • Analysis tools

  • Data management systems

  • Data mining

  • User services

• Extensible Framework

• High-performance computing

• Data integration technology

• Artificial intelligence

[Figure labels: Compute Servers, Data Warehouse, BioParameters, BioSequences]


Annotation Pipeline

[Figure: annotation pipeline diagram; legible labels: Data Sources, Genome Genies]

Genome Channel

Feature Display


Gene Search - BEAUTY Results

[Figure: screenshot of a BEAUTY results page listing sequences producing significant alignments, with bit scores and E-values]

Highlights - Data Analysis

Objects, databases, processes

What's supercomputing got to do with it?

• Complexity of the information

• Amount of data

• Most applications are trivially parallel



Layers of Information

The same base sequence contains many layered instructions!

• Chromosome structure and function

  • Telomeres, centromeres

• Gene regulatory information

  • Enhancers, promoters

• Instructions for gene structure

• Instructions for protein

• Instructions for protein post-processing and localization

Moore's Law and Genomics

Spec95 Integer Performance vs. Genbank Search

[Figure: GenBank search time and compute performance plotted by year, 1990-1998 (States, 1998)]


The Shape of the Wave

• 1999

  ✓ JGI releases 150 Mbases draft

  ✓ Celera releases the sequence of Drosophila (140 Mb)

  ✓ Public "draft" effort reaches halfway point (1,500 Mb)

  ✓ 20 more Microbial genomes completed (80 Mb but 60,000 genes)

  ✓ First release of Celera "shotgun" (9,000 Mb)

• 2000

  ✓ JGI releases 150 Mbases draft

  ✓ Public "draft" completed (1,500 Mb)

  ✓ Mouse "draft" begins (500 Mb - comparisons with human)

  ✓ Two more Celera shotgun releases (18,000 Mb)

  ✓ 40 more Microbial genomes sequenced (160 Mb - 120,000 genes)

CPU Requirements

• Current annotation

  • 250 Mbases DNA yield ~125 Gbytes of data

  • It takes ~7.5 days on 20 workstations ≈ 3,600 node-hours

• Celera Data

  • 9 Gbases (36x) in small pieces every 3 months ≈ 2,000 hr

  • Analysis time approx. quadratic (1300x)

  • 1,300 x 3,600 node-hours / 2,000 hr = 2,340 nodes

• Celera Sequencing

  • Assembly of 1.7 million reads in 25 hrs

  • Annotation 8-10 Mbases per month with 6 FTE

  • Assembly of Human Genome: expected ~3 months
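The node estimate on this slide can be retraced step by step. The short sketch below redoes the arithmetic using only the figures quoted above; the function and variable names are made up for illustration, and the quadratic scaling exponent is the slide's own stated approximation.

```python
def node_hours(days: float, workstations: int) -> float:
    """Total node-hours for a run of `days` on `workstations` machines."""
    return days * 24 * workstations


def nodes_needed(base_node_hours: float, data_scale: float,
                 wall_clock_hours: float, exponent: float = 2.0) -> float:
    """Nodes required if analysis cost grows as data_scale ** exponent."""
    return base_node_hours * data_scale ** exponent / wall_clock_hours


base = node_hours(7.5, 20)               # 3,600 node-hours for 250 Mbases
scale = 9_000 / 250                      # Celera release is 36x larger
exact = nodes_needed(base, scale, 2_000) # uses 36**2 = 1,296
slide = base * 1_300 / 2_000             # slide rounds 1,296 up to 1,300

if __name__ == "__main__":
    print(base, scale, exact, slide)
```

With the unrounded factor of 1,296 the estimate comes to about 2,333 nodes; the slide's 2,340 follows from rounding the factor to 1,300.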

Projected Base Pairs

Sequence Assembly

• Complexity

  • Adding a day's read of 100 Mb to a billion base pairs of contig would require 100 Pops (peta-operations)

  • A 1 Tops machine would take about one day to process 100 Mbases
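One way to read these figures, assuming the cost is roughly one operation per pairing of a new base with an existing contig base (an assumption made here to reproduce the numbers, not a statement of the actual assembly algorithm), is the following back-of-the-envelope check. All names are illustrative.

```python
def assembly_operations(new_bases: float, contig_bases: float) -> float:
    """Operation count under the assumed new_bases x contig_bases cost model."""
    return new_bases * contig_bases


ops = assembly_operations(100e6, 1e9)  # 100 Mb of reads against 1 Gb of contig
peta_ops = ops / 1e15                  # operations expressed in units of 10^15
seconds = ops / 1e12                   # run time on a 1 Tops (10^12 ops/s) machine
days = seconds / 86_400

if __name__ == "__main__":
    print(peta_ops, seconds, round(days, 2))
```

Under this model the count is 10^17 operations, i.e. 100 peta-operations, and a machine sustaining 10^12 operations per second needs 10^5 seconds, a little over one day.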



Assembly / Integration / Modeling

• BAC end integration

  ✓ JGI draft (1st half) = 300 Pops

  ✓ first Celera release requires = 3,000 Pops

• Draft and whole genome shotgun integration

  ✓ JGI draft (1st half) + Celera first release = 1,300 Pops

• Gene modeling

  ✓ Celera first release (9 Gbases) - 1 day of Paragon time

• Placing STSs

  ✓ JGI draft requires = 9 Pops

  ✓ Celera first release = 90 Pops


Data Transfer

[Chart: value by transfer period]

year      0.03
month     0.39
week      1.65
day       11.60
12 hours  23.10
1 hour    277.10

Challenges

• Discovering new biology

• Lack of software integration

• Beginning to build high-performance applications

• Shortage of personnel


Comparative Genome Analysis

[Figure: gene neighborhood comparison across genomes labeled aful, bbur, bsub, ecoli, hinf, hpyl26695, mgen, mjan; legible gene labels include smg, hslU, thyA]

Alternatively Spliced?

[Figure: 4.1R RNAs]

One Gene - Many Proteins

[Figure: exon map with alternative start codons ATG-1 and ATG-2 and exons numbered 14 through 20]


9p21 Gene Cluster is a Nexus of the Rb and p53 Pathways

[Figure: pathway diagram with labels: Extracellular stimuli (i.e. TGF-β), Oncogenic stimuli (i.e. H-Ras), pRb, E2F, Cell Cycle Progression]

Conboy 1998

Credits

• NERSC/LBNL

• John Conboy

• Donn Davy

• Inna Dubchak

• Sylvia Spengler

• Denise Wolf

• Eric P. Xing

• Manfred Zorn

• ORNL

• Ed Uberbacher

• Richard Mural

• Phil LoCascio

• Sergey Petrov

• Manesh Shah

• Morey Parang


Protein Fold Recognition, Structure Prediction, and Folding

Teresa Head-Gordon Physical Biosciences and Life Sciences Divisions


November 15, 1999


Protein Fold Recognition, Structure Prediction, and Folding

(1) Drawing analogies with known protein structures Sequence homology, Structural Homology

Inverse Folding, Threading

(2) Ab initio folding: the ability to follow kinetics, mechanism

robust objective function

severe time-scale problem

proper treatment of long-ranged interactions

(3) Ab initio prediction: the ability to extrapolate to unknown folds

multiple minima problem

robust objective function

Stochastic Perturbation and Soft Constraints

(4) Simplified Models that Capture the Essence of Real Proteins

Lattice and Off-Lattice Simulations

Off-Lattice Models that Connect to Experiments: Whole Genomes?


What is a protein?

A biopolymer which is distinct from a heteropolymer in one very important way:

Its 3-D structure is uniquely tailored to perform a specific function

• Alanine

• Proline

• Threonine

• Tryptophan

• Isoleucine

NMR, X-ray and electron crystallography solve structures slowly (1/2-3 yrs.)


The "Beads" are Chemically Complex Structures


Protein Fold Recognition: Threading

Sequence Assignments to Protein Fold Topology (David Eisenberg, UCLA)

Take a sequence with unknown structure and align onto structural template of a given fold

Score how compatible that sequence is based on empirical knowledge of protein structure

Right now 25-30% of new sequences can be assigned with high confidence to fold class

100,000's of sequences and 10,000's of structures (each of order 10^2-10^3 amino acids long)


Protein Fold Recognition: Threading

Computational Approach:

Dynamic programming: capable of finding optimal alignments if optimal alignments of subsequences can be extended to optimal alignments of whole

objective functions that are one-dimensional: $E = \sum_i V_i + \sum V_{gap}$

Complexity: all to all comparison of sequence to structure scales as L^2

Whole human genome: 10^13 flops

Improve Objective function:

Take into account structural environment

3D→1D: dynamic programming, L^2

Build pairwise or multi-body objective function

NP-hard if: variable-length gaps and model nonlocal effects such as distance dependence

Recursive dynamic programming, Hidden Markov models, stochastic grammars

Complexity: all to all comparison of sequence to structure scales as L^3

Whole human genome: ~10^16 flops
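The dynamic-programming idea described on this slide can be sketched as a global alignment of a sequence onto a one-dimensional structural template, with a per-position term plus a gap term, mirroring an objective of the form sum of V_i plus sum of V_gap. Everything named below (`thread`, `toy_score`, the buried/exposed template alphabet, the hydrophobic set, the linear gap penalty) is a hypothetical simplification for illustration; it maximizes a score rather than minimizing an energy, and it is not the scoring function used by any particular threading program.

```python
HYDROPHOBIC = set("AVLIMFWC")


def toy_score(residue: str, environment: str) -> float:
    """+1 if a hydrophobic residue sits in a buried ('B') site or a polar
    residue in an exposed ('E') site, otherwise -1 (illustrative only)."""
    return 1.0 if (residue in HYDROPHOBIC) == (environment == "B") else -1.0


def thread(sequence: str, template: str, score=toy_score, gap: float = -1.0):
    """Globally align sequence onto template; return (best_score, pairs),
    where pairs lists (sequence_index, template_index) matches."""
    n, m = len(sequence), len(template)
    # best[i][j]: best score for sequence[:i] aligned with template[:j]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * gap
    for j in range(1, m + 1):
        best[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + score(sequence[i - 1], template[j - 1]),
                best[i - 1][j] + gap,   # residue i left unaligned
                best[i][j - 1] + gap,   # template site j skipped
            )
    # Trace back from the corner to recover one optimal alignment.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j - 1] + score(sequence[i - 1], template[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m], pairs


if __name__ == "__main__":
    print(thread("AKL", "BEB"))
    print(thread("AL", "BEB"))
```

The two nested loops fill a table with one cell per (residue, template site) pair, which is where the quadratic cost per sequence-structure comparison comes from.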


Computational Protein Folding

One microsecond simulation of a fragment of the protein, Villin. (Duan & Kollman, Science 1998)

(1) robust objective function ✓

all atom simulation with molecular water present: some structure present

(2) severe time-scale problem ✓

required 10^9 energy and force evaluations: parallelization (spatial decomposition)

(3) proper treatment of long-ranged interactions ✗

cut-off interactions at 8 Å, poor by known simulation standards

(4) Statistics (1 trajectory is anecdotal) ✗

Many trajectories required to characterize kinetics and thermodynamics



Computational Protein Folding

(1) Size-scaling bottlenecks: Depends on complexity of energy function, V

Empirical (less accurate): CN2; ab initio (more accurate):CN3 or worse; c«C

empirical force field used

"long-ranged interactions" truncated so CM2 scaling; M < N

spatial decomposition, linked lists

(2) Time-scale of motions bottlenecks (Δt)

$\mathbf{r}_i(t+\Delta t) = 2\mathbf{r}_i(t) - \mathbf{r}_i(t-\Delta t) + \frac{\mathbf{f}_i(t)}{m_i}(\Delta t)^2 + O[(\Delta t)^4]$; $\mathbf{v}_i(t) = \frac{\mathbf{r}_i(t+\Delta t) - \mathbf{r}_i(t-\Delta t)}{2\Delta t} + O[(\Delta t)^3]$

$\mathbf{f}_i = m_i \mathbf{a}_i = -\nabla_{\mathbf{r}_i} V(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N)$

Use timestep commensurate with fastest timescale in your system

bond vibrations: 0.01 Å amplitude: 10^-15 seconds (1 fs)

Shake/Rattle bonds (2 fs)

Multiple timescale algorithms (~5 fs) (not used here)
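The position update above can be sketched in one dimension as follows. This is a minimal illustration, assuming a single particle and a caller-supplied force function; the names `verlet_trajectory` and `central_velocity` are hypothetical, and the first step is bootstrapped with a Taylor expansion, which the slide does not specify.

```python
def verlet_trajectory(x0, v0, force, mass, dt, n_steps):
    """Integrate x(t+dt) = 2x(t) - x(t-dt) + (f/m) dt^2 for one particle."""
    # Assumption: estimate x(-dt) from the initial velocity to start the recursion.
    x_prev = x0 - v0 * dt + 0.5 * (force(x0) / mass) * dt * dt
    x = x0
    positions = [x0]
    for _ in range(n_steps):
        x_next = 2.0 * x - x_prev + (force(x) / mass) * dt * dt
        x_prev, x = x, x_next
        positions.append(x)
    return positions


def central_velocity(positions, i, dt):
    """Velocity at step i from the positions one step before and after."""
    return (positions[i + 1] - positions[i - 1]) / (2.0 * dt)
```

For a unit-mass harmonic spring, `verlet_trajectory(1.0, 0.0, lambda x: -x, 1.0, 0.01, 628)` follows cos(t) closely over one period. The timestep has to resolve the fastest motion, which is why constraining bond vibrations allows a larger step.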



1 Microsecond simulation of Villin Headpiece in Water

Generate 10^9 steps; assume 1 teraflop machine; 1000 flops per energy/force evaluation
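The arithmetic behind this estimate can be sketched as follows. It assumes the quoted 1000 flops apply to each pairwise term under N^2 scaling and to each atom under N scaling; that reading is an interpretation, since the slide does not spell it out, and the function name `log10_seconds` is hypothetical.

```python
import math


def log10_seconds(n_atoms, scaling, n_steps=1e9,
                  flops_per_eval=1e3, machine_flops=1e12):
    """log10 of wall-clock seconds for the whole trajectory.

    scaling=2 models an all-pairs evaluation, scaling=1 a linear-scaling one.
    """
    evals_per_step = n_atoms ** scaling
    total_flops = n_steps * evals_per_step * flops_per_eval
    return math.log10(total_flops / machine_flops)
```

Under these assumptions a 10,000-atom system needs about 10^8 seconds with all-pairs evaluation and about 10^4 seconds with linear scaling, which is the motivation for cutoffs and fast Ewald methods.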

[Figure: two panels, "N^2 evaluation of energy & forces" and "N evaluation of energy & forces", for systems of 10,000, 100,000, and 1,000,000 atoms; x-axis: log(number of seconds on 1 teraflop machine)]

Ewald Sums:

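$V_{qq} = \sum_{i>j}^{N} \left[ \sum_{|\mathbf{n}|=0}^{\infty} q_i q_j \frac{\mathrm{erfc}(\kappa |\mathbf{r}_{ij} + \mathbf{n}|)}{|\mathbf{r}_{ij} + \mathbf{n}|} + \frac{1}{\pi L^3} \sum_{\mathbf{k} \neq 0} q_i q_j \frac{4\pi^2}{k^2} \exp(-k^2 / 4\kappa^2) \cos(\mathbf{k} \cdot \mathbf{r}_{ij}) \right] + V_{self}$

• Particle Mesh Ewald (N)

Spatial decomposition in r-space; parallelization of FFTs in k-space

• Evaluate full Ewald sum in r-space using FMM techniques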


Ab Initio Protein Structure Prediction

Primary sequence and an energy function → tertiary structure

Empirical energy functions: (1) Detailed, Atomic description: leads to enormous difficulties!

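$V_{MM} = \sum_{i}^{\text{bonds}} k_b (b_i - b_0)^2 + \sum_{i}^{\text{angles}} k_\theta (\theta_i - \theta_0)^2 + \sum_{i}^{\text{impropers}} k_\tau (\tau_i - \tau_0)^2 + \sum_{i}^{\text{dihedrals}} k_\phi [1 + \cos(n\phi + \delta)] + \sum_{i}^{\text{atoms}} \sum_{i<j}^{\text{atoms}} \left( \frac{q_i q_j}{r_{ij}} + \varepsilon_{ij} \left[ \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{12} - \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{6} \right] \right) + \sum_{i}^{\text{atoms}} \Delta\sigma A_i$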
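The individual terms of such an empirical energy function can be sketched as follows. This is a minimal illustration, assuming a single epsilon and sigma for every atom pair (the formula uses pair-specific values) and no unit-conversion constant on the Coulomb term; all function names are hypothetical.

```python
import math


def harmonic(k, value, reference):
    """Bond, angle, or improper term: k * (x - x0)^2."""
    return k * (value - reference) ** 2


def torsion(k_phi, n, phi, delta):
    """Dihedral term: k * [1 + cos(n*phi + delta)], angles in radians."""
    return k_phi * (1.0 + math.cos(n * phi + delta))


def nonbonded(coords, charges, epsilon, sigma):
    """Coulomb plus 12-6 term summed over all atom pairs i < j."""
    energy = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            r = math.dist(coords[i], coords[j])
            sr6 = (sigma / r) ** 6
            energy += charges[i] * charges[j] / r
            energy += epsilon * (sr6 * sr6 - sr6)
    return energy
```

The double loop in `nonbonded` is the all-pairs part of the evaluation; the bonded terms only grow linearly with the number of atoms.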

(1) Multiple minima problem is fierce

Find a way to effectively overcome the multiple minima problem

(2) Objective Functions: Replaceable algorithmic component?

Global energy minimum should be native structure, misfolds higher in energy



The Objective (Energy) Function

Empirical Protein Force Fields: AMBER, CHARMM, ECEPP "gas phase"

CATH protein classification: http://pdb.pdb.bnl.gov/bsmlcath

α-helical sequence / β-sheet structure; β-sheet sequence / α-helical structure

Energies the same! Makes energy minimization difficult!

Add penalty for exposing hydrophobic surface: favors more compact structures

E(native folds) < E(misfolds) for a few test cases

Solvent accessible surface area functions: Numerically difficult to use in optimization



Hydration Forces from Experiment/ Simulation and Optimization

[Figure: left panel, model pair correlation functions g_c(r) for gas, cluster, and aqueous cases; right panel, I(Q) from experiment compared with gas, cluster, and aqueous models; x-axis: Q / Å^-1]

Find model g_c(r) that best reproduces the excess experimental signal, I_c(Q)

W(r) is the "potential of mean force" between two hydrophobic solutes

(Feature Article, J. Phys. Chem., 1999)

V= AMBER + (predicted helices fixed) + W(r) like that from experiment

Global optimization can find no lower energy structures than crystal structures 1pou (72 aa), 3icb (77 aa), 2utg_A (70 aa), 3cln (145)



Neural Networks for 2° Structure Prediction

• Input units represent amino acid sequence

• Hidden units map sequence to structure

• Output units represent secondary structure class (helix, sheet, coil)

• Weights are optimizable variables that are trained on a database of proteins

Poorly designed networks result in overfitting, inadequate generalization to test set

Neural network design

input and output representation

number of hidden neurons

weight connection patterns that detect structural features
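The input, hidden, and output layers described above can be sketched as a forward pass. This is a minimal illustration, assuming a one-hot encoding of a 7-residue window, tanh hidden units, and a softmax output; the window size, layer sizes, and all function names are hypothetical and are not the design used in the work cited on these slides.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CLASSES = ("helix", "sheet", "coil")


def encode_window(sequence, center, half_width=3):
    """One-hot encode the residues around `center`; off-end positions stay zero."""
    vector = []
    for pos in range(center - half_width, center + half_width + 1):
        one_hot = [0.0] * len(AMINO_ACIDS)
        if 0 <= pos < len(sequence):
            one_hot[AMINO_ACIDS.index(sequence[pos])] = 1.0
        vector.extend(one_hot)
    return vector


def dense(inputs, weights, biases):
    """Weighted sum of inputs for each unit in a layer."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def predict(inputs, w_hidden, b_hidden, w_out, b_out):
    """Return class probabilities in the order given by CLASSES."""
    hidden = [math.tanh(h) for h in dense(inputs, w_hidden, b_hidden)]
    logits = dense(hidden, w_out, b_out)
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(n_out, n_in, rng):
    """Untrained weights; training would adjust these against a protein database."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out
```

The number of hidden units and which inputs each one connects to are exactly the design choices listed above; a fully connected layer like this one is the unconstrained baseline.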

Neural Network Results

No sequence homology through multiple alignments

                            Train               Test
Total predicted correctly   66%                 62.5%
Helix                       51%, Cα = 0.42      48%, Cα = 0.38
Sheet                       38%, Cβ = 0.39      28%, Cβ = 0.31
Coil                        82%, Cc = 0.36      84%, Cc = 0.35

Network with Design: Yu and Head-Gordon, Phys. Rev. E 1995

                            Train               Test
Total predicted correctly   67%                 66.5%
Helix                       66%, Cα = 0.52      64%, Cα = 0.48
Sheet                       63%, Cβ = 0.46      53%, Cβ = 0.43
Coil                        69%, Cc = 0.43      73%, Cc = 0.44
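Scores of this kind can be computed from a predicted and an observed secondary-structure string as sketched below. This assumes the C values are Matthews-type correlation coefficients computed per class, which is a common convention but is not stated on the slide; the function names are hypothetical.

```python
import math


def fraction_correct(predicted, observed):
    """Fraction of residues whose predicted class matches the observed one."""
    hits = sum(1 for p, o in zip(predicted, observed) if p == o)
    return hits / len(observed)


def matthews(predicted, observed, cls):
    """Correlation coefficient for one class, treating all others as negatives."""
    tp = fp = fn = tn = 0
    for p, o in zip(predicted, observed):
        if p == cls and o == cls:
            tp += 1
        elif p == cls:
            fp += 1
        elif o == cls:
            fn += 1
        else:
            tn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

A per-class coefficient penalizes both over- and under-prediction, so a network that labels almost everything as coil can have a high percentage for coil and still a low coefficient.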

Combine networks of Yu and Head-Gordon with multiple alignments


Neural Network Predictions As Soft Constraints In Local Optimization

Make neural network prediction of 2° structure for each amino acid

Network output: Helix (P_α, -1), Sheet (-1, P_β), Coil (-1, -1)

P_α = probability of being helix

P_β = probability of being sheet

Optimize on following energy surface:

V_bias = V_MM + V_φψ + V_HB

φ_0 and ψ_0 define perfect helix values; predictions define k_φ, k_ψ, and q_j

Using optimized structure from V_bias, optimize on V_MM (AMBER: unbiased objective function)

Neural Networks Used To Guide Global Optimization Methods

Generate expanded tree of configurations

Predicted coil residues: generate random, dissimilar sets of φ_0 and ψ_0

Explore tree configuration in depth:

Global Optimization in sub-space of coil residues: walk through barriers, move downhill


Neural Networks Used To Guide Stochastic Perturbation Algorithm

Stochastic/perturbation in sub-space of dihedral angles predicted to be coil

(1) Local minimization of a set of start points in sub-space

(2) Define a critical radius

$r_k = \pi^{-1/2} \left[ \Gamma\left(1 + \frac{n}{2}\right) V \sigma \frac{\log(kN)}{kN} \right]^{1/n}$: a measure of whether a point is within a basin of attraction

(3) Generate many sample points in sub-space volume, V

(4) Evaluate r.m.s. between new sample points and minimizers of (1)

If (r.m.s. < r k) ignore this sample point

(5) Minimize sample points not in any critical distance and merge into (1)

Choose new set of dihedral angles and repeat

Probabilistic theoretical guarantees of global optimum in sub-spaces

Global optimization by solving a successive series of global optima in sub-spaces?
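Steps (1) to (5) can be sketched on a toy two-variable energy surface. This is a minimal illustration under several assumptions: plain gradient descent stands in for the local minimizer, the critical radius is a fixed number rather than the shrinking r_k defined above, and the double-well `toy_energy` and all function names are hypothetical.

```python
import math
import random


def rms(a, b):
    """Root-mean-square difference between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def local_minimize(grad, start, step=0.05, n_iter=500):
    """Plain gradient descent, standing in for a real local minimizer."""
    point = list(start)
    for _ in range(n_iter):
        point = [x - step * g for x, g in zip(point, grad(point))]
    return point


def merge(minimizers, candidate, tol=1e-3):
    """Add candidate unless it duplicates a minimizer already found."""
    if all(rms(candidate, m) > tol for m in minimizers):
        minimizers.append(candidate)


def stochastic_search(f, grad, bounds, r_crit=0.5, n_start=5,
                      n_samples=50, n_rounds=3, seed=0):
    rng = random.Random(seed)

    def sample():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    minimizers = []
    for _ in range(n_start):                      # step (1)
        merge(minimizers, local_minimize(grad, sample()))
    for _ in range(n_rounds):
        for _ in range(n_samples):                # step (3)
            point = sample()
            # step (4): skip points inside a known basin (fixed radius assumed)
            if any(rms(point, m) < r_crit for m in minimizers):
                continue
            merge(minimizers, local_minimize(grad, point))  # step (5)
    return min(minimizers, key=f)


def toy_energy(p):
    """Double well in x with the left well slightly lower; harmonic in y."""
    x, y = p
    return (x * x - 1.0) ** 2 + y * y + 0.3 * x


def toy_gradient(p):
    x, y = p
    return [4.0 * x * (x * x - 1.0) + 0.3, 2.0 * y]
```

Calling `stochastic_search(toy_energy, toy_gradient, [(-2.0, 2.0), (-2.0, 2.0)])` returns the lower of the two wells. Skipping samples that fall within the critical distance of a known minimizer is what saves local minimizations relative to minimizing every sample.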



Hierarchical Parallel Implementation of Global Optimization Algorithm

Static vs. Dynamic Load Balancing of Tasks

Central Processor → GOPT1, GOPT2, GOPT3, GOPT4, GOPT5

Each GOPTi → workers Wi,1 ... Wi,n

Central Processor: Assigns starting coordinates to GOPT's

Task time is highly variable

GOPT's: Divide up sub-space into N regions for global search

Task time is variable

Workers: Generate sample points; find best minimizer in region

(Number of workers depends on sub-space)

Dynamical load balancing of tasks: reassigning GOPT/workers to GOPT/workers

Gain in efficiency of a factor of 5-10
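The difference between static and dynamic assignment of variable-length tasks can be sketched with a small makespan model. This is an illustration only, assuming independent tasks with known durations and a round-robin static schedule; it does not model the GOPT/worker hierarchy itself, and the function names are hypothetical.

```python
import heapq


def static_makespan(task_times, n_workers):
    """Tasks are dealt out round-robin before any work starts."""
    loads = [0.0] * n_workers
    for index, duration in enumerate(task_times):
        loads[index % n_workers] += duration
    return max(loads)


def dynamic_makespan(task_times, n_workers):
    """Whichever worker becomes idle first takes the next task in the queue."""
    finish_times = [0.0] * n_workers
    heapq.heapify(finish_times)
    for duration in task_times:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + duration)
    return max(finish_times)
```

With alternating long and short tasks, for example `[10, 1, 10, 1, 10, 1, 10, 1]` on two workers, the static schedule leaves one worker idle most of the time while the dynamic one keeps both busy. The size of the gain in a real run depends on how variable the task times are.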

Global Optimization Predictions of α-Helical Proteins

2utg_A: 70aa a-chain of uteroglobin: Crystal (left), Prediction (right)

R.M.S. 7.0 Å

Prediction (left) and crystal (right): R.M.S. 6.3 Å

Still have not reached crystal energy yet!

Simplified Models for Simulating Protein Folding

Simplifies the "real" energy surface topology sufficiently that you can do:

(1) Statistics ✓

Can do many trajectories to converge kinetics and thermodynamics

(2) Severe time-scale problem ✓

characterize full folding pathway: mechanism, kinetics, thermodynamics

(3) Proper treatment of long-ranged interactions ✓

all interactions are evaluated; no explicit electrostatics

(4) Robust objective function?

good comparison to experiments


α/β Protein Model Resembling IgG-binding Proteins L and G

+ Folding is highly cooperative, chain collapse accompanying folding.

+ Two parallel folding pathways:

One pathway contains an intermediate: protein G

One pathway contains no intermediates: protein L

+ Sequence mutations affecting secondary structure propensities

Similar to mutational experiments on Protein G & L

Same Hamiltonian can model all-β (SH3) and all-α proteins (four helix bundles)

Computational Complexity of Simplified Models for Protein Folding

Thermodynamics of the folding process are characterized using

multi-histogram method: complexity increases with multiple order parameters

constant-temperature Langevin simulations

Folding kinetics are characterized by tabulating

mean-first passage times, and temperature scans

One week using two Compaq/Dec EV10000 (~50 SPECfp95) per protein sequence

100,000 sequences for Human Genome; ample mutational study data



Acknowledgements for THG Research in Computational Biology at LBNL

Silvia Crivelli, Physical Biosciences and NERSC Divisions, LBNL

Betty Eskow, Richard Byrd, Bobby Schnabel, Dept. Computer Science, U. Colorado

Jon M. Sorenson, NSF Graduate Fellow, Dept. Chemistry UCB

Greg Hura, Graduate Group in Biophysics, UCB

Alan K. Soper, Rutherford Appleton Laboratory, UK

Alexander Pertsemlidis, Dept. of Biochemistry, U. Texas Southwestern Medical Center

Robert M. Glaeser, Mol. & Cell Biology, UCB and Life Sciences Division, LBNL

Funding Sources:

AFOSR, DOE (MICS), DOE/LDRD (LBNL), NIH, NERSC for cycles


Structure-Based Drug Discovery

Brian K. Shoichet, Ph.D

Northwestern University, Dept of MPBC

303 E. Chicago Ave, Chicago, IL 60611-3008

Nov 15, 1999


Problems in Structure-Based Inhibitor Discovery & Design

• Balance of forces in binding

• Energies in condensed phases

  ✓ interaction energies

  ✓ desolvation

• Problem scales badly with degrees of freedom

• Configuration

  ✓ configs ∝ (prot-features)^4 × (lig-features)^4

• Conformation

  ✓ Ligand & Protein, confs ∝ 3^(ligand bonds) × 3^(protein bonds)

• Sampling chemical space (scales very badly)

• Defining binding sites
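The two scaling rules above can be made concrete with a few lines of arithmetic. This is a minimal sketch that treats the proportionality constants as 1, which is an assumption; the function names are hypothetical.

```python
def configuration_count(protein_features, ligand_features):
    """Configurations grow as the fourth power of each feature count."""
    return protein_features ** 4 * ligand_features ** 4


def conformation_count(ligand_bonds, protein_bonds, states_per_bond=3):
    """Three rotamer states per rotatable bond, ligand and protein combined."""
    return states_per_bond ** ligand_bonds * states_per_bond ** protein_bonds
```

Three rotatable ligand bonds with a rigid protein give 27 conformations; allowing just four protein bonds to rotate as well raises that to 2187, which is why receptor flexibility is so expensive to sample.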


The Pros & Cons of Proteins

18-Crown-6

sulfate binding protein

Conserved Residues, Ordered Structure, Function Unknown


Inhibitor Discovery or Design?

• Design ligands

• Ludi (Bohm)

• Grow (Moon & Howe)

• Builder (Roe & Kuntz)

• MCSS-Hook (Miranker & Karplus)

• SMOG (DeWitte & Shakhnovich)

• Others ...

• Discover ligands

• DOCK (Kuntz et al., Shoichet)

• CAVEAT (Bartlett)

• Monte Carlo (Hart & Read)

• AutoDock (Goodsell & Olson)

• SPECITOPE (Kuhn et al)

• Others ...



Screening Databases by Molecular Docking

Dock into site

Calculate energies

Test high-scoring molecules

Structure determination

New inhibitor design

© Chemistry & Biology, 1996

Database Screening Using DOCK

Database of commercially available small molecules

~200,000 compounds

Each molecule is fit into the binding site in multiple orientations.

Multiple conformations of each ligand are considered.

Each orientation is evaluated for complementarity, using van der Waals and electrostatic interaction energies.

Solvation energies are subtracted.

The inhibition constants of the best-fitting molecules are established in an enzyme assay.

Inhibitor-receptor complex structures are determined.

New interactions with the enzyme are targeted.


Novel Ligand Discovery Using Molecular Docking

(unpublished)

De novo structure prediction: BLIP/TEM-1

Ligand Flexibility: Conformational Ensembles

Generate an ensemble; dock it into the site

Conformational Ensembles vs. Brute Force

[Figure: log-scale axis from 10 to 100,000; enzyme receptors: DHFR, TS, LDH]

Database Docking

                                   Number of Confs   Number of Comps   Time (hrs.)
Single Conformation Database
  Complexed DHFR                             5,761             5,761          0.58
  Uncomplexed DHFR                           5,761             5,761          1.40
  Complexed TS                                 281               281          0.31
  Uncomplexed TS                               281               281          0.51
Multi Conformation Database
  Complexed DHFR                           867,822             5,656          0.94
  Uncomplexed DHFR                         867,822             5,656          2.96
  Complexed TS                              88,487               263          0.27
  Uncomplexed TS                            88,487               263          0.18
Full Multi Conformation Database
  Complexed DHFR                        33,717,639           115,349         26.50
  Complexed TS                          33,715,748           117,240         80.90


Trypsin TEM-1

Known ligand results

Score (kcal/mol)   RMS (Å)   Rank in Database
  91.9             8.32      16.09%
  -8.3             3.67      97.15%
 -12.5             1.20      99.33%
  -7.4             1.34      98.83%
 -89.2             0.77      99.62%
 -31.5             2.71      99.24%
 -12.5             1.20      99.72%
 -89.2             0.77      99.93%


Hierarchical Docking

Flexible docking: 27 confs × 3 atoms = 81 atom positions

Hierarchical docking: 27 confs, 3C + 3A + 9B = 15 atom positions


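Correcting for Ligand Solvation Energies

$\Delta G_{bind} = \Delta G_{interact} - \Delta G_{solv,L} - \Delta G_{solv,R}$

$\Delta G_{interact} = \sum_i (q_i P_i + v_i P_v)$

$\Delta G_{elec,solv} = \frac{q^2}{2r} \left( \frac{1}{D_o} - \frac{1}{D_w} \right) = \frac{1/D_o - 1/D_w}{2r} \sum \sum Q_i \, \delta q_j$

$\Delta G_{np} = -621.48 - 25.890 \times \text{area}$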
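The single-charge form of the electrostatic solvation term can be evaluated as sketched below. This assumes charges in units of the electron charge, radii in Å, and the usual conversion factor of about 332.06 kcal Å / (mol e^2); the slide does not state its units, and the function name and the dielectric values in the example are hypothetical.

```python
COULOMB_KCAL = 332.06  # kcal * Angstrom / (mol * e^2), standard conversion factor


def born_transfer_energy(charge, radius, d_low, d_water):
    """(q^2 / 2r) * (1/D_o - 1/D_w), in kcal/mol under the stated unit assumptions."""
    return COULOMB_KCAL * charge ** 2 / (2.0 * radius) * (1.0 / d_low - 1.0 / d_water)
```

For a unit charge of radius 2 Å moved from water (dielectric 78.5) into a medium of dielectric 2, `born_transfer_energy(1.0, 2.0, 2.0, 78.5)` comes to roughly 40 kcal/mol. Because the term grows with the square of the charge, highly charged database molecules are penalized most when this correction is subtracted from the interaction energy.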


Solvation Corrections: Thymidylate Synthase Screen

[Figure: two log-scale panels; x-axis: Net Charge on Molecule]

Solvation Corrections: DHFR Screen

[Figure: two log-scale panels; x-axis: Net Charge on Molecule]

Solvation Corrections: Benzene Cavity Screen

[Figure: two log-scale panels]

Hit Rates

Unmet Challenges

• Better Scoring

• context dependent desolvation

• receptor desolvation

• better force-fields

• Receptor Flexibility

• Combinatorial Chemistry

• This work supported by the NIH, Genetics Institute, and Procter & Gamble


Cellular Network Analysis

Adam Arkin

Physical Biosciences


Bioengineering and Chemistry

University of California, Berkeley

11/15/99

Engineering of Cellular Circuitry

Asynchronous Digital Telephone Switching Circuit

Full knowledge of parts list

Full knowledge of "device physics"

Full knowledge of interactions

No one fully understands how this circuit works!! It's just too complicated.

Designed and prototyped on a computer (SPICE analysis)

Experimental implementation fault tested on computer

Asynchronous Analog Biological Switching Circuit

Partial knowledge of parts list

Partial knowledge of "device physics"

Partial knowledge of interactions

No one fully understands how this circuit works!! It's just too complicated.

We need a SPICE-like analysis for biological systems


Analysis of Cell Function

The challenge is to integrate data from all levels to produce a description of cellular function.

There are challenges in:

Systematization and structuring of data

Serving and querying this data

Representing the data

Building multiscale, multiresolution models

Dynamic and static analysis of these models

Pay-off in:

Industrial bioengineering

Rational pharmaceutical design

Basic biological understanding


Complexities of Cellular Function


Spatiotemporally resolved pictures of developmental processes take up Gigs of storage.

Analyses take days to weeks.

Models are in early days.

Each of those little bright spots contains networks vastly more complicated than those on the last slide!


Heterogeneity of Data

Data are:

1) Qualitative → Quantitative

2) Collected at many levels

3) Of heterogeneous structure

4) Of heterogeneous availability

Challenge:

Optimal use of available data to make predictions about cell function and failure.


Tools for "multilevel" analysis

Cellular networks

Physical properties

Finding Parts

Why now?

- Genome projects are providing a large (but partial) list of parts

- New measurement technologies are helping to identify further components, their interactions, and timings
  - Gene microarrays
  - Two-hybrid library screens
  - High-throughput capillary electrophoresis arrays for DNA, proteins and metabolites
  - Fluorescent confocal imaging of live biological specimens
  - High-throughput protein structure determination

- Data is being compiled, systematized, and served at an unprecedented rate
  - Growth of GenBank and PDB > polynomial
  - Proliferation of databases of everything from sequence to confocal images to literature

- The tools for analyzing these various sorts of data are also multiplying at an astounding rate


SPICE Tools for Biology?

Bio/Spice: A Web-servable, biologist-friendly database, analysis, and simulation interface was developed into a true beta product.

Interfaces to ReactDB, MechDB, and ParamDB.

With Kernel, performs basic flux-balance analysis, stochastic and deterministic kinetics, and scientific visualization of results.

Notebook/Kernel design optimized for distributed computing.


Components of Bio/Spice

An Example of "Device Physics"

Exponential distribution of intertranscript times

Successive competitions between RNase and ribosomes: geometric distribution of number of proteins per transcript

Simulation methodology for full-up simulation of chemical Markov-Process scales exponentially with number of reactions
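The two distributions named above can be sampled directly, as in the sketch below. This is a minimal illustration of bursty protein production for a single gene, assuming a constant transcription rate and a fixed probability that a ribosome wins each competition with RNase; it is not the Bio/Spice kernel, and the function name and parameter values are hypothetical.

```python
import random


def simulate_bursts(transcription_rate, p_ribosome, t_end, rng):
    """Return (time, proteins made) for each transcript produced before t_end."""
    events = []
    t = 0.0
    while True:
        # Exponentially distributed waiting time until the next transcript.
        t += rng.expovariate(transcription_rate)
        if t > t_end:
            break
        # The ribosome wins with probability p_ribosome each round; the first
        # RNase win degrades the transcript, giving a geometric protein count.
        proteins = 0
        while rng.random() < p_ribosome:
            proteins += 1
        events.append((t, proteins))
    return events
```

With `p_ribosome = 0.9` the mean burst is 0.9 / 0.1 = 9 proteins per transcript, but individual transcripts often yield none or several times the mean, which is the kind of variability a deterministic rate equation averages away.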

[Figure: simulated time course; x-axis: Time (minutes)]

Complexities of Cellular Function

This is approximately 1/3 of just the initiation of the sporulation program from Bacillus subtilis.

There are over 100 proteins, 40 genes, 300 reactions for which data is available.

The total data on just this process is tens of Gigs and it is incomplete. Microarray and microscope data are added at 100 Megs per week. Model builders need to query this data and arrange it for simulation. Simulations must be run under many different conditions and hypotheses.


The Need for Advanced Computing

Data Handling: The total data necessary for network analysis is huge. By nature it will be distributed and heterogeneous. We need:
- Database standard and new query types
- Means of secure, fast transmission of information
- Means of quality control on data input

Tool Integration:
- Centralization of computational biology tools and standards
- Ability to use tools together to generate good network hypotheses
- Good quality ratings on tool outputs

Advanced Simulation Tools:
- Fast, distributed algorithms for dynamical simulation
- Mixed mode systems (differential, Markov, algebraic, logical)
- Spatially distributed systems
