Protein structure prediction methods for drug design

Thomas Lengauer and Ralf Zimmer

Date received (in revised form): 4th July 2000

Abstract

Thomas Lengauer, Institute for Algorithms and Scientific Computing (SCAI), GMD – National Research Center for Information Technology, Sankt Augustin, Germany D53754.

Tel: +49 2241 14 2776/2777 Fax: +49 2241 14 2656 E-mail: [email protected]

INTRODUCTION

The long path from genomic data to a new drug can conceptually be divided into two parts (see left side of Figure 1). The first task is to select a target protein whose molecular function is to be moderated, in many cases blocked, by a drug molecule binding to it. Given the target protein, the second task is to select a suitable drug that binds to the protein tightly, is easy to synthesise, is bio-accessible and has no adverse effects such as toxicity. The knowledge of the three-dimensional structure of a protein can be of significant help in both phases. The steric and physicochemical complementarity of the binding site of the protein and the drug molecule is an important, if not the dominating, feature of strong binding. Thus, in many cases, the knowledge of the protein structure affords well-founded hypotheses of the function of the protein. If the structure of the relevant binding site of the protein is known in detail, we can even start to employ structure-based methods in order to develop a drug binding tightly to the protein.

In this paper, bioinformatics methods for predicting aspects of protein structure are described and their use towards the goal of drug design is discussed. The possibilities and limitations of using protein structure knowledge towards the goal of developing new drug therapies are also discussed.

NOTIONS OF PROTEIN FUNCTION

The increased accessibility of genomic data and, especially, that of large-scale expression data has opened new possibilities for the search for target proteins. This development has prompted large-scale investments into the new technology by many pharmaceutical companies. The respective screening experiments rely critically on appropriate bioinformatics support for interpreting the generated data. Specifically, methods are required to identify interesting differentially expressed genes and to predict the function and structure of putative target proteins from differential expression data generated in an appropriate screening experiment.

Protein function is a colourful notion whose meaning can range over several levels:

● a very general classification (globular, enzyme, hormone, structural protein, viral capsid protein, transmembrane protein, etc.);

● biochemical function (biochemical reaction, enzyme specificity, binding partners, cofactors);

● classification via broad cellular function (interaction with DNA and other proteins, cellular localisation);

08-lengauer.p65 9/19/00, 1:49 PM275

● broad phenotypic function (changes observed for organisms with deleted or mutated genes);

● identification of detailed physiological function such as the localisation in a metabolic or regulatory pathway and the associated cellular role of the protein;

● identification of molecular binding partners and their mode of interaction with the protein.

The derivation of protein function from protein sequence by theoretical means is commonly performed by transferring functional information from related proteins (eg from other organisms). Usually the transfer is from proteins whose function has been established with experimental evidence. The establishment of the relevant protein relationship based on sequence is complicated by some subtleties of evolutionary processes.

It is often true that organisms share related proteins with similar sequence, similar structure and the same function simply because the proteins originate from a common ancestor and still fulfil their role within the cellular processes. However, mutations occur independently after speciation events. Depending on the extent of the evolutionary changes, the recognition of homology or orthology among proteins can be difficult, but even in these cases consistent evidence for relatedness should be expected on the sequence, structure and function levels.

Sometimes, the situation is complicated by gene duplications within a species, which lead to paralogous copies of the same gene. These paralogous copies are subject to evolutionary changes, and the evolutionary pressure on structure or function is much relaxed for all but one copy, which still serves the original purpose, so that greater deviations in sequence, structure and function occur for the other copies. Because considerable sequence similarity (ie significantly more than random) can still be observed among paralogous proteins, functions can be transferred erroneously to proteins that are already functionally disabled or functionally completely different.

[Figure 1: image and caption not recoverable from the source. Labels in the figure: stages IDENTIFY, SEARCH, MODEL, DESIGN. Target protein search: Genome/Organism/Disease; Expression, Phenotype, Genotype; SNPs, Linkage, Mutations; Evolutionary Information; Families, Para-/Analogs; Sequence Motifs, Fusion, Co-Expression, Co-Evolution; Target Protein; Target Protein Function; Target Protein Structure. Drug lead search: Assay/Screening; HTS Trial & Error; Rational Drug Design; Combinatorial Libraries; Computer HTS; Docking; Ligand Design; Target Lead Structure / Drug.]


Therefore, we have to distinguish between several notions in what follows. Similarity is a quantitative measure on the sequence, structure or function level. Homology is used when there is a clear established or potential (assumed, predicted) evolutionary relationship between proteins. The term orthology, in addition, indicates homologous proteins with the same, or at least a similar, function (established or potential). The notion of paralogy, in contrast, is used when homologous proteins are expected to have evolved enough to expect changes in function (with or without a change in 3D structure).

For drug design, we need to know more of the function of the protein than follows from just a general classification. It would be best both to know natural binding partners and to have a detailed structural model of the binding sites of the protein.

METHODS OF PREDICTING PROTEIN FUNCTION

There are a number of ways to predict protein function from sequence. Most of them are based on sequence similarity. A large database of protein sequences is screened for ‘model sequences’ that exhibit a high level of similarity to the query protein sequence. Sequence alignment tools such as BLAST [1] and PSI-BLAST [2] are the work-horses of such analyses. If one or more model sequences are found that exhibit a sufficiently high level of similarity to the query sequence and about whose function we have some knowledge, then conclusions may be possible on the function of the query sequence. If the sequence similarity is above, say, 40 per cent and functionally important motifs are conserved, then we can hypothesise that the query sequence has a function that is quite similar to that of the model sequence. As the level of similarity decreases, the conclusions on function that can be drawn from sequence similarity become less and less reliable. Classifications of proteins into families that form clusters of structurally or functionally related proteins are helpful in the prediction of protein function in these cases. There are several protein classifications available on the internet that can serve for this purpose:

● COGs [3,4]

● ProDom [5]

● Pfam [6]

● SMART [7,8]

● PRINTS [9]

● Blocks+ [10]

● ProtoMap [11,12]

A number of these databases (Pfam, PROSITE, PRINTS, ProDom, SWISSPROT+TREMBL) are currently being united in the InterPro [13] database. Since protein function is basically tied to protein domains, protein domain analysis is an integral part of the methodology that leads to protein family databases [14–22].

Since only 20–40 per cent of the protein sequences in a genome such as that of Mycoplasma genitalium, Methanococcus jannaschii or Mycobacterium tuberculosis have significant sequence similarity to proteins of known function [23,24], we need to be able to draw conclusions on the function of proteins that exhibit no significant sequence similarity to suitable model proteins. As the similarity between query sequence and model sequence decreases below a threshold of, say, 25 per cent, safe conclusions on a common evolutionary origin of the query sequence and the model sequence can no longer be made. However, it turns out that, in many cases, the protein fold can still be reliably predicted, and in several cases even detailed structural models of protein binding sites can be generated. Thus, especially in this similarity range, protein structure prediction, again together with the identification of conserved sequence or spatial motifs, can help to ascertain aspects of protein function.
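The approximate thresholds quoted above (about 40 per cent and about 25 per cent) can be read as a rough triage rule. The following Python sketch is illustrative only. The function names are hypothetical, identity is counted over the gap-free columns of a pairwise alignment (one of several conventions in use, and an assumption here), and the cut-offs are simply the rounded figures from the text rather than calibrated values.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over the gap-free columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns in which neither sequence has a gap.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not columns:
        return 0.0
    identical = sum(1 for a, b in columns if a == b)
    return 100.0 * identical / len(columns)


def triage(identity: float) -> str:
    """Map an identity value onto the regimes described in the text (approximate cut-offs)."""
    if identity >= 40.0:
        return "function transfer plausible if functional motifs are conserved"
    if identity >= 25.0:
        return "function transfer increasingly unreliable"
    return "no safe conclusion from sequence alone; consider fold prediction and motifs"


if __name__ == "__main__":
    # Toy alignment, invented for illustration.
    pid = percent_identity("MKT-AYIAK", "MKTLAYLAR")
    print(f"{pid:.1f} per cent identity: {triage(pid)}")
```

In practice the alignment itself would come from a tool such as BLAST or PSI-BLAST; the sketch only covers the bookkeeping after an alignment is available.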

Other sources of information besides sequence similarity have been explored in order to gain insight into protein function. These methods are represented by five arrows pointing downwards in the top right part of Figure 1. The following comments on these methods apply in the order from left to right:

● Sequence alignment has long been used for ascertaining protein function. This is the standard method and we commented on it above. This approach is only reliable if sequence similarity is high enough that we can argue the proteins are orthologous and the function of one of the proteins is known.

● Recently, the Rosetta stone method has been introduced. This method uses over 20 completely sequenced genomes and analyses evolutionary correlations of two domains being fused into one protein in one species and occurring in separate proteins in another species. From these correlations the method establishes pairwise links between functionally related proteins [25] and elicits putative protein–protein interactions [26].

● For the same purpose, the phylogenetic profile method analyses the co-occurrence of genes in the genomes of different organisms [27].

● The analysis of change of phenotype based on mutated genes (eg by knock-out experiments) yields important information on aspects of protein function [28–30].

● In the future, the analysis of genetic variations [31] among individuals, eg single nucleotide polymorphisms (SNPs) [32–34], will be helpful in ascertaining protein function beyond mere disease linkage or association (right arrow in Figure 1) [35–38].
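As an illustration of the phylogenetic profile idea in the list above, the following Python sketch represents each protein by a presence/absence vector over a set of genomes and links proteins whose vectors agree. It is a toy version under stated assumptions: the protein names and profiles are invented, and the use of a simple mismatch count (Hamming distance) with a cut-off is an assumption, not the published method's scoring.

```python
from itertools import combinations


def mismatches(profile_a, profile_b):
    """Number of genomes in which exactly one of the two proteins is present."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


def linked_pairs(profiles, max_mismatches=0):
    """Return protein pairs whose presence/absence profiles differ in at most max_mismatches genomes."""
    pairs = []
    for (name_a, prof_a), (name_b, prof_b) in combinations(sorted(profiles.items()), 2):
        if mismatches(prof_a, prof_b) <= max_mismatches:
            pairs.append((name_a, name_b))
    return pairs


# Hypothetical profiles over five genomes: 1 = gene present, 0 = gene absent.
PROFILES = {
    "protA": (1, 1, 0, 1, 0),
    "protB": (1, 1, 0, 1, 0),
    "protC": (0, 1, 1, 0, 1),
    "protD": (1, 1, 0, 1, 1),
}

if __name__ == "__main__":
    print(linked_pairs(PROFILES))                    # identical profiles only
    print(linked_pairs(PROFILES, max_mismatches=1))  # tolerate one differing genome
```

Profiles that are present in all genomes or absent from all genomes carry no information and would need to be filtered out before such a comparison is meaningful.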

None of these methods looks at protein structures, and thus we do not discuss them in more detail here. While these methods are reported to generate significant insight into protein function on a higher level and to point to putative target proteins [39], in the end, drug design can be expected to necessitate structural knowledge of either the target protein or its binding partners.

METHODS FOR PREDICTING PROTEIN STRUCTURE

In the authors’ view, computational methods for predicting protein structure from sequence alone are still well out of range, although there are recent methodical advances, sometimes called mini-threading, that are based on the assembly of fragments (see eg ROSETTA [40]). In contrast, modelling protein structures after folds that have been seen before has become quite a powerful method for protein structure prediction. Here, the query sequence is aligned (threaded) to a model sequence whose three-dimensional structure is known (the template protein). All proteins in a given protein structure database (usually an appropriate representative set of structures) are tried, and each template is ranked using heuristic scoring functions. The score reflects the likelihood that the query sequence assumes the template structure. The approach of modelling a protein structure after a known template is called homology-based modelling, and the selection of a suitable template protein is often done via protein threading.

Protein threading has three major objectives: first, to provide orthogonal evidence of possible homology for distantly related protein sequences; second, to detect possible homology in cases where sequence methods fail; and third, to improve structural models for the query sequence via structurally more accurate alignments.
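The template-ranking step described above (align the query to every template, score each alignment, sort by score) can be sketched in a few lines of Python. This is a minimal sketch, not the scoring scheme of any threading program: the two environment classes, their residue preferences, the gap penalty and the template names are all invented for illustration, and the alignment is a plain global dynamic-programming alignment of a sequence against position-specific scores.

```python
# Hypothetical residue preferences for two structural environments (illustrative values).
BURIED = {"L": 2.0, "V": 2.0, "I": 2.0, "A": 1.0}
EXPOSED = {"K": 2.0, "E": 2.0, "D": 2.0, "S": 1.0}
MISMATCH = -1.0  # score for a residue not listed for the environment


def thread_score(query, template, gap=-2.0):
    """Best global alignment score of a query sequence against a template,
    where the template is a list of per-position residue preference tables."""
    n, m = len(query), len(template)
    prev = [j * gap for j in range(m + 1)]           # row for the empty query prefix
    for i in range(1, n + 1):
        curr = [i * gap] + [0.0] * m
        for j in range(1, m + 1):
            match = prev[j - 1] + template[j - 1].get(query[i - 1], MISMATCH)
            curr[j] = max(match,                     # align residue i to position j
                          prev[j] + gap,             # residue i unaligned
                          curr[j - 1] + gap)         # position j unaligned
        prev = curr
    return prev[m]


def rank_templates(query, library):
    """Score the query against every template and return (score, name) pairs, best first."""
    return sorted(((thread_score(query, t), name) for name, t in library.items()), reverse=True)


# Toy template library: each template is a sequence of environments.
LIBRARY = {
    "tmplA": [BURIED, EXPOSED, BURIED, EXPOSED],
    "tmplB": [EXPOSED, EXPOSED, BURIED, BURIED],
}

if __name__ == "__main__":
    for score, name in rank_templates("LKVE", LIBRARY):
        print(name, score)
```

Real threading methods differ mainly in what the per-position score contains (profiles, contact capacities, pair potentials); the outer loop over a template library and the final ranking have the shape shown here.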

There are several successful protein threading methods, including:

● methods based on hidden Markov models [41–48];

● dynamic programming methods based on profiles [49–51];

● environment compatibility (ie contact capacity potentials as used in the protein threader 123D) [52].

These programs are very fast. A mid-size protein sequence can be threaded against a database of about 1,500 protein structures in a few minutes on a PC or workstation. However, the underlying methods assume that the assignment of chemical properties to spatial regions in the protein is the same in the query protein and the template protein. This is not the case in practice, especially if one compares proteins with partly different folds or different functions. Extensions of the homology-based modelling approach to proteins with very similar protein structures but different chemical make-up require the solution of provably hard algorithmic problems and thus necessitate much more computing time. There are:

● heuristic approaches based on distance-based pair potentials of mean force [53–56];

● optimal or approximate combinatorial tree search techniques [57–59].

Such approaches need hours to thread a protein through a database of 1,500 templates. However, they can yield more accurate alignments and models of binding pockets of proteins.

The process of protein threading selects a suitable template protein for a protein query sequence and computes an alignment of the backbone of the two proteins that is the starting point for generating a structural model for the query protein based on the structure of the template protein. What is left is to place the side chains of the query protein and to model the loops of the query protein that are not modelled by the template structure. These two tasks are performed by homology-based modelling tools such as:

● Modeller [60–64] and ModBase [65];

● Swiss-Model [66,67];

● or commercial versions included in Quanta (MSI) or Sybyl (Tripos, Inc.).

For protein side-chain modelling there are two contrasting approaches: one based on knowledge deduced from structural databases, the other on methods such as energy minimisation and molecular dynamics [68]. Methods based on side-chain rotamer libraries that have been created via the analysis of the protein structure database are usually employed to get a first model. Energy minimisation or molecular dynamics [69] is often used to refine the model. Such methods have been in use for crystallography/nuclear magnetic resonance (NMR) for many years and are available in several program packages and tools (CHARMM [70], GROMOS/GROMACS [71,72] and many others [73,74]). In general these methods are quite computer-intensive and can only be exercised on one or a few proteins. Generally, the backbone alignment is an input to homology-based modelling tools, and the quality of the derived models is highly sensitive to the accuracy of the provided alignments.

Loops are modelled by a related host of methods. Loops that involve more than about five residues are still hard to model [75–78].

The evaluation of the accuracy of assigning a protein fold (general protein architecture) to a query sequence is commonly based on generally accepted fold classifications such as SCOP [79] or CATH [80]. The quality of backbone alignments is much harder to rate, and no generally accepted scheme is available as of today [81–84]. Rating the quality of protein structure models is generally based on the root mean square (rms) deviation between the model and the actual structure on a selected set of residues. The problem here is that the model must be superposed with the actual structure. There are several tools that perform this task (DALI/FSSP [85,86], SSAP [87], VAST [88], PROSUP [89] or SARF [90]), and they can yield different results. Thus, there is no accepted gold standard for protein structure superposition. However, for the purpose of rating the structures of target proteins, the available superposition methods are sufficient.
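The rms deviation mentioned above is straightforward to compute once the two coordinate sets are in the same frame. The Python sketch below assumes that the superposition has already been done by an external tool and that both lists contain the same selected residues in the same order. As a contrast, it also shows a distance-based variant that compares internal pairwise distances and therefore needs no superposition; that variant is not discussed in the text and is included here only as an illustration. All function names are hypothetical.

```python
import math
from itertools import combinations


def rmsd(model, actual):
    """Root mean square deviation between two equally long lists of (x, y, z) points.
    Assumes the model has already been superposed onto the actual structure."""
    if len(model) != len(actual) or not model:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum((a - b) ** 2 for p, q in zip(model, actual) for a, b in zip(p, q))
    return math.sqrt(squared / len(model))


def distance_rmsd(model, actual):
    """Superposition-free variant: rms difference of all intra-structure pairwise distances."""
    if len(model) != len(actual) or len(model) < 2:
        raise ValueError("need two coordinate lists of equal length with at least two points")
    pairs = list(combinations(range(len(model)), 2))
    squared = sum(
        (math.dist(model[i], model[j]) - math.dist(actual[i], actual[j])) ** 2
        for i, j in pairs
    )
    return math.sqrt(squared / len(pairs))


if __name__ == "__main__":
    # Toy coordinates: the "actual" structure is the model shifted by 1 unit along z.
    model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
    actual = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (1.0, 1.0, 1.0)]
    print(rmsd(model, actual))           # nonzero, because the frames differ
    print(distance_rmsd(model, actual))  # zero, because the shapes are identical
```

The toy example shows why superposition matters: two identical shapes in different frames give a nonzero coordinate rms deviation, so the reported value depends on how the superposition tool placed the model.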

PERFORMANCE OF PROTEIN STRUCTURE PREDICTION METHODS

There are strong efforts to render the quality of protein structure prediction methods more transparent and easier to evaluate. The centre of these efforts is the biennial CASP experiment, which rates protein structure prediction methods on blind predictions and aims at developing standardised and generally agreed upon assessment procedures both for fold identification and for the evaluation of alignment accuracy and homology models. A blind prediction is a prediction of the three-dimensional structure for a protein sequence made at a time when the actual structure of the protein is not yet known. After the structure has been resolved, the prediction is compared with the actual structure. There have been three issues of the CASP experiment [91]; the fourth one follows this year. The CASP experiment has been a significant help in providing a more solid basis for assessing the power of different protein structure prediction methods.

For fold recognition, detectable progress has been observed from CASP1 to CASP2. In CASP3, performance similar to that in CASP2 was achieved on more difficult targets. There appears to be a certain limit of current fold recognition methods, which is still well below the limit of detectable structural similarity (via structural comparisons). In addition, in CASP3 several groups produced reasonable models of up to 60 residues for ab initio target fragments.

Of the 43 protein targets in CASP3, 15 could be classified as comparative homology modelling targets, ie related folds and accompanying alignments could be derived beyond doubt. For more than half of the 21 more difficult cases, reasonable models could be predicted by at least one of the participating prediction teams. In addition, the CAFASP subsection of the assessment has demonstrated that 10 out of 19 folds could be solved via completely automatic application of the best threading methods without any manual intervention.

Methods for refining rough structural models towards the true native structure of the query protein are also not straightforward. This is an active area of research [92].

A combination of protein threading followed by homology-based modelling cannot create genuinely novel protein structures. But it turns out to be quite sensitive in creating structure models based on known folds. Models that have been reasonably accurate (eg down to 1.4 Å for some 60 amino acids of the active site of herpes virus thymidine kinase [93]) have been reported in blind studies of proteins with a sequence identity to the template protein of as low as 10 per cent. Correct folds can be assigned in many cases, even if the query sequence and the suitable template exhibit a very low level of sequence similarity (down to 5 per cent, ie far below the level of random sequence similarity of 17–18 per cent in optimal alignments).

STRUCTURAL GENOMICS

The goal of structural genomics projects is to solve experimental structures of all major classes of protein folds systematically, independent of any functional interest in the proteins [94,95]. The aim is to chart the protein structure space efficiently; functional annotations and/or assignments are made afterwards. This requires a carefully thought-out strategy of mixing experimental protein structure determination, eg via X-ray, with computer-based protein structure prediction. The experiments have to yield novel protein structures. The proteins to be resolved experimentally are again selected by computer. The computer part deduces the remaining structures based on homology-based modelling and protein threading. One goal of the overall structural genomics endeavour is to have an experimentally resolved protein structure within a certain structural distance to any possible protein sequence, which allows for computing reliable models for all protein sequences.

Once a map of the protein structure space is available, this knowledge should provide additional insights on what the function of the protein in the cell is and with what other partners it might interact. Such information should add to information gained from high-throughput screening and biological assays. So far, glimpses of what will be possible could be obtained by analysing complete genomes or large sets of proteins from expression experiments with the structural knowledge available today, ie more or less complete representative sets and a quite coarse coverage of structure space [63,96,97].

METHODS FOR PREDICTING PROTEIN FUNCTION FROM PROTEIN STRUCTURE

Aspects of protein structure that are useful for drug design studies typically have to involve three-dimensional structure. Predicting the secondary structure of the protein is not sufficient. Even the similarity of the three-dimensional structures of two proteins cannot be taken as an indication for a similar function of these proteins. The reason is that protein structure is conserved much more than protein function. Indeed, protein folds such as the TIM barrel (triose-phosphate isomerase) are quite ubiquitous and can be considered as general scaffolds that lend molecular stability to the protein and are not directly tied to its function. In contrast, the molecular function of the protein is tied to local structural characteristics pertaining to binding pockets on the protein surface. These characteristics are imprinted onto the protein structure by specific patterns of amino acid side chains that make up the binding pocket. The conservation of these amino acids is what makes two proteins have the same function. Since nature varies sequence quite flexibly, this level of conservation is only maintained among orthologous proteins that exhibit a high level of sequence similarity.

Thus, if the template protein from which we predict protein structure is not orthologous to the query protein, other methods of function prediction have to come to bear. It is quite natural to consider conservation patterns in the protein sequence here, as captured in databases of functional sequence motifs such as PROSITE. An alternative that has been investigated more recently is to analyse conservation in 3D space [98]. Experience shows that such ‘structural’ motifs provide more information than motifs derived purely from sequence, even if the sequence motifs are distributed over several regions (BLOCKS+, PRINTS). Recently, the notion of an approximate structural motif has been introduced, sometimes called fuzzy functional form (FFF) [99]. Using a library of approximate structural motifs enhances the range of applicability of motif search at the price of reduced sensitivity and specificity. Such approaches are supported by the fact that, often, binding sites of proteins are much more conserved than the overall protein structure (eg bacterial and eukaryotic serine proteases), such that an inexact model can have an accurately modelled part responsible for function. As the structural genomics projects produce a more and more complete picture of the protein structure space, comprehensive libraries of highly discriminative structural motifs can be expected.

The relationship between structure and function is a true many-to-many relation. Recent studies have shown that particular functions could be mounted onto several different protein folds [100] and, conversely, several protein fold classes can perform a wide range of functions [101]. This limits our ability to deduce function from structure. But knowledge of which folds support a given function and which functions are based on a given fold can still help in predicting function from structure. In addition, local structural templates such as FFFs that are indicative of a particular function can identify similar sites and the associated function despite a globally different fold. Such 3D patterns can also discriminate among globally similar folds, according to whether they contain particular conserved 3D functional motifs, in order to classify them into different functional categories.

Though it is not easy to derive functions from resolved protein structures, the availability of structural information improves the chances compared with relying on sequence methods alone.

METHODS FOR DEVELOPING DRUGS BASED ON PROTEIN STRUCTURE

The object of drug design is to find or develop a (mostly small) drug molecule that binds tightly to the target protein, moderating (often blocking) its function or competing with natural substrates of the protein. Such a drug can be best found on the basis of knowledge of the protein structure. If the spatial shape of the site of the protein to which the drug is supposed to bind is known, then docking methods can be applied to select suitable lead compounds that have the potential of being refined to drugs. The speed of a docking method determines whether the method can be employed for screening compound databases in the search for drug leads. A docking method that takes a minute per instance can be used to screen up to thousands of compounds on a PC or hundreds of thousands of compounds on a suitable parallel computer. Docking methods that take the better part of an hour cannot be suitably employed for such large-scale screening purposes. In order to screen really large drug databases with several hundred thousand compounds, docking methods that can handle single protein/drug pairs within seconds are needed.
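The throughput figures above follow from simple arithmetic, which the following Python sketch makes explicit. It assumes perfect parallel scaling and no overhead, and the processor count of 128 is an arbitrary example rather than a figure from the text; the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def compounds_per_day(seconds_per_compound: int, processors: int = 1) -> int:
    """Number of protein/compound pairs that can be docked in one day,
    assuming perfect parallel scaling and no overhead."""
    return processors * SECONDS_PER_DAY // seconds_per_compound


def days_to_screen(library_size: int, seconds_per_compound: int, processors: int = 1) -> float:
    """Wall-clock days needed to dock every compound of a library once."""
    return library_size * seconds_per_compound / (processors * SECONDS_PER_DAY)


if __name__ == "__main__":
    print(compounds_per_day(60))                  # one minute per pair, single PC
    print(compounds_per_day(60, processors=128))  # same method, 128 processors (example value)
    print(compounds_per_day(1800))                # half an hour per pair, single PC
    print(round(days_to_screen(500_000, 2), 1))   # seconds per pair, large library, single PC
```

At one minute per pair a single machine handles 1,440 compounds a day, which matches the "thousands" regime in the text, while a method that needs half an hour per pair manages only 48.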

The high conformational flexibility of small molecules as well as the subtle structural changes in the protein binding pocket upon docking (induced fit) are major complications in docking. Furthermore, docking necessitates careful analysis of the binding energy. The energy model is cast into the so-called scoring function that rates the protein–ligand complex energetically. Challenges in the energy model include the handling of entropic contributions and solvation effects, and the computation of long-range forces in fast docking methods.

The state of the art in docking can be summarised as follows (see also Table 1). Handling the structural flexibility of the drug molecule can be done within the regime up to about a minute per molecular complex on a PC (see, eg, Kramer et al. [102]). A suitable analysis of the structural changes in the protein still necessitates more computing time.

Today, tools that are able to dock a molecule to a protein within seconds are still based on rigid-body docking (the conformational flexibility of both the protein and the ligand is omitted).

[Table 1: contents not recoverable from the source.]

Recently, fast docking tools have been adapted to screening combinatorial drug libraries (see, eg, Rarey and Lengauer [103]). Such libraries provide a carefully selected set of molecular building blocks together with a small set of chemical reactions that link the modules. In this way, a combinatorial library can theoretically provide a diversity of up to billions of molecules from a small set of reactants.
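The "billions of molecules from a small set of reactants" figure is a consequence of multiplying the number of choices at each attachment position. The Python sketch below shows the count and a lazy enumeration; the building-block names, the number of positions and the block counts are invented for illustration, and the sketch ignores chemistry entirely (it treats a product as a tuple of building blocks).

```python
from itertools import islice, product
from math import prod


def library_size(building_blocks) -> int:
    """Number of distinct products when one block is chosen independently per position."""
    return prod(len(blocks) for blocks in building_blocks)


def enumerate_products(building_blocks):
    """Lazily yield every combination of building blocks, one block per position."""
    return product(*building_blocks)


if __name__ == "__main__":
    # Three positions with 1,000 hypothetical reactants each: 3,000 reactants in total.
    positions = [range(1000), range(1000), range(1000)]
    print(library_size(positions))

    # Small named example; only the first few products are materialised.
    small = [["A1", "A2"], ["B1", "B2", "B3"]]
    print(list(islice(enumerate_products(small), 3)))
```

Because the library grows multiplicatively while the reactant set grows additively, enumerating and docking every product individually quickly becomes infeasible, which is the motivation for docking tools that work on the building blocks.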

The accuracy of docking predictions lies within 50–80 per cent ‘correct’ predictions depending on the evaluation measure and the method. That means that docking methods are far from perfectly accurate. Nevertheless, they are very useful in pharmaceutical practice. The major benefit of docking is that a large drug library can be ranked with respect to the potential that its molecules have for being a useful lead compound for the target protein in question. The quality of a method in this context can be measured by an enrichment factor. Roughly, this is the number of active compounds (drugs that bind tightly to the protein) in a top fraction (say the top 1 per cent) of the ranked drug database divided by the same figure for a randomly arranged drug database. State-of-the-art docking methods in the middle regime (minutes per molecular pair), eg FlexX [104], achieve enrichment factors of up to about 15. Fast methods (seconds per pair), eg FeatureTrees [105], achieve similar enrichment factors, but deliver molecules similar to known binding ligands and do not detect as diverse a range of binding molecules.
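The enrichment factor defined above can be computed directly from a ranked list of activity labels. The following Python sketch is one reading of that definition: the expected count for a randomly arranged database is taken to be the overall active fraction times the size of the top slice, which is an assumption about how "the same figure" is evaluated. The function name and the example data are hypothetical.

```python
def enrichment_factor(ranked_is_active, top_fraction=0.01):
    """Actives found in the top fraction of a ranked list, divided by the number
    expected in a slice of the same size from a randomly ordered list."""
    n = len(ranked_is_active)
    total_active = sum(ranked_is_active)
    if n == 0 or total_active == 0:
        raise ValueError("need a non-empty list with at least one active compound")
    top_n = max(1, round(n * top_fraction))
    hits = sum(ranked_is_active[:top_n])
    # Expected hits at random = total_active * top_n / n, so the ratio is:
    return hits * n / (total_active * top_n)


if __name__ == "__main__":
    # Invented example: 1,000 ranked compounds, 20 actives, 3 of them in the top 10.
    ranking = [True] * 3 + [False] * 7 + [True] * 17 + [False] * 973
    print(enrichment_factor(ranking))  # 15.0
```

A value of 1 means the ranking is no better than random ordering, and the maximum attainable value depends on how many actives exist relative to the size of the top slice.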

Even if the structure of the protein binding site is not known, computer-based methods can be used to select promising lead compounds. Such methods compare the structure of a molecule with that of a ligand that is known to bind to the protein, for instance, its natural substrate.

Alternatives to docking for lead finding include high-throughput screening (HTS). This laboratory method allows for testing the binding affinity of several thousand or more compounds to the same target protein in a day. In comparison, this method has the advantage that it does not have to deal with insufficiently powerful computer models, at the expense of high laboratory cost and the absence of structural knowledge on ‘why’ a compound binds to the protein.

CONCLUSION

In summary, the field is still in an early stage of development. Ab initio protein structure prediction continues to be a grand challenge for which no comprehensive solution is in sight. The quality of homology-based fold prediction is rising, and tools have reached the stage where one can generate confident predictions for soluble proteins that, in a substantial fraction (about half) of the cases, provide significant threading hits in the structure database. Protein threading and homology-based prediction become especially helpful in an environment where the methods can be used in concert with experimental techniques for structure and function determination. Here, the prediction methods can exercise their strengths, which lie in being used interactively by experts and making suggestions that can be followed up by succeeding experimentation, rather than being required to provide proven fact. The process of going from structure to function is far from being automated. In a scenario that combines structure prediction methods with experimentation, the step from structure to function can be performed in a customised manner.

Protein structure prediction by homology is definitely not yet a turn-key technology. But we can expect it to enter the ‘production’ stage through the activities in structural genomics. Still, the field of protein structure prediction is very busy, generating the tools and processes for raising the number of confident structure predictions and the accompanying estimates of significance. Problems for applying these results in drug design are not only that the models may not be sufficiently accurate but also that the structures of many interesting target proteins will not be accessible by homology-based modelling at all for some time to come. This includes the therapeutically particularly interesting class of membrane proteins, for which essentially no structures have been resolved.

Docking is used frequently in structure-based drug design. To the authors’ knowledge, the first drug developed with structure-based techniques was the carbonic anhydrase inhibitor Dorzolamide. In the past few years structural considerations have begun to pervade the design of new drugs. A case in point is that of the neuraminidase inhibitors for influenza. Such studies mostly involve experimentally resolved protein structures. However, even models can serve to guide drug development. Based on the experimentally resolved structure of the membrane protein bacteriorhodopsin, several groups are attempting to model binding sites of G-protein coupled receptors that are believed to be structurally similar. Nevertheless, the authors are not aware of any instance where the whole process line from the protein sequence to the lead structure has been exercised in an integrated manner and with significant help of computer predictions. The field has not reached this level of maturity yet. While structural aspects, even as predicted by the computer, can be expected to invade the search for target proteins and the development of new drugs, experimental data, where they are accessible, will always be highly welcome and often be indispensable in this process.

Acknowledgements

We thank Matthias Rarey for helpful comments on this paper and Gerhard Barnickel and Gerhard Klebe for information on the state of drugs developed by structure-based techniques.
