Using A Neural Network and Spatial Clustering to Predict the Location of Active Sites in Enzymes

Alex Gutteridge, Gail J. Bartlett and Janet M. Thornton*

EBI, Wellcome Trust Genome Campus, EMBL Outstation Hinxton, Cambridgeshire CB10 1SD, UK

Structural genomics projects aim to provide a sharp increase in the number of structures of functionally unannotated, and largely unstudied, proteins. Algorithms and tools capable of deriving information about the nature, and location, of functional sites within a structure are therefore increasingly useful. Here, a neural network is trained to identify the catalytic residues found in enzymes, based on an analysis of the structure and sequence. The neural network output, and spatial clustering of the highly scoring residues, are then used to predict the location of the active site.

A comparison of the performance of differently trained neural networks is presented that shows how information from sequence and structure come together to improve the prediction accuracy of the network. Spatial clustering of the network results provides a reliable way of finding likely active sites. In over 69% of the test cases the active site is correctly predicted, and a further 25% are partially correctly predicted. The failures are generally due to the poor quality of the automatically generated sequence alignments.

We also present predictions identifying the active site, and potential functional residues, in five recently solved enzyme structures not used in developing the method. The method correctly identifies the putative active site in each case. In most cases the likely functional residues are identified correctly, as well as some potentially novel functional groups.

© 2003 Elsevier Science Ltd. All rights reserved

Keywords: bioinformatics; structural genomics; functional prediction; neural network; active sites

*Corresponding author

Introduction

The huge increase in the rate of DNA sequencing, and the use of gene prediction technologies, such as Genscan1,2 and Genewise,3 have flooded protein databases with new sequence data. The various structural genomics initiatives (SGI) now aim to produce a similar increase in the amount of structural information.3 One of the most important tasks in biology today is to use these data to provide functional annotation that leads to biologically useful knowledge.4

One type of information that structural data can provide is the location and nature of the functional regions of a protein, such as protein–protein interaction sites and ligand binding pockets. Knowing the location of the functional sites within a protein allows the study of targeted mutants, structure-based drug design, and functional annotation of the protein by comparison with other characterised proteins.

Most novel genes are functionally annotated using sequence analysis to find similar genes of known function, typically by running one of the various flavours of BLAST5 or profile and HMM approaches such as Pfam.6 Several studies7–9 have pointed out the problems associated with homology-based functional annotation and indicate a cut-off of sequence identity as high as 40%, below which it is dangerous to transfer anything but the broadest functional annotation. It is certain that some sequences in the public databases are incorrectly annotated due to the difficulty of transferring function based purely on sequence similarity.

0022-2836/03/$ - see front matter © 2003 Elsevier Science Ltd. All rights reserved

Present addresses: A. Gutteridge, Birkbeck College, University of London, Malet St, Bloomsbury, London WC1E 7HX, UK; G. J. Bartlett, Department of Biochemistry and Molecular Biology, University College London, Gower St, London WC1E 6BT, UK.

Abbreviations used: SGI, structural genomics initiatives; HMM, hidden Markov model; TIM, triose phosphate isomerase; RGS, regulator of G-protein signalling; ET, evolutionary trace; RSA, relative solvent accessibility; MCC, Matthews correlation coefficient; DOPS, diversity of position score; FEM, factors essential for methicillin resistance; PDB, Protein Data Bank.

E-mail address of the corresponding author: [email protected]

doi:10.1016/S0022-2836(03)00515-1 J. Mol. Biol. (2003) 330, 719–734

As more data comes through from structural genomics it is likely that a similar approach will be taken to annotate proteins using structural homologues. The idea that proteins with similar structures perform similar functions has been examined closely.10–12 Generally, although it is true that structure is more conserved than sequence at great evolutionary distance, the transfer of function based on structural similarity is no more reliable than annotation based on sequence similarity. The problem is the limited number of unique folds found in nature, which has been estimated to be as low as 1000.13 Since the number of functions performed by proteins far exceeds this number, it follows that one fold must be capable of many functions. The triose phosphate isomerase (TIM) barrel fold, for instance, is associated with 61 different EC14 numbers, covering five of the six top level EC classifications.15

Methods to locate and characterise the functional sites of a protein could provide data for functional annotation in ways not based on homology, as well as providing information for mutagenesis and drug design studies. Traditional molecular biology techniques for finding functional sites, such as mutagenesis,16 pH dependence17 and chemical labelling18 are generally time consuming, and rely on some prior knowledge of the function of the protein to allow it to be assayed.

In silico methods for finding and annotating functional sites would clearly be of great help in annotating novel protein structures from structural genomics. Several different strategies have already been developed; however, none has been used to perform an analysis across the whole structure database. Pattern matching approaches such as TESS,19 FFF20 and SPASM21 aim to locate functional sites and annotate structures by finding small three-dimensional motifs within the structure. The disadvantage of this is that suitable motifs have to be derived (usually from literature, though data-mining techniques have been used for automatic extraction of motifs22) and truly novel structures may not match any known motif. Recent studies23 have also presented methods for finding similarities between cavities on the protein surface which could be used to annotate structures, once the functionally important cavities are identified.

Techniques for finding functional sites de novo, such as evolutionary trace (ET),24–26 and other similar methods27–29 generally focus on searching for three-dimensional clusters of conserved residues. ET studies have made genuine, experimentally confirmed, predictions for the location of functional sites in G-proteins30 and regulator of G-protein signalling (RGS) proteins31 demonstrating the potential of this type of technique.

Some proteins, including those targeted by structural genomics, have no sequence homologues and so conservation-based approaches do not work. Methods that use only structural information to locate functional residues have been developed to provide functional annotation for these proteins.32,33 These techniques rely on identifying residues with unusual electrostatic and ionisation properties, and have shown a correlation between these residues and functional sites within the protein.

Here, we describe a new method for de novo prediction of functional sites specific for the active sites of enzymes. Instead of searching for clusters of conserved residues, a neural network is used to score the residues of a protein structure by the likelihood that they are catalytic. By searching for clusters of high-ranking residues the algorithm determines the most likely active site. The neural network is trained using a dataset of proteins for which the catalytic residues have been confidently located by experiment. Structural parameters such as the solvent accessibility, type of secondary structure, depth, and cleft that the residue lies in, as well as the conservation score and residue type are used as inputs for the neural network.
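As an illustration only, the inputs listed above could be assembled into a numeric vector along the following lines. The encoding shown (one-hot categories, ten cleft slots with 0 meaning "not in a cleft", and the names `one_hot` and `encode_residue`) is an assumption made for this sketch and is not taken from the study.

```python
# Hypothetical encoding of one residue's parameters as a network input vector.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"          # 20 standard residue types
SECONDARY = ("helix", "sheet", "coil")        # secondary structure classes


def one_hot(value, categories):
    """Return a list with 1.0 at the position of `value` and 0.0 elsewhere."""
    return [1.0 if value == c else 0.0 for c in categories]


def encode_residue(res_type, conservation, dops, rsa, depth,
                   secondary, cleft_rank):
    """Build a flat feature vector for a single residue.

    cleft_rank: 0 if the residue is not in a cleft, otherwise 1..9 for the
    largest to ninth-largest cleft (an assumed convention).
    """
    continuous = [conservation, dops, rsa, depth]
    return (continuous
            + one_hot(secondary, SECONDARY)
            + one_hot(cleft_rank, range(10))
            + one_hot(res_type, AMINO_ACIDS))


# Example: a conserved, mostly buried histidine in the largest cleft.
example = encode_residue("H", 0.9, 0.8, 0.12, 3.1, "coil", 1)
```

Each categorical parameter contributes exactly one non-zero entry, so the vector has a fixed length regardless of the residue described.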

Results

Analysis of parameters

A detailed analysis of the parameters is provided by Bartlett et al.34 A brief summary is presented here.

Conservation was the most powerful parameter for discriminating catalytic and non-catalytic residues. For some proteins, however, sufficiently diverse homologues could not be found to generate meaningful conservation scores. It is hoped that in these cases the predictive power of the other parameters will be enough to allow reliable predictions to still be made.

Catalytic residues show a tendency to be buried within the structure and so have a lower relative solvent accessibility (RSA) than other residues, particularly non-catalytic polar residues, the majority of which lie exposed on the surface of the protein. Despite this tendency to be buried, catalytic residues are often found lining a large cleft. This tendency is particularly marked for the largest cleft, and is significant for the second and third largest clefts. For clefts smaller than this (fourth to ninth largest) the difference is not particularly significant.

There is a slight tendency for catalytic residues to prefer coil regions over helix or sheet regions; this could be due to the extra conformational flexibility this gives them (allowing the active site to change conformation on ligand binding).

The hydrophobic and small residue groups were found to be very rarely catalytic, presumably because they do not contain the chemical groups required for most catalytic tasks. The obvious exceptions to this are when the backbone amide and carbonyl groups perform catalytic functions. It was found that glycine is the residue most often used in this case.

Depth

Depth values were calculated for the non-catalytic residues in the data set, and the distribution (Figure 1) shows that almost 40% of residues lie on, or near, the surface of the protein, with depths less than 1 Å. These residues are almost completely exposed to the surface with only a few of their atoms not solvent accessible. The proportion of the total represented by each 1 Å division then decreases steadily, apart from a small peak in the 4–5 Å division. Presumably this second peak is due to invaginations on the protein surface, which alter the distribution from the smooth decrease one would expect given a perfectly spherical protein. The very deepest residues in this data set lie at ~13 Å. Catalytic residues show a different distribution, with only 17% lying in the outer 1 Å; the majority occupy the next partially buried layer between 2 Å and 4 Å. This allows the catalytic residues to have some solvent accessibility (in order to interact with the substrates) whilst remaining mostly buried (to allow themselves to be correctly orientated by other residues). The catalytic residues rarely have depths greater than 5 Å.

An example: quinolate phosphoribosyltransferase

As an example of the neural network output, the scores along the 286 amino acid sequence of quinolate phosphoribosyltransferase (1QPR) are shown in Figure 2. Most residues score very low (a large majority score less than 0.01), and around 20 residues score over 0.5. The four known catalytic residues (Arg105, Lys140, Glu201 and Asp222) all score highly, though several other residues score as high or higher. There is some grouping of the high-scoring residues in the sequence, particularly around residue 140, but most high scores are isolated spikes. When the scores are mapped on to the 1QPR structure (Figure 3) the high-scoring areas, although widely separated in the sequence, are brought together and cluster into two areas corresponding to the two active sites of the quinolate phosphoribosyltransferase homodimer.

Figure 1. Distribution of residue depths for non-catalytic residues and catalytic residues.

Figure 2. The distribution of neural network scores along the sequence of 1QPR. The true catalytic residues are highlighted.

Training the network

The training process is tracked by measuring the Matthews correlation coefficient (MCC) after each epoch; Figures 4 and 5 show how the MCC varies as training progresses. The variation in performance between runs is quite considerable, with the final MCC varying between 0.25 and 0.35, reflecting the natural variation within the data set. Figure 5 shows the MCC varying with each epoch averaged over all ten runs. The network reaches its best MCC after only 30 epochs or so, levelling off at an average MCC of around 0.28. There is no evidence of over-fitting in the results, as the MCC does not fall significantly once it has plateaued.
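For reference, the MCC tracked during training can be computed from the counts of true and false positives and negatives using its standard definition. The sketch below is a minimal illustration; the function name and the choice to return 0.0 when the denominator vanishes are assumptions of this example.

```python
import math


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    tp/fn: catalytic residues predicted catalytic / non-catalytic.
    tn/fp: non-catalytic residues predicted non-catalytic / catalytic.
    Returns 0.0 when the coefficient is undefined (a zero denominator).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

Because catalytic residues are a small minority of all residues, the MCC is a more informative measure here than raw accuracy, which a network could maximise by predicting nothing as catalytic.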

Network weights

The relative strengths of the weights that the network converges to are shown in Figure 6. Conservation and diversity of position score (DOPS) are both highly weighted. As expected the network also looks for buried residues, as RSA is given a negative weighting. The cleft categories show that lying in a cleft, and the size of that cleft, are important factors in the network score, though not as important as conservation or RSA. Depth is not weighted strongly in either direction, and is not important in making a prediction. The difference for the secondary structure parameters is also small. Residue type has a very large variation with histidine, cysteine and the charged residues (aspartate, glutamate, lysine and arginine residues) all scoring highly, whilst the hydrophobic residues score low.

The high DOPS weighting is interesting, as it is the same for all residues within a protein chain. The only effect is to raise all the scores of all residues in chains with high DOPS and lower all the scores of all residues in chains with low DOPS. The network has learnt that when DOPS is low it is better, in terms of the overall error rate, to make no catalytic predictions at all, rather than predict everything to be catalytic. Since the clustering algorithm uses residues based on their rank rather than absolute scores, this makes no difference in the later stages.
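The rank argument can be checked with a toy calculation. The sketch below assumes a logistic output unit and uses made-up per-residue sums and a made-up DOPS weight (`pre_activations`, `W_DOPS`); none of these values come from the trained network. A term that is constant within a chain shifts every score but leaves the ordering of residues unchanged.

```python
import math


def sigmoid(x):
    """Logistic function, assumed here as the network output function."""
    return 1.0 / (1.0 + math.exp(-x))


def rank_order(scores):
    """Indices of residues sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


W_DOPS = 1.2                                  # hypothetical DOPS weight
pre_activations = [-2.0, 0.5, 1.7, -0.3]      # hypothetical per-residue sums


def chain_scores(pre, dops):
    """Add the chain-wide DOPS contribution to every residue, then squash."""
    return [sigmoid(p + W_DOPS * dops) for p in pre]


high_dops = chain_scores(pre_activations, 0.9)
low_dops = chain_scores(pre_activations, 0.1)

# Absolute scores differ between the two chains, but the ranking does not.
same_ranking = rank_order(high_dops) == rank_order(low_dops)
```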

Figure 3. (a) Distribution of neural network scores in the 1QPR structure. Residues are coloured by network score (red = high, blue = low). (b) The structure of the 1QPR homodimer, coloured by chain, with the known catalytic residues drawn in thick lines. All structure diagrams are prepared using PyMol.58

Figure 4. Training the neural network; each line represents one of the ten cross validation runs.

722 Predicting Active Sites Using Neural Networks

Page 5: Using A Neural Network and Spatial Clustering to Predict the ...

Clustering

In the network scoring we consider each residue as independent of the others; however, catalytic residues are likely to cluster together in the structure. Ranking and clustering the residues allows us to use this information to improve the predictions and locate the active site. For each structure a list of possible catalytic residues is generated by ranking the residues by network score. The clustering algorithm finds distinct clusters of these residues and generates a sphere that forms the predicted active site.

1158 clusters are generated from the test set, an average of 7.2 per protein. The multimeric nature of most of the proteins means that the average number of known active sites is 2.6 per protein. The distribution of sphere sizes for the known sites and all the predicted sites is shown in Figure 7. Figure 8 shows the sizes for the known sites and the top scoring predicted sites only.

Most predicted clusters are small and contain two or three members with a radius of 3–4 Å. In contrast, the top scoring predictions in each structure are generally large and lie at the upper end of the allowed size range (15 Å). The known sites generate spheres with sizes between 6 Å and 12 Å, though a significant number have a single catalytic residue and so have radii of 3 Å. A few outliers have spheres larger than 20 Å in radius. These cases all represent structures where the catalytic cluster is thought to come together upon substrate binding so the cluster appears very large in the unbound form.
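The rank-then-cluster idea can be sketched as follows. This is not the clustering algorithm used in the study, whose details are not given in this section: the single-linkage grouping, the `top_n` and `link_cutoff` parameters, the centroid-based sphere and the 3 Å minimum radius are all assumptions chosen to keep the example short, and the 15 Å size limit mentioned above is not enforced.

```python
import math


def cluster_residues(coords, scores, top_n=3, link_cutoff=4.0):
    """Group the top_n highest-scoring residues by single linkage.

    coords: {residue_id: (x, y, z)} representative atom positions (Å).
    scores: {residue_id: network score}.
    Two residues join the same cluster if they lie within link_cutoff Å.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_n]
    clusters = []
    for res in ranked:
        linked, rest = [], []
        for cluster in clusters:
            near = any(math.dist(coords[res], coords[m]) <= link_cutoff
                       for m in cluster)
            (linked if near else rest).append(cluster)
        # Merge the new residue with every cluster it touches.
        merged = [res] + [m for cluster in linked for m in cluster]
        clusters = rest + [merged]
    return clusters


def bounding_sphere(cluster, coords, min_radius=3.0):
    """Centroid of the cluster and a radius covering all of its members."""
    n = len(cluster)
    centre = tuple(sum(coords[r][k] for r in cluster) / n for k in range(3))
    radius = max(math.dist(centre, coords[r]) for r in cluster)
    return centre, max(radius, min_radius)


# Toy data: A and B are neighbours, C is distant, D ranks below the cut.
coords = {"A": (0.0, 0.0, 0.0), "B": (3.0, 0.0, 0.0),
          "C": (20.0, 0.0, 0.0), "D": (6.0, 0.0, 0.0)}
scores = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.1}
clusters = cluster_residues(coords, scores)
```

Each resulting cluster would then be scored (for example by summing the network scores of its members) so that the highest-scoring sphere can be reported as the top prediction.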

Comparing the predicted sites to the known sites

To test whether a prediction is correct, the overlap between the predicted site and the closest known active site is calculated. A correct prediction occurs when the overlap is greater than 50% of the volume of the known active site; a partially correct prediction occurs when there is some overlap but less than 50%; a failure occurs when there is no overlap between the known and predicted spheres.

Figure 5. MCC averaged over all ten cross validation runs.

Figure 6. The relative strengths of the weights placed on the various parameters. Categorical parameters such as residue type are grouped, with the lowest weight set at 0.
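The overlap criterion can be made concrete with the standard formula for the intersection volume of two spheres. The sketch below is an illustration under that assumption; the function names are hypothetical, and the paper does not state how the overlap volume was computed in practice.

```python
import math


def sphere_volume(r):
    """Volume of a sphere of radius r."""
    return 4.0 / 3.0 * math.pi * r ** 3


def overlap_volume(c1, r1, c2, r2):
    """Intersection volume of two spheres (centres c1, c2; radii r1, r2)."""
    d = math.dist(c1, c2)
    if d >= r1 + r2:                 # spheres do not intersect
        return 0.0
    if d <= abs(r1 - r2):            # one sphere lies inside the other
        return sphere_volume(min(r1, r2))
    # Lens-shaped intersection of two partially overlapping spheres.
    return (math.pi * (r1 + r2 - d) ** 2
            * (d * d + 2.0 * d * (r1 + r2) - 3.0 * (r1 - r2) ** 2)
            / (12.0 * d))


def classify(known_centre, known_r, pred_centre, pred_r):
    """Label a prediction by the fraction of the known site it covers."""
    fraction = (overlap_volume(known_centre, known_r, pred_centre, pred_r)
                / sphere_volume(known_r))
    if fraction > 0.5:
        return "correct"
    if fraction > 0.0:
        return "partial"
    return "incorrect"
```

For instance, a 10 Å predicted sphere that fully encloses a 5 Å known site is classed as correct, whereas two 5 Å spheres whose centres are 9 Å apart share only a thin lens of volume and are classed as partial.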

For each protein in the test set, the prediction with the highest total network score was selected and compared to the known sites. The results are shown in Figure 9: 62% of the proteins have the active site correctly identified, and a further 22% are partially correct.

When we consider the overlap for all the sites predicted for each protein we find the results improve: 69% of the proteins have the active site correctly identified and 25% have a partially correct prediction. The increase of only 7% when all predictions are considered shows that the highest scoring cluster is very often the true active site. Eleven cases were found where the top prediction was not correct, but one of the other predictions was. In six of these cases the correct cluster was the second or third highest scoring prediction and in four cases the correct cluster was the fourth or fifth highest scoring; in the final case the correct cluster was the seventh highest scoring.

When each of the 1158 predicted clusters is considered individually, as opposed to by each protein, 25% are found to be correct and 41% are partially correct. The high number of partial hits is presumably due to the tendency of the network to find residues lying near the active site, but which aren't close enough to the true catalytic residues to score as correct. It is also possible that many of the partially correct and incorrect clusters represent secondary functional sites such as ligand binding or protein–protein interaction sites. These clusters are biologically interesting, but are considered incorrect when searching solely for active sites.

Figure 7. Size distribution of all the predicted sites compared to the known sites.

Figure 8. Size distribution of the top scoring predicted sites compared to the known sites.

Figure 9. Pie chart showing the per protein accuracy when only the top prediction is considered for each protein, and when all predictions are considered.

Significance of results

To calculate the significance of these results, we estimate the probability (PR) of achieving this level of prediction by random chance. A similar method to that used by Aloy et al.26 is applied. To a reasonable approximation a correct hit occurs when the centre of the smaller of the two spheres lies within the volume of the larger. Since the known catalytic site is usually smaller than the largest predicted site, and assuming the prediction has an equal probability of being anywhere within the volume of the protein, PR is the ratio of the volume of the predicted sphere to the volume of the protein. Since most of the proteins are multimeric this ratio is then multiplied by the number of active sites (any one of which could have overlapped with the predicted site).

The volume of each protein is estimated by drawing a sphere around all the Cβ atoms of the structure, giving an average of 510,000 Å³. Since most catalytic residues lie in the outer 5 Å of the protein we shall consider the predictions restricted to only a third of this volume. The average volume of all the predicted spheres is 2632 Å³, and the average volume of the top scoring predictions is 5783 Å³. There are 7.2 predicted sites and 2.6 known sites per protein on average. A summary of the observed and expected rates of correct predictions for the three different analyses is shown in Table 1. We estimate the significance of the differences using equation (1),26 which follows a normal distribution with mean 0 and standard deviation 1. All the results are significant to better than 10⁻¹⁵.

$$ z = \frac{P_O - P_R}{\sqrt{\dfrac{P_R\,(1 - P_R)}{n}}} \tag{1} $$
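As a worked illustration of the random expectation and of equation (1), the sketch below plugs in the average values quoted in the text for the per-site and top-site analyses. The function names are hypothetical, and the observed per-site rate of 24.7% is the value reported in Table 1.

```python
import math


def expected_rate(sphere_volume, protein_volume, n_known_sites):
    """P_R: chance that a randomly placed predicted sphere hits a known site."""
    return (sphere_volume / protein_volume) * n_known_sites


def z_score(p_obs, p_rand, n):
    """Equation (1): z = (P_O - P_R) / sqrt(P_R (1 - P_R) / n)."""
    return (p_obs - p_rand) / math.sqrt(p_rand * (1.0 - p_rand) / n)


PROTEIN_VOLUME = 510_000.0   # average protein volume, in cubic angstroms
KNOWN_SITES = 2.6            # average number of known sites per protein

# Per-site analysis: average volume of all predicted spheres.
p_r_site = expected_rate(2632.0, PROTEIN_VOLUME, KNOWN_SITES)
# Top-site analysis: average volume of the top scoring spheres.
p_r_top = expected_rate(5783.0, PROTEIN_VOLUME, KNOWN_SITES)

# Significance of the observed per-site rate (24.7% of 1158 clusters).
z_site = z_score(0.247, p_r_site, 1158)
```

These values of P_R come out at roughly 1.3% and 2.9%, in line with the whole-volume expectations listed in Table 1, and the resulting z value lies far out in the tail of the standard normal distribution.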

Comparison of the performance of different networks

The neural network and clustering process incorporates a variety of different types of information: evolutionary information encoded in the conservation scores, residue propensities, structural information in the parameters and detailed structural information included in the clustering stage. To understand how these different types of information contribute to the overall performance, networks have been trained using different subsets of the parameters.

Two additional networks have been developed. First a network trained solely using sequence parameters (conservation, DOPS and residue propensities), and second a network trained using structural parameters but excluding conservation and DOPS scores. Residue propensity is included in the structural information as the sequence of a protein would always be known given a structure. The relative performance of the different networks is shown in Figure 10 and in detail in Table 2. The performance of each network in finding the location of the active site is shown in Figure 11 and Table 3.

The performance of the technique described by Aloy et al.,26 which uses conservation, residue propensity and clustering, is also shown in Table 3. That study uses the same sphere-based method as used here to assess the accuracy of the predictions, making comparison easy. The functional residues, however, were based on SITE records in PDB files, which are not as well defined as the catalytic residues used here. Aloy et al. analysed 106 proteins and found that 20 of them could not generate sufficiently diverse alignments to give good predictions. Since, here, we have included proteins with low DOPS, we include these 20 proteins as incorrect predictions when calculating performance. Once this is taken into account we find that of the 106 proteins, 68 are correctly predicted (64%), 13 partially correct (12%) and 25 are incorrect (24%), when all predictions are considered. This level of prediction is almost identical to that of the sequence trained network.

Predictions

The neural network was run on several recently published enzyme structures, which were not included in the original data, or subsequent analysis, to gauge the usefulness of the method in annotating structures.

Table 1. Observed and expected frequencies of correct results for the three analyses

| Analysis | Expected (PR) (%) | Expected (PR), 1/3 Vol (%) | Observed (PO) (%) |
|---|---|---|---|
| Per site (n = 1158) | 1.3 | 4.4 | 24.7 |
| Per protein (n = 159) | 9.6 | 32.2 | 69.2 |
| Top site (n = 159) | 2.9 | 9.8 | 62.3 |

Figure 10. Comparison of the MCC achieved by the three different networks in predicting catalytic residues, before and after structural clustering is applied.

Table 2. Comparison of the performance of the three different neural networks in predicting catalytic residues

| Data used | MCC (before clustering) | QPredicted (before) | QObserved (before) | MCC (after clustering) | QPredicted (after) | QObserved (after) |
|---|---|---|---|---|---|---|
| Structure | 0.19 | 0.10 | 0.41 | 0.23 | 0.10 | 0.57 |
| Sequence | 0.24 | 0.13 | 0.50 | 0.26 | 0.13 | 0.58 |
| Sequence + structure | 0.28 | 0.14 | 0.56 | 0.32 | 0.16 | 0.68 |

Figure 11. Comparison of the site prediction accuracy for the three different networks. Results are presented considering all predicted sites and the top scoring site only.

Table 3. Comparison of the performance of the three different neural networks in locating active sites

| Data used | Top sites only: Correct (%) | Top sites only: Partial (%) | Top sites only: Incorrect (%) | All sites: Correct (%) | All sites: Partial (%) | All sites: Incorrect (%) |
|---|---|---|---|---|---|---|
| Structure | 52.8 | 25.8 | 21.4 | 62.3 | 31.4 | 6.3 |
| Sequence | 57.2 | 27.7 | 15.1 | 63.5 | 28.3 | 8.2 |
| Sequence + structure | 62.3 | 21.4 | 16.4 | 69.2 | 24.5 | 6.3 |
| Aloy et al.26 | – | – | – | 64.2 | 12.3 | 23.5 |

SET domain histone lysine methyltransferases

Several recent papers35–40 have presented the first structures of histone lysine methyltransferase (HMTase) containing SET domains. SET domains are responsible for the methylation of specific lysine residues in histone proteins, leading to changes in chromatin regulation and gene expression. SET domains share no homology with other structurally characterised methyltransferases, and so the structure, and the functional information that the structure contains, is of significant importance.

The structure of yeast protein Clr4 (PDB code 1MVX) was used for the prediction of the functional sites. The PSI-BLAST search required 7 iterations to converge (using an E-value cut-off of 10⁻²⁰) and found ~150 homologues, producing a very diverse alignment. 31 residues score over the ranking cut-off and clustering reveals one large cluster, containing 19 of these residues. The output of the neural network mapped to the surface of the structure is shown in Figure 12. The dominant cluster forms the large L-shaped patch in the centre of the structure comprising residues His410, Asp450, Asn409, and Tyr451; the other residues in the cluster extend either side of the L-shaped patch and into the structure.

Mutations to His410, Cys412, Arg320, Glu446 and Arg406 have been shown to inactivate the enzyme,41,42 though it is suggested that Arg320 is most likely to be of structural, rather than catalytic importance. The structure with an AdoHcy cofactor bound is known for homologous SET domains. This reveals that Tyr451, Asn409 and His410 make contacts to the cofactor and Tyr357 is proposed as a possible catalytic proton source. A high resolution crystal structure of the human SET7/9 domain has been recently published.40 This study suggests catalytic roles for the residues equivalent to Tyr451, Tyr419, Tyr357 and the main-chain carbonyl oxygens of Asp403 and Phe408. Of these functional residues, the large predicted cluster contains His410, Glu446, Arg406, Tyr357, Tyr451 and Asp403. The neural network identifies the correct active site and many of the known functional residues.

Intron endonuclease I-TevI

The structure of the intron endonuclease I-TevIfrom bacteriophage T4 has recently beenpublished.43 Intron endonucleases catalyse a breakin double stranded DNA, that facilitates the inser-tion of introns and inteins. I-TevI contains separatecatalytic and DNA binding domains, the structureof the catalytic domain is analysed here (PDB code:1LN0). Using the default PSI-BLAST parameters thesequence of 1LN0 picks up no homologues.

Figure 12. (a) The front face of the SET domain showing the large, high-scoring surface patch and the Ado-HCysbinding cleft. Residues His410, Asp450, Asn409 and Tyr451 form the L-shaped patch in the centre, Arg406 andTyr357 lie in the pocket to the right. (b) The catalytic site of I-TevI, network scores are generated without conservationdata. (c) The catalytic site of I-TevI, showing the improved prediction once conservation data is included. (d) L-Arabi-nanase, the central red patch is made of His37 and Asp38. Asp158 and Glu221 both lie close by in the same pocket.(e) FemA, the binding cleft is the long green patch in the centre of the figure. The most likely catalytic site lies at thefar left end of the cleft. (f) The RlmB dimer. The subunits are stacked on top of each other running left to right. Thetwo active sites lie in the high-scoring regions in the interface between the two subunits.


Despite this, the network still makes predictions based purely on the structure and the residue propensities. The network identifies three residues (His31, His40, Ser42) forming the highest scoring cluster.

A putative active site is proposed based on conservation and mutagenesis data.44 The site is located in the same cleft identified by the network. Glu75 binds a divalent cation and is likely to be the principal functional residue. Other functional residues suggested by the authors include Tyr17, Arg27, His31 and His40. The 1LN0 structure has Arg27 mutated to alanine, as active I-TevI cannot be produced by Escherichia coli. Replacing Ala27 by arginine in the sequence presented to PSI-BLAST, and reducing the E-value cut-off to 10^-5, allows the network to improve the prediction. Twelve residues now form the largest cluster, including Tyr17, Arg27, His31 and His40. Glu75 still remains outside the predicted cluster, however.

This example demonstrates how the network can cope with structures occupying a sparsely populated region of sequence space. The prediction made only on the basis of residue propensities and structural data correctly identifies the active site and several functional residues. Once the mutated structure is corrected and conservation scores are added, the network makes improved predictions, correctly identifying the active site and many of the principal residues, though it still fails to predict the crucial Glu75. The problems of mutated structures and limited sequence homologues highlight some of the difficulties that would be encountered in a PDB-wide analysis.

α-L-Arabinanase

The structure of Cellvibrio japonicus arabinanase has been solved recently,45 revealing a novel five-bladed β-propeller fold (PDB code: 1GYD). Arabinanase hydrolyses the arabinan polymers found in plant cell walls. The PSI-BLAST search converges after only four iterations, finding only 11 homologues; however, the alignment is quite diverse and useful conservation scores are obtained. The highest scoring cluster is centred on the high-scoring pair of residues His37 and Asp38. The other residues in the cluster are Ser86, Ser112, His92, Trp94, Gln316, Asp158, Thr58, His291, Tyr308, Ser52 and Thr53.

The authors of the paper used analogy with other enzymes,46 conservation, and mutagenesis to identify Asp38 and Glu221 as the likely catalytic groups. A third carboxylate, Asp158, is suggested to be involved in pKa modulation or positioning of the Glu221 side-chain.

The neural network correctly identifies the three acidic residues as catalytic (all are highly ranked); however, the clustering algorithm does not link Glu221 into the cluster containing Asp38 and Asp158 (even though Glu221 is the highest scoring residue in the protein). Altering the clustering parameters to join residues separated by less than 5 Å (rather than the default 4 Å) allows Glu221 to join the main cluster.

FemA

FemA is a Staphylococcus aureus protein identified as a member of the Fem (factors essential for methicillin resistance) family, a series of antibiotic resistance genes.47,48 FemA is responsible for the addition of glycines to peptidoglycan molecules in the bacterial cell wall. The structure is the first example of this important family49 (PDB code: 1LRZ). PSI-BLAST converges after four iterations, finding 40 homologues, and generates a diverse alignment. The network scores mapped to the structure are shown in Figure 12. The high-scoring residues line the large cleft that runs the length of the protein. The clustering algorithm suggests a seven-residue cluster comprising the high-scoring residues His106 and His29, and five other lower scoring residues. This cluster lies at the very end of the cleft. Another five-residue cluster lies approximately halfway along the cleft, comprising Lys383, Phe382, Ser342, Ser314 and Thr332. The crystal structure does not have any ligand bound, and no mutagenesis data are available to pinpoint the actual catalytic residues.

The cleft is the only structure large enough to accommodate the peptidoglycan substrate and hence is the most likely binding site, though a conformational change on substrate binding cannot be ruled out. The network suggests several residues as potential catalytic groups and further experimentation is required to confirm which, if any, of these residues form the catalytic centre.

RlmB 23S rRNA Methyltransferase

RlmB is an Escherichia coli protein representing the novel Ado-Met dependent methyltransferase class, SPOUT. RlmB is responsible for the methylation of a specific guanosine group in the 23S rRNA component of the ribosome.50 The crystal structure of the enzyme has recently been solved51 (PDB code: 1GZ0). PSI-BLAST converges after iteration five, having found 100 homologues, and generates a very diverse alignment. RlmB forms a homodimer in solution and the high-scoring residues cluster into two almost identical sites in the dimer interface region. Each site contains residues from both chains A and F. The highest scoring residue is Arg114, which is involved in a salt-bridge with Glu198 from the opposite chain. Surrounding this pair are His9, Asp117, Glu147, Ser148 and Gly144 from the same chain as Arg114, and Ser224, Leu225, Asn226 and Ser228 from the same chain as Glu198. A secondary cluster comprising Asp105, His107 and Asn108 lies 4.3 Å from this main cluster.

The authors propose a putative active site based on conservation of three previously identified motifs, found in most methyltransferases.52,53

Motif I covers residues Asn108 to Arg114, motif II covers Glu198 and motif III covers Ser224, Leu225 and Asn226. They also report that mutagenesis of the equivalent residue to Glu198 in a homologue abolishes methyltransferase activity. Glu198 and Ser224 are suggested as possible catalytic bases. His9 is implicated in RNA binding; however, several other putative RNA binding residues are not identified strongly by the network.

The network has correctly identified the putative catalytic centre, though again the clustering has split the site, leaving part in a small secondary cluster.

Discussion

One of the original aims of the project, to predict catalytic residues from structures, has proven to be an extremely difficult task given the narrow definition of catalytic used here. The MCC of 0.28 (or 0.32 if clustering is used) is too low to realistically use the simple predictions from the neural network in identifying catalytic residues directly. The main problem is the high number of false positives: 56% of catalytic residues are identified correctly, but only one in seven catalytic predictions is correct.

Visual inspection of the results shows that many of the false positives are other functional residues lying in the active site, such as substrate binding and metal binding residues. These residues have very similar properties to the catalytic residues: they are conserved, have low solvent accessibility and lie in clefts. They also lie extremely close to the true catalytic residues and do not form a distinct or separate spatial cluster. A system looking to identify any functional residues at the active site may well consider these false positives to be true positives; however, given the definition used here they are errors. As well as the problem of these false positives there is the inherent difficulty of picking the handful of catalytic residues from hundreds in the protein. The ratio of catalytic to non-catalytic residues is around one in one hundred across the entire data set. Given these difficulties the low success rate is understandable and not as disappointing as it first appears.

The network weights and the performance of the sequence-only neural network show that evolutionary information, encoded in conservation scores, is very important in making a prediction. This network reflects the performance that one could expect to achieve when predicting catalytic residues purely from sequence data. We see from the QObserved and QPredicted values in Table 2 that 50% of catalytic residues are found by this network, but only one in eight of the predictions is correct.

Structural genomics projects aim to provide some level of structural information for the majority of protein sequences. Some of these proteins will not have any known sequence homologues and the structure will be the only information available. The neural network trained without conservation scores reflects the performance one could expect to achieve when analysing these proteins. The network alone performs poorly; however, the structural information can also be used to cluster the predictions in these proteins. When this form of structural information is incorporated the overall performance rises almost to the level of the sequence network, and 57% of the catalytic residues are correctly predicted, though the true positives are still only one in ten of the catalytic predictions.

For the majority of structural genomics targets there are some sequence homologues and in these cases both types of information can be incorporated. The network trained using sequence and structure outperforms both the other networks with an MCC of 0.28 rising to 0.32 when clustering is used (Table 2). 68% of catalytic residues are correctly predicted and one in six of the catalytic predictions is correct.

Although predicting the catalytic residues is difficult, predicting the location of the active site can be done with significant levels of success (Table 3). When only structural information is used the clustering algorithm is still able to identify the catalytic cluster correctly in 62% of proteins and partially correctly in a further 31%. This suggests that even for structural genomics targets where no conservation data are available, it will still be possible to make significant predictions about the location of the active site.

The neural network trained using sequence data identifies 63.5% of sites when all predictions are considered. This level of performance is similar to the technique described by Aloy et al.,26 which also uses conservation, residue propensities and clustering. It should be noted that Aloy et al. compared their predictions to the SITE records of PDB files, which are less rigorously defined than the catalytic clusters used here, and generally comprise a larger number of residues. The performance of the neural networks used here is therefore likely to be underestimated compared to Aloy et al.

As with the neural network output, when structure and sequence are combined the performance exceeds that of sequence or structure alone. In this case 69% of sites are correct considering all predictions and 62% considering only the top prediction. A further 25% of sites are partially correctly predicted when all predictions are considered and 22% when only the top prediction is considered. The method fails to make a useful prediction in only 6% of cases when all the predictions are examined.

One of the justifications for the large investment made in structural genomics is that it will allow identification of functional sites and residues in cases where it is not possible from sequence. The results we have shown here indicate that structure alone can be used to identify catalytic residues and active sites in enzymes; however, evolutionary history encoded in the form of conservation scores is an extremely rich source of information for making these types of predictions and should be incorporated at every opportunity. The improvement in performance when structure and sequence are used shows that structural information, other than that used for clustering, should be incorporated into de novo prediction techniques such as evolutionary trace.

Why did the failures fail?

When considering the top scoring sites in each protein we find that for 16% of the proteins there was no overlap between the predicted spheres and the known catalytic cluster. It is important to understand why these failures occurred in order to improve the algorithm and assess whether there are specific types of enzyme on which the algorithm performs consistently badly.

Poor alignments

The alignments automatically generated by PSI-BLAST are the most likely point of failure. The optimal E-value cut-off for each family varies depending on its size and diversity. The single E-value cut-off used represents the best compromise, but still generates poor alignments for some families. To test whether poor alignments are the major source of error, the difference between the conservation of the catalytic residues and the conservation of all residues was calculated and averaged for each group of results (correct, partial and incorrect); the results are shown in Figure 13. The different groups clearly show a variation in the distinction between conservation of catalytic and non-catalytic residues. In the correctly predicted group the difference is more than 0.3; this falls to 0.25 for the partially correct group, and the incorrect group has an average difference of only 0.15. Clearly, given the importance of conservation scores in making predictions, a lack of differentiation between the conservation of catalytic and non-catalytic residues will reduce the overall accuracy.
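The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`conservation_gap`, `average_gap_by_group`) and the input layout are hypothetical and are not taken from the original analysis.

```python
from statistics import mean

def conservation_gap(conservation, catalytic):
    """Mean conservation of the catalytic residues minus the mean over all residues.

    conservation: dict mapping residue id -> conservation score in [0, 1]
    catalytic:    iterable of residue ids annotated as catalytic
    """
    return mean(conservation[r] for r in catalytic) - mean(conservation.values())

def average_gap_by_group(proteins):
    """Average the per-protein gap within each result group.

    proteins: list of (group, conservation, catalytic) tuples, where group is
    a label such as "correct", "partial" or "incorrect" (labels are assumed).
    """
    gaps = {}
    for group, conservation, catalytic in proteins:
        gaps.setdefault(group, []).append(conservation_gap(conservation, catalytic))
    return {group: mean(values) for group, values in gaps.items()}
```

A group whose catalytic residues are no better conserved than the rest of the chain gives a gap close to zero, which is the situation the text associates with failed predictions.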

This trend implies that unusual conservation scores are responsible for a large part of the failure rate. The low difference in conservation scores in the failure group could be explained if these proteins all had low DOPS. The DOPS for each protein chain were averaged for each category and are also shown in Figure 13. There is a correlation between DOPS and the success of a prediction; however, looking at the scores themselves shows that, although some chains have very low DOPS, most are just as high as the average correctly predicted protein. If low DOPS were responsible for all of the failures then one would not expect the average conservation of the catalytic residues to vary across the three groups. However, a clear trend of increasing catalytic conservation in the correct predictions is detected and shown in Figure 13.

How then to explain these anomalous conservation scores? The assumption must be that these enzymes are part of a larger family of proteins, which have different catalytic activities. Catalytic residues conserved within a sub-family would therefore vary between members of the family and not necessarily be conserved. Several examples of this can be seen in the failed structures. Calpain (1DKV), for instance, contains an EF-hand domain, which is even found in non-enzymes. This means the catalytic residues of Calpain are not conserved in many of the homologues a PSI-BLAST search returns, whilst other residues involved in forming the EF-hand are conserved. This pattern of conservation is the inverse of what the network is expecting, and so it fails to correctly predict the catalytic residues.

Clustering errors

Of the 26 structures that failed to find the active site when only the top site was considered, ten also failed when all sites were considered. In these ten cases the error occurs prior to clustering, generally with poor alignments from PSI-BLAST. Of the remaining 16, 11 generated a lower scoring correct cluster and five generated a lower scoring partially correct cluster. These 16 cases are failures of the clustering algorithm to find the right cluster, presumably because the signal from the true active site was weak compared to other sites in the protein.

Figure 13. The difference in DOPS and conservation between catalytic and non-catalytic residues in the three groups of results.

If each structure is analysed by hand, the fault is generally obvious. The single-linkage algorithm is prone to forming long aspherical clusters, since two separate clusters can be joined even if only a single residue joins them. In several failures the true active site is a relatively compact cluster with a few high-scoring residues, whilst the top scoring prediction is a large cluster which out-scores the others by its size even if no single residue scores highly. Another problem is that the algorithm tends to select clusters buried in the protein, since these contain more residues than surface clusters; a human can easily spot that these are not suitable active sites.

Further work

Analysing the structures for high-scoring surface patches, as well as simple clusters, might help in identifying the location of active sites, particularly if the top scoring cluster is deeply buried and hence unsuitable as a catalytic centre. Patch analysis has been used to identify other surface features, such as protein–protein interaction sites and ligand binding pockets.

The predicted clusters can also be used to automatically generate three-dimensional templates for analysis by one of the pattern searching algorithms, such as TESS and SPASM. Designing templates by hand is a time-consuming job and automated methods, such as this and the method recently described by Oldfield,22 could be used to quickly generate starting templates suitable for manual refinement.

The basic methodology of neural network scoring of residues and spatial clustering could be used to find other types of functional sites such as non-obligate protein–protein interfaces or protein–DNA interaction sites. Many of the secondary clusters found by this network may be more confidently predicted by networks trained on these other functional classes. A novel protein structure could be presented to each network in turn and different types of functional sites identified at each stage.

Materials and Methods

Protein test set

The protein test set and the compilation of the data are described in detail in a recent paper by Bartlett et al.34 The original test set contains some proteins with homologous non-catalytic domains; for this study these redundant structures have been removed. The final test set contains 159 proteins from the PDB,54 containing no homologous pairs and covering all six top level enzyme classification (EC)14 numbers. This data set contains approximately 55,000 non-catalytic residues and 550 catalytic residues available for training the network.

Compilation of data

The catalytic residues were defined using the following rules:

(1) Direct involvement in the catalytic mechanism (e.g. as a nucleophile).

(2) Exerting an effect on another residue or water molecule, which is directly involved in the catalytic mechanism, which aids catalysis (e.g. by electrostatic or acid–base action).

(3) Stabilisation of a proposed transition-state intermediate.

(4) Exerting an effect on a substrate or cofactor which aids catalysis, e.g. by polarising a bond which is to be broken. Includes steric and electrostatic effects.

Note that residues that bind substrate, cofactor or metal ions are not included, unless they also perform one of the functions listed above.

Many studies have used the SITE records defined in PDB files as the basis for defining functional residues and sites. Unfortunately SITE records are not a homogeneous data set, and there are no fixed rules on what may or may not be included in a SITE entry. Only 13 of the 159 PDB files in our data set contain SITE records, less than 10%. These 13 structures contain 50 catalytic residues, as defined above, and 94 SITE residues. The overlap between these two groups contains 36 residues. We find therefore that in our data set 28% of catalytic residues are not found in the SITE records and only 38% of SITE residues are catalytic.
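The percentages quoted above follow directly from the counts given in the text. The short check below uses only those counts; the variable names are illustrative.

```python
# Counts as stated in the text
site_files, total_files = 13, 159        # PDB files with SITE records / all files
catalytic, site_residues, overlap = 50, 94, 36

# Share of files that carry SITE records (stated as "less than 10%")
pct_files_with_site = 100 * site_files / total_files

# Catalytic residues absent from the SITE records: (50 - 36) / 50
pct_catalytic_missing = 100 * (catalytic - overlap) / catalytic

# SITE residues that are also catalytic: 36 / 94
pct_site_catalytic = 100 * overlap / site_residues
```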

The following parameters were derived for each residue (catalytic and non-catalytic) in all 159 proteins:

• Conservation. The sequence of each chain in the protein was used to initiate a PSI-BLAST search of the NCBI non-redundant database (NRDB) with an E-value cut-off of 10^-20 for inclusion in the next iteration. Each PSI-BLAST search was run to convergence or a maximum of 20 iterations. The final multiple alignment generated by PSI-BLAST was then scored for conservation and DOPS as described by Valdar et al.36

• Relative Solvent Accessibility (RSA). NACCESS37 was used with standard parameters to calculate the RSA of each residue.

• Secondary structure. DSSP55 was used to extract the secondary structure for each residue. The DSSP classification was simplified to three categories: helix, sheet or coil/other.

• Cleft. Surfnet56 was used to define in which, if any, cleft the residue lay. If a residue lay in two or more clefts only the largest was recorded.

• Depth. The depth of a residue within the protein structure is defined as the average minimum distance between each of its atoms and the closest solvent accessible atom in the structure. NACCESS was used to define solvent accessibility.
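The depth definition above can be sketched as follows. This is a minimal illustration, assuming atom coordinates are available as (x, y, z) tuples in Å and that the set of solvent accessible atoms has already been determined (NACCESS in the original work; here it is simply passed in). The function names are hypothetical. The second function applies the per-structure scaling described under "Encoding and generation of data sets".

```python
import math

def residue_depth(residue_atoms, accessible_atoms):
    """Average, over the residue's atoms, of the distance to the nearest
    solvent accessible atom in the structure.

    residue_atoms:    list of (x, y, z) coordinates for one residue
    accessible_atoms: list of (x, y, z) coordinates of all solvent
                      accessible atoms (assumed to be precomputed)
    """
    nearest = [min(math.dist(atom, acc) for acc in accessible_atoms)
               for atom in residue_atoms]
    return sum(nearest) / len(nearest)

def scale_depths(depths):
    """Scale depths within one structure so the deepest residue scores 1
    and residues lying at the surface (depth 0) score 0."""
    deepest = max(depths.values())
    if deepest == 0:
        return {res: 0.0 for res in depths}
    return {res: depth / deepest for res, depth in depths.items()}
```

An atom that is itself solvent accessible contributes a distance of zero, so a fully exposed residue has depth 0 under this sketch.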

Encoding and generation of data sets

Conservation, as calculated above, is already encoded as a suitably scaled factor between 0 and 1 (0 for no conservation and 1 for perfect conservation) and so is passed to the network as is. The RSA is a percentage and is scaled to between 0 and 1 before presentation to the network. Depth is scaled so that the deepest residue in each structure is scored 1 and surface residues 0.

The other parameters (residue type, secondary structure and cleft) are categorical in nature, and are encoded using 1-of-C encoding. Amino acid type is encoded as an array of 20 inputs where one input is set to 1 and the rest to 0. Secondary structure is encoded by three input parameters. Cleft size is divided into four categories: no cleft, largest cleft, second or third largest cleft and fourth to ninth largest cleft.

An example encoding is shown in Figure 14 for a serine residue with conservation 0.7, DOPS score 0.9, depth 0.3, RSA 15%, in a coil region and lying in the largest cleft.
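The encoding of the example residue can be sketched in Python. The order of the inputs, the category labels and the function names below are assumptions made for illustration; the actual layout used by the authors is the one given in Figure 14.

```python
# Category orderings are assumed, not taken from the paper
AMINO_ACIDS = tuple("ACDEFGHIKLMNPQRSTVWY")
SECONDARY = ("helix", "sheet", "coil")
CLEFT = ("none", "largest", "second_or_third", "fourth_to_ninth")

def one_of_c(value, categories):
    """1-of-C encoding: a list with 1.0 at the matching category, 0.0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories]

def encode_residue(aa, conservation, dops, depth, rsa_percent, secondary, cleft):
    """Build the input vector for one residue.

    conservation, dops and depth are assumed to be already scaled to [0, 1];
    RSA is supplied as a percentage and rescaled here.
    """
    continuous = [conservation, dops, depth, rsa_percent / 100.0]
    return (continuous
            + one_of_c(aa, AMINO_ACIDS)
            + one_of_c(secondary, SECONDARY)
            + one_of_c(cleft, CLEFT))

# The serine example from the text
vector = encode_residue("S", 0.7, 0.9, 0.3, 15.0, "coil", "largest")
```

Under these assumptions the vector has 4 continuous inputs plus 20 + 3 + 4 categorical inputs, of which exactly three are set to 1.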

Training the neural network

The neural network software used is FFNN,57 a feed-forward neural network trained using a scaled conjugate gradients algorithm. A single-layer architecture is used in all cases. In order to measure the performance of the network accurately, it is trained using a ten-fold cross-validation experiment. The dataset is divided into ten equal subgroups, and then in each training run nine of the groups are used for training, whilst the network is tested on the single remaining group. The network is run ten times using a different subgroup as the test group each time. Here the dataset was divided by structure rather than residue, so each subgroup contains the data for approximately 16 structures. The ratio of catalytic to non-catalytic residues is approximately 1:60 in the training set. Presenting the data in this ratio causes the net to predict every residue as non-catalytic. The best balanced training set was found to have a ratio of 1:6. Each training group is balanced by discarding a random selection of the non-catalytic residues prior to training. Training was for 100 epochs; in every case the network converged to a stable error level before training was terminated. The number of training epochs was not optimised, and in particular the performance of the test set was not used to optimise the stopping point in any way.
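The data handling described above (splitting by structure, then balancing each training set to 1:6) can be sketched as follows. This is not the FFNN software and does not train a network; it only illustrates the partitioning. All function names, the input layout and the use of a fixed random seed are assumptions.

```python
import random

def structure_folds(structure_ids, k=10, seed=0):
    """Shuffle the structure ids and deal them into k roughly equal folds."""
    ids = sorted(structure_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]

def balance(examples, ratio=6, seed=0):
    """Keep every catalytic example and a random sample of at most
    `ratio` non-catalytic examples per catalytic one."""
    rng = random.Random(seed)
    positives = [e for e in examples if e[1]]
    negatives = [e for e in examples if not e[1]]
    keep = min(len(negatives), ratio * len(positives))
    return positives + rng.sample(negatives, keep)

def cross_validation_splits(data, k=10, ratio=6, seed=0):
    """Yield (balanced_training_set, unbalanced_test_set) for each fold.

    data: dict mapping structure id -> list of (features, is_catalytic)
    """
    folds = structure_folds(data, k, seed)
    for i, test_ids in enumerate(folds):
        train = [example
                 for j, fold in enumerate(folds) if j != i
                 for sid in fold
                 for example in data[sid]]
        test = [example for sid in test_ids for example in data[sid]]
        yield balance(train, ratio, seed + i), test
```

Only the training set is balanced in this sketch; the held-out structures keep their natural ratio, so test statistics reflect the imbalance seen in real proteins.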

Measuring performance

In order to judge the neural network learning process, a suitable measure of performance is required. Total error (percentage of incorrect predictions) is not sufficient due to the highly unbalanced nature of the dataset. All of the statistics are derived from the following quantities:

p = Number of correctly classified catalytic residues.

n = Number of correctly classified non-catalytic residues.

o = Number of non-catalytic residues incorrectly predicted to be catalytic (over-predictions).

u = Number of catalytic residues incorrectly predicted to be non-catalytic (under-predictions).

t = Total residues (p + n + o + u).

The total error (QTotal) is given by equation (2):

$$Q_{\mathrm{Total}} = \frac{p + n}{t} \times 100 \qquad (2)$$

To complement this, two other measures of performance are used: QPredicted measures the percentage of catalytic predictions that are correct and QObserved measures the percentage of catalytic residues that are correctly predicted. The formulae for these two parameters are shown in equations (3) and (4):

$$Q_{\mathrm{Predicted}} = \frac{p}{p + o} \times 100 \qquad (3)$$

$$Q_{\mathrm{Observed}} = \frac{p}{p + u} \times 100 \qquad (4)$$

A measure of performance that takes both these factors into account is the MCC. The formula for calculating MCC is shown in equation (5):

$$\mathrm{MCC} = \frac{pn - ou}{\sqrt{(p + o)(p + u)(n + o)(n + u)}} \qquad (5)$$
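The four measures can be computed directly from p, n, o and u. The sketch below follows equations (2) to (5); the function name and the handling of zero denominators are assumptions, and the counts in the example are invented for illustration rather than taken from the results tables.

```python
import math

def performance(p, n, o, u):
    """Return (q_total, q_predicted, q_observed, mcc) from the four counts.

    p: correctly classified catalytic residues
    n: correctly classified non-catalytic residues
    o: over-predictions, u: under-predictions
    """
    t = p + n + o + u
    q_total = 100.0 * (p + n) / t
    # Guard against empty denominators (an assumption; not specified in the text)
    q_predicted = 100.0 * p / (p + o) if (p + o) else 0.0
    q_observed = 100.0 * p / (p + u) if (p + u) else 0.0
    denominator = math.sqrt((p + o) * (p + u) * (n + o) * (n + u))
    mcc = (p * n - o * u) / denominator if denominator else 0.0
    return q_total, q_predicted, q_observed, mcc

# Invented counts: 100 catalytic residues among 10,000, 56 of them found,
# and one in seven catalytic predictions correct
q_total, q_predicted, q_observed, mcc = performance(p=56, n=9564, o=336, u=44)
```

With these invented counts QTotal is above 96% even though six of every seven catalytic predictions are wrong, which illustrates why QTotal alone is not a sufficient measure on such an unbalanced data set.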

Ranking and clustering

The residues in each structure are ranked by network score, and all residues scoring above a cut-off value are used in the clustering algorithm. A pair of residues are clustered together if any of their atoms lie within 4 Å of each other. Each cluster is then defined as a sphere with its centre at the geometric centroid of all the Cβ atoms of the component residues (Cα for glycine) and a radius such that all the Cβ atoms lie within the sphere. The first ranking cut-off was set at 35% of the highest scoring residue. If any sphere in a structure had a radius greater than 15 Å, the clustering was repeated, increasing the ranking cut-off by 1% until no sphere was greater than 15 Å in radius. Single residue clusters were discarded at this stage.
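The ranking and clustering procedure can be sketched as below. This is an illustrative reading of the description, not the authors' code: the function names and data layout are hypothetical, the 1% increment is interpreted as one percentage point of the top score, and a union-find structure is used as one convenient way to implement single linkage.

```python
import math
from itertools import combinations

def single_linkage(residues, link_cutoff=4.0):
    """Group residues whose closest atoms are less than link_cutoff apart.

    residues: dict mapping residue name -> list of (x, y, z) atom coordinates.
    Returns clusters of two or more residues (singletons are discarded).
    """
    parent = {name: name for name in residues}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(residues, 2):
        closest = min(math.dist(p, q) for p in residues[a] for q in residues[b])
        if closest < link_cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for name in residues:
        clusters.setdefault(find(name), []).append(name)
    return [sorted(members) for members in clusters.values() if len(members) > 1]

def bounding_sphere(points):
    """Centroid of the points and the radius that encloses all of them."""
    centre = tuple(sum(p[i] for p in points) / len(points) for i in range(3))
    return centre, max(math.dist(centre, p) for p in points)

def cluster_structure(scores, atoms, cb, link_cutoff=4.0, start_pct=35, max_radius=15.0):
    """Raise the score cut-off from start_pct% of the top score, one point at
    a time, until no cluster sphere exceeds max_radius.

    scores: residue -> network score; atoms: residue -> atom coordinates;
    cb: residue -> C-beta coordinate (C-alpha for glycine).
    """
    top = max(scores.values())
    for pct in range(start_pct, 101):
        threshold = top * pct / 100.0
        selected = {r: atoms[r] for r, s in scores.items() if s >= threshold}
        clusters = single_linkage(selected, link_cutoff)
        spheres = [bounding_sphere([cb[r] for r in c]) for c in clusters]
        if all(radius <= max_radius for _, radius in spheres):
            return list(zip(clusters, spheres))
    return []
```

Passing `link_cutoff=5.0` reproduces the kind of adjustment mentioned for the arabinanase example, where a looser linkage distance lets a nearby high-scoring residue join the main cluster.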

The definition of the known sites is the same. Spheres were defined for each active site with centres at the centroid of the Cβ atoms and radii such that all the Cβ atoms are within the sphere. Proteins with single catalytic residues were set a radius of 3 Å.

Acknowledgements

Thanks to Dr Adrian Shepherd at UCL and Dr Craig Porter at the EBI for useful discussions on neural networks, clustering algorithms and other topics. Thanks to the Medical Research Council for financial support. G.J.B. was supported by a BBSRC CASE studentship in association with Roche Products Ltd.

References

Figure 14. Example of the neural network input encoding.

1. Burge, C. & Karlin, S. (1997). Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94.

2. Birney, E. & Durbin, R. (2000). Using genewise in theDrosophila annotation experiment. Genome Res. 10,547–548.

3. Burley, S. K., Almo, S. C., Bonanno, J. B., Capel, M.,Chance, M. R., Gaasterland, T. et al. (1999). Structuralgenomics: beyond the human genome project. NatureGenet. 23, 151–157.

4. Shapiro, L. & Harris, T. (2000). Finding functionthrough structural genomics. Curr. Opin. Biotechnol.11, 31–35.

5. Altschul, S. F., Madden, T. L., Schaer, A. A., Zhang, J.,Zhang, Z., Miller, W. & Lipman, D. J. (1997). Gappedblast and psi-blast: a new generation of protein data-base search programs. Nucl. Acids Res. 25, 3389–3402.

6. Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwil-ler, L., Eddy, S. R. et al. (2002). The pfam proteinfamilies database. Nucl. Acids Res. 30, 276–280.

7. Karp, P. D. (1998). What we do not know aboutsequence analysis and sequence databases. Bioinfor-matics, 14, 753–754.

8. Devos, D. & Valencia, A. (2000). Practical limits offunction prediction. Proteins: Struct. Funct. Genet. 41,98–107.

9. Wilson, C. A., Kreychman, J. & Gerstein, M. (2000).Assessing annotation transfer for genomics: quanti-fying the relations between protein sequence, struc-ture and function through traditional andprobabilistic scores. J. Mol. Biol. 297, 233–249.

10. Hegyi, H. & Gerstein, M. (1999). The relationshipbetween protein structure and function: a compre-hensive survey with application to the yeast genome.J. Mol. Biol. 288, 147–164.

11. Orengo, C. A., Todd, A. E. & Thornton, J. M. (1999).From protein structure to function. Curr. Opin. Struct.Biol. 9, 374–382.

12. Thornton, J. M., Orengo, C. A., Todd, A. E. & Pearl,F. M. (1999). Protein folds, functions and evolution.J. Mol. Biol. 293, 333–342.

13. Chothia, C. (1992). Proteins. One thousand familiesfor the molecular biologist. Nature, 357, 543–544.

14. Bairoch, A. (1993). The data bank. Nucl. Acids Res. 21,3155–3156.

15. Nagano, N., Porter, C. T. & Thornton, J. M. (2001).The (beta/alpha)(8) glycosidases: sequence andstructure analyses suggest distant evolutionaryrelationships. Protein Eng. 14, 845–855.

16. Morrison, K. L. & Weiss, G. A. (20001). Combinator-ial alanine-scanning. Curr. Opin. Chem. Biol. 5,203–207.

17. Zhou, X. & Toney, M. D. (1999). pH studies on themechanism of the pyridoxal phosphate-dependentdialkylglycine decarboxylase. Biochemistry, 38,311–320.

18. Aktories, K. (1997). Identification of the catalytic siteof clostridial ADP-ribosyltransferases. Advan. Expt.Med. Biol. 419, 53–60.

19. Wallace, A. C., Borkakoti, N. & Thornton, J. M.(1997). Tess: a geometric hashing algorithm for deriv-ing 3D coordinate templates for searching structuraldatabases. Application to enzyme active sites. ProteinSci. 6, 2308–2323.

20. Fetrow, J. S., Godzik, A. & Skolnick, J. (1998). Func-tional analysis of the Escherichia coli genome usingthe sequence-to-structure-to-function paradigm:identification of proteins exhibiting the glutare-doxin/thioredoxin disulphide oxidereductaseactivity. J. Mol. Biol. 282, 703–711.

21. Kleywegt, G. J. (1999). Recognition of spatial motifsin protein structures. J. Mol. Biol. 285, 1887–1897.

22. Oldfield, T. J. (2002). Data mining the protein databank: residue interactions. Proteins: Struct. Funct.Genet. 49, 510–528.

23. Schmitt, S., Kuhn, D. & Klebe, G. (2002). Newmethod to detect related function among proteinsindependent of sequence and fold homology. J. Mol.Biol. 323, 387.

24. Lichtarge, O., Bourne, H. R. & Cohen, F. E. (1996). Anevolutionary trace method defines binding surfacescommon to protein families. J. Mol. Biol. 257,342–358.

25. Madabushi, S., Yao, H., Marsh, M., Kristensen, D. M.,Philippi, A., Sowa, M. E. & Lichtarge, O. (2002).Structural clusters of evolutionary trace residues arestatistically significant and common in proteins.J. Mol. Biol. 316, 139–154.

26. Aloy, P., Querol, E., Aviles, F. X. & Sternberg, M. J.(2001). Automated structure-based prediction offunctional sites in proteins: applications to assessingthe validity of inheriting protein function from hom-ology in genome annotation and to protein docking.J. Mol. Biol. 311, 395–408.

27. Landgraf, R., Xenarios, I. & Eisenberg, D. (2001).Three-dimensional cluster analysis identifies inter-faces and functional residue clusters in proteins.J. Mol. Biol. 307, 1487–1502.

28. Pupko, T., Bell, R. E., Mayrose, I., Glaser, F. & Ben-Tal, N. (2000). Rate4site: an algorithmic tool for theidentification of functional regions in proteins bysurface mapping of evolutionary determinantswithin their homologues. Bioinformatics, 18, S71–S77.

29. Armon, A., Graur, D. & Ben-Tal, N. (2001). Consurf:an algorithmic tool for the identification of functionalregions in proteins by surface mapping of phyloge-netic information. J. Mol. Biol., 447–463.

30. Lichtarge, O., Bourne, H. R. & Cohen, F. E. (1999).Evolutionarily conserved g-alphabetagamma bind-ing surfaces support a model of the protein-receptorcomplex. Proc. Natl Acad. Sci. USA, 93, 7507–7511.

31. Sowa, M. E., He, W., Wensel, T. G. & Lichtarge, O.(2000). Regulator of protein signalling interactionsurface linked to effector specificity. Proc. Natl Acad.Sci. USA, 97, 1483–1488.

32. Elcock, A. H. (2001). Prediction of functionallyimportant residues based solely on the computedenergetics of protein structure. J. Mol. Biol., 885–896.

33. Ondrechen, M. J., Clifton, J. G. & Ringe, D. (2001).Thematics: a simple computational predictor ofenzyme function from structure. Proc. Natl Acad. Sci.USA, 98, 12473–12478.

34. Bartlett, G. J., Porter, C. T., Borkakoti, N. & Thornton, J. M. (2002). Analysis of catalytic residues in enzyme active sites. J. Mol. Biol. 324, 105–121.

35. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissing, H. et al. (2000). The protein data bank. Nucl. Acids Res. 28, 235–242.

36. Valdar, W. S. (2002). Scoring residue conservation. Proteins: Struct. Funct. Genet. 48, 227–241.

37. Hubbard, S. J. & Thornton, J. M. (1993). “NACCESS”, Computer Program, Department of Biochemistry and Molecular Biology, University College, London.

38. Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577–2637.

39. Laskowski, R. A. (1995). A program for visualizing molecular surfaces, cavities, and intermolecular interactions. J. Mol. Graph. 13, 323–330, 307–308.

40. Shepherd, A. J., Gorse, D. & Thornton, J. M. (1999). Prediction of the location and type of beta-turns in proteins using neural networks. Protein Sci. 8, 1045–1055.

41. DeLano, W. L. (2002). The PyMol User’s Manual, DeLano Scientific, San Carlos, CA.

42. Min, J., Zhang, X., Cheng, X., Grewal, S. I. & Xu, R. M. (2002). Structure of the domain histone lysine methyltransferase clr4. Nature Struct. Biol. 9, 828–832.

43. Jacobs, S. A., Harp, J. M., Devarakonda, S., Kim, Y., Rastinejad, F. & Khorasanizadeh, S. (2002). The active site of the domain is constructed on a knot. Nature Struct. Biol. 9, 833–838.

44. Wilson, J., Jing, C., Walker, P., Martin, S., Howell, S., Blackburn, G. et al. (2002). Crystal structure and functional analysis of the histone methyltransferase set7/9. Cell, 111, 105.

45. Zhang, X., Tamaru, H., Khan, S., Horton, J., Keefe, L., Selker, E. & Cheng, X. (2002). Structure of the Neurospora domain protein-5, a histone h3 lysine methyltransferase. Cell, 111, 117.

46. Trievel, R., Beach, B., Dirk, L., Houtz, R. & Hurley, J. (2002). Structure and catalytic mechanism of a domain protein methyltransferase. Cell, 111, 91.

47. Xiao, B., Jing, C., Wilson, J. R., Walker, P. A., Vasisht, N., Kelly, G. et al. (2003). Structure and catalytic mechanism of the human histone methyltransferase set 7/9. Nature, 652–656.

48. Rea, S., Eisenhaber, F., O’Carroll, D., Strahl, B. D., Sun, Z. W., Schmid, M. et al. (2000). Regulation of chromatin structure by site-specific histone h3 methyltransferases. Nature, 406, 593–599.

49. Nakayama, J., Rice, J. C., Strahl, B. D., Allis, C. D. & Grewal, S. I. (2001). Role of histone h3 lysine 9 methylation in epigenetic control of heterochromatin assembly. Science, 292, 110–112.

50. Roey, V. P., Meehan, L., Kowalski, J. C., Belfort, M. & Derbyshire, V. (2002). Catalytic domain structure and hypothesis for function of intron endonuclease-tevi. Nature Struct. Biol. 9, 806–811.

51. Derbyshire, V., Kowalski, J. C., Dansereau, J. T., Hauer, C. R. & Belfort, M. (1997). Two-domain structure of the td intron-encoded endonuclease-tevi correlates with the two-domain configuration of the homing site. J. Mol. Biol. 265, 494–506.

52. Nurizzo, D., Turkenburg, J. P., Charnock, S. J., Roberts, S. M., Dodson, E. J., Mckie, V. A. et al. (2002). Cellvibrio japonicus alpha-arabinanase 43a has a novel five-blade beta-propeller fold. Nature Struct. Biol. 9, 665–668.

53. Davies, G., Sinnott, M. L. & Withers, S. G. (1997). Comprehensive Biological Catalysis, Academic Press, London.

54. Berger-Bachi, B., Barberis-Maino, L., Strassle, A. & Kayser, F. H. (1989). FemA, a host-mediated factor essential for methicillin resistance in Staphylococcus aureus: molecular cloning and characterization. Mol. Gen. Genet. 219, 263–269.

55. Lovgren, J. M. & Wikstrom, P. M. (2001). The rlmB gene is essential for formation of gm2251 in 23 S rRNA but not for ribosome maturation in Escherichia coli. J. Bacteriol. 183, 6957–6960.

56. Michel, G., Sauve, V., Larocque, R., Li, Y., Matte, A. & Cygler, M. (2002). The structure of the rlmB 23 S rRNA methyltransferase reveals a new methyltransferase fold with a unique knot. Structure (Camb), 10, 1303–1315.

57. Gustafsson, C., Reid, R., Greene, P. J. & Santi, D. V. (1996). Identification of new modifying enzymes by iterative genome search using known modifying enzymes as probes. Nucl. Acids Res. 24, 3756–3762.

58. Persson, B. C., Jager, G. & Gustafsson, C. (1997). The spou gene of Escherichia coli, the fourth gene of the spot operon, is essential for tRNA (gm 18) 2′-methyltransferase activity. Nucl. Acids Res. 25, 4093–4097.

Edited by B. Honig

(Received 27 November 2002; received in revised form 26 February 2003; accepted 2 April 2003)
