A new principle for macromolecular structure determinationis.tuebingen.mpg.de/fileadmin/user_upload/files/publications/maxent… · Abstract. Protein NMR spectroscopy is a modern

A new principle for macromolecular structure determination

Michael Habeck∗, Wolfgang Rieping∗ and Michael Nilges∗

∗Institut Pasteur, 25-28, rue du Dr. Roux, 75015 Paris, France

Abstract. Protein NMR spectroscopy is a modern experimental technique for elucidating the three-dimensional structure of biological macromolecules in solution. From the data-analytical point of view, structure determination has always been considered an optimisation problem: much effort has been spent on the development of minimisation strategies; the underlying rationale, however, has not been revised. Conceptual difficulties with this approach arise because experiments provide only incomplete structural information: structure determination is an inference problem and demands a probabilistic treatment. In order to generate realistic conformations, strong prior assumptions about physical interactions are indispensable. These interactions impose a complex structure on the posterior distribution, making simulation of such models particularly difficult. We demonstrate that posterior sampling is feasible using a combination of multiple Markov Chain Monte Carlo techniques. We apply the methodology to a sparse data set obtained from a perdeuterated sample of the Fyn SH3 domain.

INTRODUCTION

Biological macromolecules, such as proteins or DNA, adopt a unique thermodynamically stable conformation in solution. Nuclear Magnetic Resonance (NMR) spectroscopy enables one to obtain the native conformation under physiological conditions with atomic resolution. Nuclear Overhauser Enhancements (NOE) are the primary source of structural information. The NOE spectroscopy (NOESY) experiment [1] measures dipolar relaxation rates. These depend on the distance $d_{ij}$ between two protons $i$ and $j$ that establish magnetisation transfer during dipolar relaxation. Each resonance peak occurs at specific frequencies. One can therefore assign NOE cross-peaks to pairs of protons. The widely used Isolated Spin Pair Approximation (ISPA) relates the size $V_{ij}$ of the resonance peak to the inverse sixth power of the inter-atomic distance [2]:

$$V_{ij} \propto d_{ij}^{-6}. \qquad (1)$$

As NOE measurements only provide distance information for hydrogen atoms separated by less than 5 Å, structure determination from NMR data is underdetermined: data sets are sparse and not sufficient in their own right to determine a structure uniquely. In order to define a physically reasonable conformation, experimental data need to be complemented with physical information about the covalent and non-covalent structure of the macromolecule.
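To make relation (1) concrete, the following minimal sketch inverts the ISPA to turn peak sizes into distances. The function names, the numbers, and the idea of fixing the proportionality constant from one proton pair of known separation are illustrative assumptions, not part of the method described here.

```python
import numpy as np

def calibrate_gamma(reference_volume, reference_distance):
    """Fix the ISPA proportionality constant from one pair of known separation.

    Assumption for illustration: V = gamma * d**-6 holds exactly for that pair.
    """
    return reference_volume * reference_distance ** 6

def ispa_distances(volumes, gamma):
    """Invert V = gamma * d**-6 to obtain distances d = (gamma / V)**(1/6)."""
    volumes = np.asarray(volumes, dtype=float)
    return (gamma / volumes) ** (1.0 / 6.0)

# Hypothetical peak sizes in arbitrary units; the reference pair is 2.5 Å apart.
gamma = calibrate_gamma(100.0, 2.5)
distances = ispa_distances([100.0, 10.0, 1.5625], gamma)

# A peak 64 times weaker than the reference corresponds to twice the distance,
# which is why NOEs fade quickly towards the 5 Å limit.
observable = distances < 5.0 + 1e-9
```

Because of the sixth-power dependence, even large relative errors in a peak size translate into modest errors in the derived distance.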

NMR STRUCTURE CALCULATION BY MINIMISATION

The first methods for structure calculation from NMR data introduced the concept of “geometrical consistency” [3, 4]: data are not interpreted directly but converted into “structural constraints” using a relation like the ISPA. This theory, however, does not account for systematic effects such as internal dynamics or experimental and processing errors, yet these affect measured peak sizes. Therefore, common practice is to interpret NOE measurements conservatively as providing only distance ranges. Geometrically consistent structures are defined as exhibiting distances within the allowed ranges. Structure determination boils down to the following task:

“Find a physically realistic structure that matches the distance constraints.”

Algorithmically, a “pseudo energy” is introduced consisting of the potential energy of the system and a term that quantifies how well a structure fits the set of distance constraints:

$$E_{\mathrm{pseudo}} = E_{\mathrm{phys}} + k\,E_{\mathrm{data}}.$$

$E_{\mathrm{data}}$ penalises constraint violations; $E_{\mathrm{phys}}$ is a physical potential energy serving as a regulariser that compensates for the sparseness of the data. In order to weight the data with respect to the potential energy, an additional free parameter $k$ needs to be introduced. The global minimum of the pseudo energy is considered to be the native structure. Non-linear optimisation algorithms like Molecular Dynamics based Simulated Annealing [5] are able to locate the optimal conformation.
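As an illustration of such a hybrid objective, the sketch below combines a given physical energy with a restraint term. The flat-bottom quadratic penalty is only one common choice and is an assumption here; the text leaves the functional form of the data term open. All names and numbers are hypothetical.

```python
import numpy as np

def restraint_energy(distances, lower, upper):
    """Flat-bottom penalty: zero inside [lower, upper], quadratic outside.

    The quadratic form is an assumed example, not a prescribed choice.
    """
    d = np.asarray(distances, dtype=float)
    below = np.clip(np.asarray(lower, dtype=float) - d, 0.0, None)
    above = np.clip(d - np.asarray(upper, dtype=float), 0.0, None)
    return float(np.sum(below ** 2 + above ** 2))

def pseudo_energy(e_phys, distances, lower, upper, k):
    """E_pseudo = E_phys + k * E_data for a fixed, user-chosen data weight k."""
    return e_phys + k * restraint_energy(distances, lower, upper)

# Two hypothetical restraints with allowed range 1.8 to 5.0 Å; the second
# distance violates its upper bound by 1 Å.
e_data = restraint_energy([3.0, 6.0], [1.8, 1.8], [5.0, 5.0])
e_total = pseudo_energy(-10.0, [3.0, 6.0], [1.8, 1.8], [5.0, 5.0], k=2.0)
```

Changing `k` changes which structure minimises `pseudo_energy`, so the weight cannot be treated as a harmless technical detail.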

CONCEPTUAL PROBLEMS

Optimisation has been used since the early days of NMR structure calculation. Much effort has been spent on improving algorithms; the very rationale, however, has not been revised. The approach lacks a principle for formulating and interpreting the objective function. It has to rely on heuristics and is in this respect qualitative and indeterminate.

The precise meaning of the restraint energy is unclear. Though measured in the same units, it does not have the same fundamental status as a physical energy and is merely introduced for data analysis purposes. No guideline for determining its functional form exists. Moreover, auxiliary parameters, like the data weight $k$, need to be introduced. As a matter of principle, such parameters are not measurable. They have to be chosen heuristically as no general rule for their determination exists.

The minimisation approach itself is inappropriate when multiple structures are compatible with the data: optimisation algorithms aim at finding a single global minimum; sub-optimal, yet equally important conformations may be missed. Another issue is that optimisation offers no way to judge the reliability of a calculated structure, to model experimental errors, or to assess their influence on the result.

These flaws originate in an inherently inadequate formulation of the structure determination problem. Optimisation methods cannot deal with incomplete information; they have to rely on heuristics to compensate for this deficiency.

INFERENTIAL STRUCTURE DETERMINATION

Structure determination has never been perceived as an inference problem. Imperfect or incomplete data render the assumption of a single, “true” conformation of a macromolecule meaningless. Yet, the optimisation approach attempts to calculate that conformation. A meaningful question, however, is to what extent the available information determines the molecular conformation. Therefore, we give up the viewpoint “native structure = minimum energy conformation” and address the more general question:

“How plausible is a conformation given data and relevant background information?”

The posterior probability $P(\text{structure} \mid \text{NMR spectra}, \text{physics}, \ldots)$ is the quantitative answer. We propose to solve any structure determination problem by deriving and also simulating this probability [6]. This principle is not limited to NMR data but likewise applies to other experimental techniques, homology modelling and ab initio structure prediction. In order to take full advantage of the Bayesian formulation, it is necessary to employ posterior sampling techniques. Point estimates like the Maximum Posterior Approximation or Maximum Likelihood are convenient and easy to implement but tend to sacrifice the generality of the Bayesian solution.

PROBABILISTIC MODELLING OF NMR DATA

Strong prior information is indispensable for inferring macromolecular structures since distance information can only be measured for a small subset of all atoms. Given its amino acid sequence, the covalent geometry of the protein is known. As an approximation, we consider all covalent forces as infinitely strong, i.e. fix covalent parameters such as bond lengths, bond angles and the planarity of certain groups of atoms. The remaining degrees of freedom are dihedral angles $\theta = \{\theta_i\}$ describing rotations about covalent bonds. Using this approximation, the number of degrees of freedom reduces by an order of magnitude. Interactions with the solvent, though important for proteins to fold, have little influence on the overall quality of NMR structures and are furthermore expensive to calculate. We neglect solvent effects and only consider van der Waals interactions between all pairs of atoms using an approximation of the Lennard-Jones potential [7]:

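$$E(\theta) = \sum_{i<j} E_{ij}\bigl(d_{ij}(\theta)\bigr), \qquad E_{ij}\bigl(d_{ij}(\theta)\bigr) = \begin{cases} \tfrac{1}{2}\,k_{ij}\,\bigl(d_{ij} - d_{ij}(\theta)\bigr)^{4}, & d_{ij}(\theta) < d_{ij} \\ 0, & d_{ij}(\theta) > d_{ij} \end{cases}$$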

If the macromolecular system is at temperature $\beta^{-1}$, the prior density expressing the above-mentioned background information $I$ is the Boltzmann distribution [8]

$$p(\theta \mid I) = \frac{1}{Z(\beta)} \exp\bigl(-\beta E(\theta)\bigr). \qquad (2)$$

Numerical values for the parameters $d_{ij}$ (the distances below which two atoms repel each other) and $k_{ij}$ are taken from the PROLSQ [7] force field.
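The quartic repulsion and the Boltzmann prior (2) can be sketched as follows. For simplicity the sketch works directly on Cartesian coordinates; the mapping from torsion angles $\theta$ to coordinates is omitted, and the array names `d_min` and `k_rep` (standing for the parameters $d_{ij}$ and $k_{ij}$) as well as all numbers are hypothetical rather than PROLSQ values.

```python
import numpy as np

def repulsive_energy(coords, d_min, k_rep):
    """Sum over pairs i<j of 0.5 * k_ij * (d_min_ij - d_ij)**4 where d_ij < d_min_ij."""
    coords = np.asarray(coords, dtype=float)
    i, j = np.triu_indices(len(coords), k=1)         # all pairs with i < j
    d = np.linalg.norm(coords[i] - coords[j], axis=1)
    overlap = np.clip(d_min[i, j] - d, 0.0, None)    # zero for non-overlapping pairs
    return float(np.sum(0.5 * k_rep[i, j] * overlap ** 4))

def log_prior(coords, d_min, k_rep, beta):
    """Logarithm of the Boltzmann prior up to the constant -log Z(beta)."""
    return -beta * repulsive_energy(coords, d_min, k_rep)

# Three hypothetical atoms on a line; only the first pair overlaps (by 1 Å).
coords = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [10.0, 0.0, 0.0]]
d_min = np.full((3, 3), 2.0)
k_rep = np.ones((3, 3))
energy = repulsive_energy(coords, d_min, k_rep)
```

The energy is purely repulsive and vanishes for any conformation without atomic overlaps, so the prior is flat over all clash-free conformations.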

According to the ISPA (1), the size $V_i$ of the $i$-th resonance is proportional to the cross-relaxation rate, which is itself proportional to the inverse sixth power of the distance $d_i(\theta) = \|a_i - b_i\|$ between two protons with coordinates $a_i$ and $b_i$, respectively. The proportionality constant $\gamma$ is unknown. A NOESY spectrum consisting of $n$ assigned resonances results in a set of peak sizes $D = \{V_1, \ldots, V_n\}$. We use a lognormal distribution as likelihood for observing a resonance of size $V_i$:

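$$p(V_i \mid \theta, \gamma, \sigma, I) = \frac{1}{\sqrt{2\pi\sigma^2}\,V_i} \exp\left(-\frac{1}{2\sigma^2} \log^2\bigl(V_i\, d_i^6(\theta)/\gamma\bigr)\right). \qquad (3)$$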

A single parameter $\sigma$ describes the discrepancy between theoretical and observed peak sizes. It accounts for experimental and processing errors as well as shortcomings of the ISPA.
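For reference, the logarithm of the likelihood (3), summed over all peaks, can be written in a few lines. This is a minimal sketch; the function name and the test values are illustrative, and the distances $d_i(\theta)$ are assumed to be precomputed from the conformation.

```python
import numpy as np

def log_likelihood(volumes, distances, gamma, sigma):
    """Sum of lognormal log-densities of Eq. (3) over all assigned peaks."""
    V = np.asarray(volumes, dtype=float)
    d = np.asarray(distances, dtype=float)
    resid = np.log(V * d ** 6 / gamma)     # zero when V equals gamma * d**-6
    return float(np.sum(-0.5 * np.log(2.0 * np.pi * sigma ** 2)
                        - np.log(V)
                        - resid ** 2 / (2.0 * sigma ** 2)))

# A peak that matches the ISPA prediction exactly versus one that is off
# by a factor of two (hypothetical numbers, gamma = 1, sigma = 1).
ll_exact = log_likelihood([1.0], [1.0], gamma=1.0, sigma=1.0)
ll_off = log_likelihood([1.0], [2.0 ** (1.0 / 6.0)], gamma=1.0, sigma=1.0)
```

Since the residual enters through a logarithm, over- and underestimating a peak size by the same factor is penalised equally.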

Modelling NOESY resonances requires the introduction of two nuisance parameters $\gamma$ and $\sigma$. Assuming that knowledge of $\theta$ does not bear any information on $\gamma$ and $\sigma$, and following Jeffreys’ recommendation, the joint prior is

$$p(\theta, \gamma, \sigma \mid I) = \frac{1}{\gamma\,\sigma}\,\frac{1}{Z(\beta)} \exp\bigl(-\beta E(\theta)\bigr). \qquad (4)$$

We are now in the position to make inferences on all unknown parameters. Optimisation methods lack a general principle for handling nuisance parameters; at best, they can resort to ad hoc methods like cross-validation [9].

POSTERIOR SAMPLING

According to Bayes’ theorem, the posterior distribution for all unknown parameters is

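$$p(\theta, \gamma, \sigma \mid D, I) \propto \sigma^{-(n+1)}\, \gamma^{-1} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} \log^2\bigl(V_i\, d_i^6(\theta)/\gamma\bigr) - \beta E(\theta)\right). \qquad (5)$$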

For typical proteins, the posterior is complex and needs to be analysed numerically. Markov Chain Monte Carlo [10] is an efficient method to explore high-dimensional probability distributions. We use Gibbs sampling [11] to simulate the joint posterior. Nuisance parameters are directly sampled from their conditional posterior distributions using standard random number generators.
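The conditional distributions of the nuisance parameters follow from (5) by holding the other variables fixed. Writing $x_i = \log(V_i d_i^6(\theta))$, one finds that $\log\gamma$ is normally distributed with mean $\bar{x}$ and variance $\sigma^2/n$, and that $\sigma^{-2}$ follows a Gamma distribution with shape $n/2$ and rate $\frac{1}{2}\sum_i (x_i - \log\gamma)^2$. The sketch below implements one Gibbs update under this derivation; the function name and the synthetic setup are assumptions for illustration, not the authors' code.

```python
import numpy as np

def sample_nuisance(volumes, distances, sigma, rng):
    """One Gibbs update of (gamma, sigma) at fixed conformation, based on Eq. (5)."""
    x = np.log(np.asarray(volumes, dtype=float)
               * np.asarray(distances, dtype=float) ** 6)
    n = x.size
    # log(gamma) | theta, sigma  ~  Normal(mean(x), sigma**2 / n)
    gamma = np.exp(rng.normal(x.mean(), sigma / np.sqrt(n)))
    # sigma**-2 | theta, gamma  ~  Gamma(shape = n/2, rate = chi2/2)
    chi2 = np.sum((x - np.log(gamma)) ** 2)
    precision = rng.gamma(shape=n / 2.0, scale=2.0 / chi2)
    return gamma, 1.0 / np.sqrt(precision)

# Usage with hypothetical data: distances held fixed, nuisance parameters updated.
rng = np.random.default_rng(0)
d = rng.uniform(2.0, 5.0, size=50)
V = 50.0 * d ** -6 * np.exp(0.3 * rng.normal(size=50))
gamma, sigma = sample_nuisance(V, d, sigma=1.0, rng=rng)
```

In a full simulation this update would alternate with an update of the torsion angles, so that the distances passed in change from one sweep to the next.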

Hybrid Monte Carlo

The conformational conditional posterior is characterised by (i) its high dimensionality (medium-sized proteins exhibit several hundred conformational degrees of freedom), (ii) its high degree of correlation (mostly due to the non-covalent interactions) and (iii) its multimodality. These properties lead to a complex topology of the conformational posterior which cannot be dealt with using standard simulation techniques.

Gradient-based methods are very efficient for searching the conformational space of macromolecules. We therefore use Hybrid Monte Carlo (HMC) [12] to update the torsion angles. The key idea of HMC is to combine standard Metropolis Monte Carlo [13] and Molecular Dynamics to deal with correlated variables and to produce non-local proposal states while maintaining high acceptance rates.
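A generic HMC update can be sketched as follows: momenta are drawn afresh, a short leapfrog trajectory produces a distant proposal, and a Metropolis test corrects for the discretisation error. The toy target (a correlated two-dimensional Gaussian), the unit mass matrix, the step size and the trajectory length are all assumptions chosen for illustration; they are not the settings used for torsion-angle dynamics.

```python
import numpy as np

def hmc_step(x, log_prob, grad_log_prob, rng, n_leapfrog=15, step=0.1):
    """One Hybrid Monte Carlo update with unit masses; returns (state, accepted)."""
    p0 = rng.normal(size=x.shape)                  # fresh Gaussian momenta
    x_new, p = x.copy(), p0.copy()
    p = p + 0.5 * step * grad_log_prob(x_new)      # initial half step in momentum
    for i in range(n_leapfrog):
        x_new = x_new + step * p                   # full step in position
        if i < n_leapfrog - 1:
            p = p + step * grad_log_prob(x_new)    # full step in momentum
    p = p + 0.5 * step * grad_log_prob(x_new)      # final half step in momentum
    h_old = -log_prob(x) + 0.5 * p0 @ p0           # total "energy" before
    h_new = -log_prob(x_new) + 0.5 * p @ p         # and after the trajectory
    if np.log(rng.uniform()) < h_old - h_new:      # Metropolis criterion
        return x_new, True
    return x, False

# Toy target standing in for the conformational posterior (an assumption).
PRECISION = np.linalg.inv(np.array([[1.0, 0.5], [0.5, 1.0]]))

def log_prob(x):
    return -0.5 * x @ PRECISION @ x

def grad_log_prob(x):
    return -PRECISION @ x
```

Because the proposal follows the gradient of the target, it can travel far along correlated directions while still being accepted most of the time.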

Replica exchange Monte Carlo

The conformational posterior exhibits narrow modes separated by extended regions of low probability. These modes correspond to side-chain rotamers, multiple loop conformations or even different folds that are similarly compatible with the data and the force field. Due to the jaggedness of the posterior, the Markov chain is likely to become trapped when using HMC. To address this problem we use replica exchange Monte Carlo [14]: in order to facilitate jumps between separated modes, the posterior is broadened by a transformation involving temperature-like parameters. This distribution is “cooled” down using a chain of distributions, ordered according to their “temperatures”. Each such heat-bath has its own copy of the parameters $\theta, \gamma, \sigma$ and is simulated independently. After a certain number of steps, samples are exchanged between neighbouring replicas. The Metropolis criterion is applied to decide whether the exchange shall be accepted. The rate of exchanges is determined by the overlap of the posterior distributions of neighbouring replicas. Therefore, a trade-off between the efficiency of the exchange scheme and the number of copies has to be made.

Unlike the situation in typical parameter estimation problems, the prior distribution here poses a greater problem than the likelihood. Computational problems become apparent when attempting to simulate the canonical distribution of a biopolymer [15]. Furthermore, the likelihood favours compact conformations. Global conformational changes require partial unfolding of the peptide chain since van der Waals repulsion hinders atoms from passing through each other. We therefore introduce two replica parameters and simulate the family of distributions:

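$$f(\theta, \gamma, \sigma;\, q, \lambda) \propto \bigl[p(D \mid \theta, \gamma, \sigma, I)\bigr]^{\lambda}\, \bigl[1 + (q-1)\,\beta E(\theta)\bigr]^{\frac{q}{1-q}}\, (\gamma\sigma)^{-1}. \qquad (6)$$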

The first factor is a weighted likelihood where $\lambda$ controls the importance of the data set: for $\lambda = 1$ the data are switched on, for $\lambda = 0$ they are switched off. The middle term is the Tsallis ensemble [16, 17]; it has useful algorithmic properties: for $q = 1$ the Tsallis ensemble is identical to the Boltzmann ensemble, for $q > 1$ high energy conformations are no longer suppressed exponentially, thus allowing for large van der Waals overlaps.
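The behaviour of the Tsallis factor is easy to check numerically. The sketch below evaluates its logarithm and treats $q = 1$ as the Boltzmann limit; the function name and the example energies are illustrative assumptions.

```python
import numpy as np

def log_tsallis_weight(energy, beta, q):
    """Logarithm of [1 + (q - 1) * beta * E]**(q / (1 - q)).

    Valid for 1 + (q - 1) * beta * E > 0, which always holds for q >= 1
    and non-negative energies. For q -> 1 the value tends to -beta * E.
    """
    if abs(q - 1.0) < 1e-12:
        return -beta * energy                      # Boltzmann limit
    return q / (1.0 - q) * np.log1p((q - 1.0) * beta * energy)

# A strongly overlapping conformation with (hypothetical) energy 100 at beta = 1:
log_w_boltzmann = log_tsallis_weight(100.0, 1.0, 1.0)   # equals -100
log_w_tsallis = log_tsallis_weight(100.0, 1.0, 1.1)     # much less negative
```

For $q > 1$ the weight decays only as a power of the energy, which is what lets atoms pass through each other in the high-"temperature" replicas.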

The target posterior distribution is $f(\cdot\,; q = 1, \lambda = 1)$; the “highest temperature” distribution $f(\cdot\,; q \to \infty, \lambda = 0)$ considers only the covalent geometry of the chain molecule (uniformly distributed torsion angles). If two neighbouring heat-baths have parameters $(q, \lambda)$ and $(q', \lambda')$, respectively, states are exchanged with acceptance probability

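$$\min\left\{1,\; \frac{f(\theta', \gamma', \sigma';\, q, \lambda)\; f(\theta, \gamma, \sigma;\, q', \lambda')}{f(\theta, \gamma, \sigma;\, q, \lambda)\; f(\theta', \gamma', \sigma';\, q', \lambda')}\right\}. \qquad (7)$$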
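In practice the acceptance probability (7) is evaluated on the log scale. The following sketch assumes a user-supplied function `log_f(state, q, lam)` returning the logarithm of the unnormalised density of a heat-bath; that interface, the function names and the toy density in the usage lines are hypothetical.

```python
import numpy as np

def swap_log_ratio(log_f, state_a, params_a, state_b, params_b):
    """Logarithm of the ratio in Eq. (7) for baths a = (q, lam) and b = (q', lam')."""
    return (log_f(state_b, *params_a) + log_f(state_a, *params_b)
            - log_f(state_a, *params_a) - log_f(state_b, *params_b))

def attempt_swap(log_f, state_a, params_a, state_b, params_b, rng):
    """Exchange the states of two neighbouring baths with probability (7)."""
    log_r = swap_log_ratio(log_f, state_a, params_a, state_b, params_b)
    if np.log(rng.uniform()) < log_r:
        return state_b, state_a, True              # states trade places
    return state_a, state_b, False

# Toy density for illustration only: lam scales a quadratic "misfit", q is unused.
def toy_log_f(x, q, lam):
    return -lam * x ** 2

# The colder bath (lam = 1) holds a poor state, the warmer one (lam = 0.1) a good one,
# so the ratio exceeds one and the exchange is always accepted.
log_r = swap_log_ratio(toy_log_f, 2.0, (1.0, 1.0), 0.0, (1.0, 0.1))
new_a, new_b, accepted = attempt_swap(toy_log_f, 2.0, (1.0, 1.0), 0.0, (1.0, 0.1),
                                      np.random.default_rng(0))
```

Normalisation constants of the individual baths cancel in the ratio, so only unnormalised densities are needed.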

FIGURE 1. Scheme of the replica exchange MC algorithm: in the first chain, a data-weight transforms the likelihood in order to gradually switch off the data. In the second chain, the Tsallis ensemble allows atoms to move freely. The “low-temperature” bath (λ = 1.0, q = 1.0) corresponds to the conformational posterior: the sampled conformations scatter around the native structure. At the split point (λ = 0.02, q = 1.0) structures are drawn from the Boltzmann ensemble. If both data and physical interactions are switched off (λ = 0.02, q = 1.1), posterior samples only fulfil the covalent structure. At the split point, torsion angles are restricted since van der Waals interactions prevent atoms from overlapping (inset). In the case of vanishing non-bonded interactions, torsion angles are drawn uniformly (“covalent geometry”).

We split the replica chain into two parts (see Fig. 1): in the first half, $q$ is fixed to one (i.e. the full prior is considered) and the data are slowly switched off by decreasing $\lambda$; in the second half, the data remain inactive and the Boltzmann distribution is gradually deformed into a flat distribution by increasing $q$.
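A two-part replica ladder of this kind could be set up as in the sketch below. The even split between the two halves and the linear spacing of $\lambda$ and $q$ are assumptions made for illustration; the text does not specify how the intermediate values are spaced.

```python
import numpy as np

def replica_schedule(n_replicas, lam_min, q_max):
    """Return a list of (q, lam) pairs, ordered from the target bath upwards.

    First half: q = 1 and lam decreases from 1 to lam_min.
    Second half: lam stays at lam_min and q increases from 1 towards q_max.
    Even split and linear spacing are illustrative assumptions.
    """
    n_first = n_replicas // 2
    n_second = n_replicas - n_first
    lams = np.linspace(1.0, lam_min, n_first)
    qs = np.linspace(1.0, q_max, n_second + 1)[1:]   # skip q = 1, already covered
    first = [(1.0, float(lam)) for lam in lams]
    second = [(float(q), float(lam_min)) for q in qs]
    return first + second

# Hypothetical ladder with 40 baths, lam down to 0.25 and q up to 1.1.
schedule = replica_schedule(40, lam_min=0.25, q_max=1.1)
```

The first entry is the target distribution; exchanges are attempted only between adjacent entries of the list.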

APPLICATIONS

We applied inferential structure determination to calculate the structure of the Fyn SH3 domain (59 amino acids, 921 atoms). The data set consists of 154 NOEs obtained from NOESY experiments on a perdeuterated sample [19]. A parallel replica simulation was performed using 40 copies for a target temperature of 300 K. The Tsallis parameter $q$ ranged from 1.0 to 1.1, the data weight $\lambda$ from 0.25 to 1.0. As a trade-off between computational costs and random walk behaviour, we used 250 MD steps per HMC step; every replica transition consisted of 30 HMC/Gibbs steps. In total, 20000 Fyn SH3 conformations were sampled from the posterior distribution (5). The simulation reached convergence after a burn-in of 10000 replica transitions (data not shown). Figure 2 shows the most probable structure, with a backbone heavy atom rmsd¹ of 1.5 Å to the crystal structure [20], which is significantly lower than the result reported in [19] (backbone heavy atom rmsd 2.9 Å).

FIGURE 2. MOLMOL [18] ribbon plots of the X-ray structure of the Fyn SH3 domain (left) and the most probable conformation (right), with a backbone heavy atom rmsd of 1.5 Å to the X-ray structure.

Conformational uncertainty

The assessment of conformational uncertainty of NMR structures is an ongoing issue [21, 22]. Most works attempt to construct algorithms for calculating “precision factors” based on NMR structure “ensembles”. However, no sound definition of structural uncertainty is offered, nor is it taken into account that standard NMR structure ensembles are generated via energy minimisation, which renders them inadequate for defining uncertainty at all. The marginal conformational posterior

$$p(\theta \mid D, I) = \int \mathrm{d}\gamma\, \mathrm{d}\sigma\; p(\theta, \gamma, \sigma \mid D, I), \qquad (8)$$

however, represents the idea of an “NMR structure ensemble” in a mathematically clean way. It expresses the complete spatial uncertainty of an NMR structure. It is independent of algorithmic properties and depends only on the knowledge at hand: experimental data and background information. If the conformational posterior is known, any hypothesis concerning structural properties can be tested. Figure 3 illustrates the conformational uncertainty of the Fyn SH3 NMR structure: 500 conformers, taken from the 68% confidence region, were superimposed and represented as a “sausage” plot. Only few NOEs involve the termini and the loop regions: the respective atom positions are subject to significant uncertainties. Analysis of the Fyn SH3 posterior also demonstrates that minimum energy structures tend to overestimate accuracy (Fig. 4): though the most probable conformation is close to the X-ray structure, the 68% confidence region of the rmsd distribution is shifted towards larger values; it ranges between 1.8 Å and 2.3 Å. Conventional NMR structure ensembles lack a statistical basis and furthermore depend on properties of the underlying minimisation strategy. The “precision” of these ensembles depends on the numerical values of nuisance parameters and can be tweaked at will by altering the algorithmic settings.

¹ Root mean square deviation: the standard measure for the distance between two molecular structures $A$ and $B$: $\mathrm{rmsd}^2(A, B) \equiv n^{-1} \sum_{i=1}^{n} \|a_i - b_i\|^2$.

FIGURE 3. MOLMOL plot of 500 superimposed conformers drawn from the 68% confidence region of the Fyn SH3 posterior distribution. The atom positions of both the termini (top) and the loop regions (bottom and right) show significant uncertainties.
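The rmsd of the footnote and an interval summarising its posterior distribution can be computed from samples as sketched below. The structures are assumed to be superimposed already, and the central (equal-tailed) interval is an assumption; the text does not state how the 68% region was constructed. Function names and numbers are illustrative.

```python
import numpy as np

def rmsd(a, b):
    """Root mean square deviation between two n x 3 coordinate arrays.

    Implements rmsd^2 = (1/n) * sum_i |a_i - b_i|^2; superposition of the
    two structures is assumed to have been done beforehand.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))

def central_interval(values, mass=0.68):
    """Equal-tailed interval containing the given posterior mass."""
    tail = 50.0 * (1.0 - mass)
    lower, upper = np.percentile(values, [tail, 100.0 - tail])
    return float(lower), float(upper)

# Two hypothetical two-atom structures shifted by 2 Å along z.
reference = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
shifted = [[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]]
deviation = rmsd(reference, shifted)
```

Applying `rmsd` to every posterior sample and passing the results to `central_interval` yields a range for the rmsd rather than a single number.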

FIGURE 4. The backbone heavy atom rmsd to the Fyn SH3 X-ray structure. Normalized histogram based on 10000 Fyn SH3 posterior samples (solid line). Energy minimised structures tend to overestimate accuracy: the most probable structure (Pmax, dashed line) is close to the X-ray structure; structural uncertainty, however, suggests significantly higher rmsd values: the 68% confidence interval spans from 1.8 Å to 2.3 Å (dotted line).

FIGURE 5. Cross-validatory vs. Bayesian treatment of nuisance parameters: the graph of the free R-factor suggests choosing $k$ from the “elbow region”; the marginal posterior of the inverse variance peaks around this region, thus naturally reproducing the results of a laborious cross-validation.

Treatment of nuisance parameters

Posterior samples can be used to estimate the marginal posterior distribution of any parameter. As a simple example consider the inverse variance of the likelihood, $\sigma^{-2}$. Its marginal posterior distribution is shown in Fig. 5, indicating that the data are insufficient to define a unique weight ($k = \sigma^{-2}$). Yet the optimisation approach requires a constant data weight. The most objective way of choosing $k$ is cross-validation [9]: we divided the data set into a “working” set (90 % of the data) and a “test” set (remaining 10 %). At varying $k$, the working set was used for structure determination, the test set for judging the choice of $k$. Plotting the R-factor (a $\chi^2$-like measure [9]) of the test set versus $k$ shows that the predictive power of the hybrid energy increases with increasing weight up to a final residual error. It seems reasonable to choose $k$ from the “elbow region”: the prediction error is minimal while the structure is least distorted by overfitting the data. The marginal posterior distribution of the inverse variance peaks around this region, thus naturally reproducing the results of a laborious cross-validation.

However, using cross-validation for the analysis of sparse NMR data is problematic, as the data set left for structure calculation becomes even sparser. Furthermore, the computational effort grows exponentially with the number of nuisance parameters. In the probabilistic treatment, the number of unknown parameters is not critical as long as the posterior distribution is still proper.
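Estimating a marginal posterior from samples amounts to ignoring all other parameters and histogramming the one of interest. The sketch below does this for the weight $k = \sigma^{-2}$; the function name, the bin count and the synthetic $\sigma$ values in the usage lines are assumptions for illustration only.

```python
import numpy as np

def marginal_weight_histogram(sigma_samples, bins=30):
    """Histogram estimate of the marginal posterior density of k = sigma**-2.

    Returns bin centres and density values that integrate to one.
    """
    k = np.asarray(sigma_samples, dtype=float) ** -2
    density, edges = np.histogram(k, bins=bins, density=True)
    centres = 0.5 * (edges[:-1] + edges[1:])
    return centres, density

# Synthetic stand-in for sigma values taken from a posterior simulation.
rng = np.random.default_rng(0)
sigma_samples = rng.uniform(0.4, 0.6, size=1000)
centres, density = marginal_weight_histogram(sigma_samples)
k_mode = centres[np.argmax(density)]               # most probable weight
```

The width of the histogram shows directly how poorly the data determine the weight, information that a single cross-validated value does not convey.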

CONCLUSIONS

Structure determination from NMR data is an inference problem and necessitates the application of probabilistic concepts. The conformational posterior distribution is the most complete description of our state of knowledge after experimentation. We propose to solve any structure determination problem probabilistically and show that a fully Bayesian estimation of the macromolecular structure is computationally feasible by Markov Chain Monte Carlo. Auxiliary parameters, such as weights or proportionality constants, are treated in conjunction with the macromolecular structure; there is no need to set them to values that might not conform to the data. The application of probabilistic principles allows one to clarify and revise heuristic concepts that have long been used in structure calculation: the correct definition of the “NMR structure ensemble” is the conformational posterior distribution, and precision factors are derived on its grounds; weights are interpreted as inverse variances of the likelihood. Optimisation methods are often misused for answering questions that are only meaningful in a probabilistic context. This gives the flawed impression that optimisation does similar things with much less effort than a truly Bayesian treatment.

ACKNOWLEDGMENTS

The work was supported by the EU grant QLG2-CT-2000-01313.

REFERENCES

1. Macura, S., and Ernst, R. R., Molecular Physics, 41, 95–117 (1980).
2. Neuhaus, D., and Williamson, M. P., The nuclear Overhauser effect in structural and conformational analysis, VCH Publishers Inc., New York, 1989.
3. Havel, T., Kuntz, I. D., and Crippen, G. M., Bull. Math. Biol., 45, 665–720 (1983).
4. Braun, W., Wider, G., Lee, K. H., and Wüthrich, K., J. Mol. Biol., 169, 921–948 (1983).
5. Nilges, M., Gronenborn, A. M., Brünger, A. T., and Clore, G. M., Protein Eng., 2, 27–38 (1988).
6. Rieping, W., Habeck, M., and Nilges, M., “Structure calculation from NMR data – a Bayesian view,” in NMR analysis of protein structure, edited by M. Sattler, M. Nilges, and H. Oschkinat, Springer-Verlag, Heidelberg, 2003, to appear.
7. Hendrickson, W. A., Methods in Enzymology, 115, 252–270 (1985).
8. Jaynes, E. T., Phys. Rev., 106, 620–630 (1957).
9. Brünger, A. T., Clore, G. M., Gronenborn, A. M., Saffrich, R., and Nilges, M., Science, 261, 328–331 (1993).
10. Neal, R. M., Probabilistic Inference Using Markov Chain Monte Carlo Methods, Tech. Rep. CRG-TR-93-1, Department of Computer Science, University of Toronto (1993).
11. Geman, S., and Geman, D., IEEE Trans. PAMI, 6, 721–741 (1984).
12. Duane, S., Kennedy, A. D., Pendleton, B., and Roweth, D., Phys. Lett. B, 195, 216–222 (1987).
13. Metropolis, N., Rosenbluth, M., Rosenbluth, A., Teller, A., and Teller, E., J. Chem. Phys., 21, 1087–1092 (1953).
14. Swendsen, R. H., and Wang, J.-S., Phys. Rev. Lett., 57, 2607–2609 (1986).
15. Hansmann, U. H. E., and Okamoto, Y., Curr. Opin. Struct. Biol., 9, 177–183 (1999).
16. Tsallis, C., J. Stat. Phys., 52, 479–487 (1988).
17. Hansmann, U. H. E., and Okamoto, Y., Phys. Rev. E, 56, 2228–2233 (1997).
18. Koradi, R., Billeter, M., and Wüthrich, K., J. Mol. Graph., 14, 51–55 (1996).
19. Mal, T. K., Matthews, S. J., Kovacs, H., Campbell, I. D., and Boyd, J., J. Biomol. NMR, 12, 259–276 (1998).
20. Noble, M., Musacchio, A., Saraste, M., and Wierenga, R., EMBO J., 12, 2617–2624 (1993).
21. Zhao, D., and Jardetzky, O., J. Mol. Biol., 239, 601–607 (1994).
22. Spronk, A. E. M., Nabuurs, S. B., Bovin, M. J. J., Krieger, E., Vuister, W., and Vriend, G., J. Biomol. NMR, 25, 225–234 (2003).
