Atomistic Protein Folding Simulations on the Submillisecond Time
Scale Using Worldwide Distributed Computing
Vijay S. Pande1–4
Ian Baker1
Jarrod Chapman4,*
Sidney P. Elmer1
Siraj Khaliq5
Stefan M. Larson2
Young Min Rhee1
Michael R. Shirts1
Christopher D. Snow2
Eric J. Sorin1
Bojan Zagrovic2
1 Department of Chemistry, Stanford University,
Stanford, CA 94305-5080
2 Biophysics Program, Stanford University,
Stanford, CA 94305-5080
3 Department of Structural Biology,
Stanford University, Stanford, CA 94305-5080
4 Stanford Synchrotron Radiation Laboratory,
Stanford University, Stanford, CA 94305-5080
5 Department of Computer Science,
Stanford University, Stanford, CA 94305-5080
Received 27 March 2002; accepted 22 May 2002
Abstract: Atomistic simulations of protein folding have the potential to be a great complement to experimental studies, but have been severely limited by the time scales accessible with current computer hardware and algorithms. By employing a worldwide distributed computing network of tens of thousands of PCs and algorithms designed to efficiently utilize this new many-processor, highly heterogeneous, loosely coupled distributed computing paradigm, we have been able to simulate hundreds of microseconds of atomistic molecular dynamics. This has allowed us to directly simulate the folding mechanism and to accurately predict the folding rate of several fast-folding proteins and polymers, including a nonbiological helix, polypeptide α-helices, a β-hairpin, and a three-helix bundle protein from the villin headpiece. Our results demonstrate that one can reach the time scales needed to simulate fast folding using distributed computing, and that potential sets used to describe interatomic interactions are sufficiently accurate to reach the folded state with experimentally validated rates, at least for small proteins. © 2002 Wiley Periodicals, Inc. Biopolymers 68: 91–109, 2003

Correspondence to: Vijay S. Pande; email: [email protected]
*Present address: Physics Department, University of California, Berkeley, CA
Contact grant sponsor: ACS PRF, NSF MRSEC CPIMA, NIH BISTI, ARO, and Stanford University
Contact grant numbers: 36028-AC4 (ACS PRF), DMR-9808677 (NSF MRSEC CPIMA), IP20 6M64782-01 (NIH BISTI), 41778-LS-RIP (ARO)
Keywords: atomistic protein folding; microsecond time scale; computer hardware; computer algorithms; molecular dynamics; distributed computing; villin; beta hairpin
INTRODUCTION
Understanding the sequence–structure relationship of proteins will play a pivotal role in the postgenomic era, and will have great impact on genetics, biochemistry, and pharmaceutical chemistry [1–3]. A detailed picture of the folding process itself will be important in understanding diseases, such as Alzheimer's and variant Creutzfeldt–Jakob disease, believed to be related to protein misfolding [4]. Finally, an understanding of protein folding mechanisms will be important in protein design and nanotechnology, in which self-assembling nanomachines may be designed using synthetic polymers with protein-like folding properties [5].
Unfortunately, current computational techniques to tackle protein folding simulations are fundamentally limited by the long time scales (from a simulation point of view) needed to study the dynamics of interest. For example, while the fastest proteins fold on the order of tens of microseconds, current single computer processors can only simulate on the order of a nanosecond of real-time folding in full atomic detail per CPU day, a 10,000-fold computational gap. Great strides in traditional parallel molecular dynamics (MD), utilizing many processors to speed a single dynamics simulation, have been made and have partially overcome this divide. A tour-de-force parallelization of simulation code for supercomputers by Duan and Kollman has previously led to the simulation of 1 μs of dynamics for the villin headpiece three-helix bundle [6], demonstrating that parallelization schemes using hundreds of processors can be used to make significant progress at closing this computational gap. However, such methods have fundamental drawbacks: in particular, these methods require complex, expensive supercomputers due to the need for fast communication between processors. Moreover, due to the stochastic nature of folding, in order to study the folding of a 10-μs folder, one must simulate hundreds of microseconds, requiring computing power equal to thousands or tens of thousands of today's processors.
Developing such large-scale parallelization methods is very difficult, and current parallelization schemes cannot scale to the level of even thousands of processors (i.e., cannot use so many processors efficiently). To understand why scalability to thousands of processors is so difficult, consider an analogy to a graduate student thesis. A typical thesis takes 1500 graduate student days. If one employed 1500 graduate students to accomplish this goal, would it be possible to complete a thesis in a single day? Clearly not: the overhead of communication between students, as well as the inability to devise an algorithm to divide the labor evenly, would make 100% efficiency impossible in this case. At this level of scaling, it is likely that the work would actually take longer. These issues, in particular balancing communication time against time spent actually doing work, are mirrored in the division of labor between computer processors. Clearly, the only way to efficiently utilize such a large number of processors is to divide work in a way that requires minimal communication.
Even with an algorithm with perfect scalability (e.g., with a 10,000-fold increase in speed using 10,000 processors), we are still left with the problem of obtaining a 10,000-processor supercomputer. For comparison, the largest unclassified supercomputer in the world (the SP at NERSC) has 2500 processors, and of course this resource must be shared between many (hundreds) different research groups. Recently, another approach has been developed to bridge this enormous computational gap: worldwide distributed computing [7]. There are hundreds of millions of idle PCs potentially available for use at any given time, the majority of which are vastly underused. These computers could be used to form the most powerful supercomputer on the planet by several orders of magnitude [7]. However, to tap into this resource efficiently and productively, we must employ nontraditional parallelization techniques [8,9]. Indeed, we wish to accomplish a seemingly impossible goal: to push the scalability of MD simulations to previously unattainable levels (i.e., the efficient use of tens of thousands
of processors) using an extremely heterogeneous network of processors that are loosely coupled by relatively low-tech networking (primarily modems).
In this review, we first present the details of our method to simulate protein folding using distributed computing, and then summarize our folding simulation results for several small, fast-folding proteins and polymers. Specifically, we demonstrate the applicability of our method by simulating the folding of protein helices [10,11] in atomic detail. Next, as an additional quantitative test of our methodology, we examine the folding rate and folding time distribution of a nonbiological helix [12], for which results of traditional MD simulations [13] and experiments [14] are known. We finally apply these methods to larger and more complex proteins, including a β-hairpin [15,16] and a three-helix bundle [6]. We conclude with an assessment of the validity of our methods, including a quantitative comparison of these results with experimental measurements of folding rates and equilibrium constants, and a discussion of what we have learned about folding mechanisms.
METHODS
Why Is the Dynamics of Complex Systems so Slow and How Can This Be Circumvented?
The dynamics of complex systems typically involves the crossing of free energy barriers [17]. It has been demonstrated [18,19] that free energy barrier crossing dynamics, such as protein folding, does not make steady, gradual progress from one state to another (such as dynamics from the unfolded to folded states during a folding transition), but rather spends most of the trajectory time dwelling in a free energy minimum, waiting for thermal fluctuations to push the system over a free energy barrier. Indeed, this process is dominated by the waiting time, and the time to cross the free energy barrier is in fact much shorter than the overall folding time, typically by several orders of magnitude [20]. This opens the door to the possibility that one may simulate complex processes, such as folding, using trajectories much shorter than the folding time [20] (i.e., using nanosecond simulations to reproduce kinetics with microsecond rate constants).
Methods to exploit this observation have been previously developed (the most notable being path sampling, in which one simulates the paths over the barrier, rather than the time spent waiting in the original free energy minimum), with promising results [20–22]. However, several technical complications have limited the use of these methods in simulating protein folding and in making quantitative comparisons to experiment. First, some of these methods require that the path in question not dwell in metastable states and thus may get stuck in local metastable free energy minima along the pathway (which have been found in many folding and unfolding simulations [19]). Second, these methods require knowledge of the native state as an end goal and in a sense apply a field to the trajectories to reach this native state. Finally, in the end, the heart of the protein folding problem lies in sampling, and even with the great benefits of path sampling methodology, a tremendous degree of sampling (in this case, in path space) must still be performed.
Thus, the main goal of the method we have developed is to simulate folding dynamics starting purely from the protein sequence and an atomistic force field, without using any knowledge of the native state in the folding simulation. It is important that our method can successfully tackle the issue of lingering in metastable free energy minima. To achieve this, we use the following algorithm (ensemble dynamics). Consider running M independent simulations started from a given initial condition (each run starts from the same coordinates, but with different velocities). We next wait for the first simulation to cross the free energy barrier (see Figure 1). Since the average time for the first of M simulations to cross a single barrier is M times shorter than the average crossing time of a single simulation (assuming an exponential distribution of barrier crossing times [8,9]; see below for details), we can use M processors to effectively achieve an M times speedup of a dynamical simulation, thus avoiding the waiting in free energy minima. In a sense, we distribute the waiting to each processor in parallel, rather than in series, as in traditional parallel MD. Given the ability to identify individual barrier crossings, one can then speed the entire (multiple barrier) problem by turning it into a series of single barrier problems, restarting the processors from the new free energy minima after each barrier crossing (see below and Refs. 8 and 9). Also, it can be shown that one need not use identical computers for these calculations, an important fact in employing heterogeneous public clusters [8,9].
To see more quantitatively how one can use these simulations to examine events that occur on considerably longer time scales, consider a protein with single exponential kinetics, where the fraction that folds in time t is given by

$$f(t) = 1 - \exp(-kt)$$

where k is the folding rate. On average, a folding event will occur on the 1/k time scale. However, we expect to see some folding events even at times short compared to the folding time, i.e., when kt is small. In this case, we have $f(t) \approx kt$. How many folding events would we expect to see? Consider studying a protein that folds with k = 1/(10,000 ns): given M = 10,000 simulations, each of length t = 30 ns, we would expect to see $M f(t) \approx M k t = 30$ folding events.
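The arithmetic in this example can be checked with a short Python sketch. The function names are illustrative only and do not come from the Folding@Home software; the numbers are the ones quoted in the text (k = 1/(10,000 ns), M = 10,000, t = 30 ns).

```python
import math


def fraction_folded(k, t):
    """Fraction folded by time t for single-exponential kinetics: 1 - exp(-k t)."""
    return 1.0 - math.exp(-k * t)


def expected_folding_events(k, t, n_sims):
    """Expected number of folding events among n_sims independent simulations."""
    return n_sims * fraction_folded(k, t)


k = 1.0 / 10_000.0  # folding rate, per ns (value quoted in the text)
t = 30.0            # length of each simulation, ns
M = 10_000          # number of independent simulations

exact = expected_folding_events(k, t, M)  # uses the full exponential
linear = M * k * t                        # small-kt approximation, M k t

print(f"exact: {exact:.2f} events, linear approximation: {linear:.2f} events")
```

Because kt = 0.003 here, the linear approximation and the full exponential differ by well under one percent.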
The above discussion shows how one can speed dynamics by a factor of M for a single barrier system, but what about multiple barrier problems? To handle multiple barriers, we suggest a scheme similar to Voter's parallel replica method [9]. This method is summarized in Figure 1. We start M simulations from a single initial condition, and then wait
for the first simulation to cross a free energy barrier. Once this simulation has crossed over to the next free energy minimum, we restart all other simulations from that new location in configuration space, restart the dynamics, and wait for another replica to cross another free energy barrier. Since we employ stochastic dynamics, even though all simulations are restarted from the same configuration, they quickly decorrelate from each other and explore different regions of the phase space.
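The restart scheme can be illustrated with a toy model in Python. This is a minimal sketch, not the production algorithm: it assumes each barrier crossing is an ideal exponential waiting process, that transitions are detected perfectly, and that clones decorrelate instantly after a restart. The three barrier rates are hypothetical values chosen only for illustration.

```python
import random


def serial_time(rates, rng):
    """Simulated time for one trajectory to cross every barrier in sequence."""
    return sum(rng.expovariate(k) for k in rates)


def ensemble_time(rates, n_clones, rng):
    """Per-clone simulated time when all clones restart from the first crosser."""
    total = 0.0
    for k in rates:
        # Every clone draws its own waiting time; the run advances as soon as
        # the first clone crosses, and all clones restart from its coordinates.
        total += min(rng.expovariate(k) for _ in range(n_clones))
    return total


rng = random.Random(0)
rates = [0.01, 0.002, 0.005]  # hypothetical barrier-crossing rates, per ns
M = 20                        # number of clones
n_trials = 2000

mean_serial = sum(serial_time(rates, rng) for _ in range(n_trials)) / n_trials
mean_ensemble = sum(ensemble_time(rates, M, rng) for _ in range(n_trials)) / n_trials
speedup = mean_serial / mean_ensemble

print(f"serial: {mean_serial:.0f} ns, ensemble: {mean_ensemble:.1f} ns per clone, "
      f"speedup: {speedup:.1f}")
```

Under these idealized assumptions the measured speedup should come out close to M, since each barrier is crossed with an effective rate M times larger.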
Quantitative Rate Prediction
Below, we show how this scheme can be used to predict folding rates. This result stems from the fact that the distribution of barrier crossing times for the first-to-cross is directly linked to the distribution of usual barrier crossing times. To demonstrate this, we again assume single exponential kinetics (deviations from single exponential kinetics have also been examined elsewhere [8]). For a single processor, the probability density that a particular simulation crosses the barrier at time t is

$$P_1(t) = k \exp(-kt)$$

For the M simulation case, the probability density that the first simulation crosses at time t is

$$P_M(t) = k \exp(-kt) \, M \left[ 1 - k \int_0^t \exp(-kt') \, dt' \right]^{M-1}$$

(i.e., the probability that one simulation has crossed, times a degeneracy factor of M, times the probability that the remaining M − 1 simulations have not folded). Evaluation of the integral above yields

$$P_M(t) = M k \exp(-Mkt)$$

which is exactly the same distribution as the single processor case, except with an effective rate which is M times faster. Since this method simply speeds the effective rate of crossing each barrier, one can use the number of processors M and the rate for first crossing to predict the experimental rate.
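For readers who want the intermediate step, the evaluation of the integral can be written out as follows. This is a worked restatement under the same single-exponential assumption, using the symbols defined above (t' is a dummy integration variable).

```latex
\begin{align}
1 - k \int_0^t e^{-kt'} \, dt'
  &= 1 - \left( 1 - e^{-kt} \right) = e^{-kt}, \\
P_M(t)
  &= k \, e^{-kt} \cdot M \cdot \left( e^{-kt} \right)^{M-1}
   = M k \, e^{-Mkt}, \\
\langle t \rangle_M
  &= \int_0^\infty t \, P_M(t) \, dt = \frac{1}{Mk}.
\end{align}
```

The last line states the practical consequence: the mean first-crossing time among M simulations is 1/(Mk), so multiplying the observed first-crossing time by M recovers the single-trajectory time scale 1/k.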
The error inherent in the above procedure for calculating the rate and the time constant can be estimated in the following way. As shown above, for a given $N_{\mathrm{total}}$ and a given t, the rate k is simply proportional to the number of molecules that have folded by time t, $N_{\mathrm{folded}}(t)$. Since each folding process behaves probabilistically (according to an exponential distribution) and given fixed t and $N_{\mathrm{total}}$, the number of processes that will fold by time t, $N_{\mathrm{folded}}(t)$, will be a random variable. In other words, different realizations of the large experiment containing $N_{\mathrm{total}}$ individual processes will, by their very nature, yield different values of $N_{\mathrm{folded}}(t)$ for a fixed time t. From this it follows that our rate estimate will also be associated with a certain inherent uncertainty. From elementary probability theory, we know that the number of folding events by time t, $N_{\mathrm{folded}}(t)$, given a constant rate, will be distributed according to the Poisson distribution. This in turn means that the rate estimate, which is proportional to $N_{\mathrm{folded}}(t)$, will also be distributed according to the Poisson distribution. The standard deviation of a Poisson distribution with rate $\lambda$ is equal to $\lambda^{1/2}$, meaning that our rate estimate and its standard deviation will simply be

$$k = \frac{N_{\mathrm{folded}}}{N_{\mathrm{total}} \, t} \pm \frac{N_{\mathrm{folded}}^{1/2}}{N_{\mathrm{total}} \, t}$$
FIGURE 1 Simulating dynamical events using worldwide distributed computing. Traditional methods utilize multiple processors to speed a single dynamics calculation. We suggest that multiple processors can be used to generate sets of calculations, and that the desired thermodynamic or kinetic observables can be calculated from such an ensemble. This method does not require supercomputers and can run well on massively parallel clusters [7–9]. Consider a multiple barrier-crossing problem (a model of many complex phenomena). Since the bulk of simulation time is spent waiting for thermal fluctuations to bring the system over the barriers, one can speed the calculation by starting many simulations in the first free energy minimum (a), and waiting until just one of them has crossed. At this point, we couple the simulations by placing them all in the same place in the configuration space as the simulation that has crossed that barrier (b). This process is then repeated as many times as needed to cross additional free energy barriers (c). One can show (see text) that this algorithm, with M processors, is equivalent to a single processor system running M times faster [8,9]. Thus, with hundreds to thousands of processors and assuming that one can identify transitions, we would be able to bridge the computational barriers currently limiting protein folding and reach well into the microsecond time scale.
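Standard propagation of error results in a time constant and standard deviation of

$$\tau = \frac{1}{k} = \frac{N_{\mathrm{total}} \, t}{N_{\mathrm{folded}}} \pm \frac{N_{\mathrm{total}} \, t}{N_{\mathrm{folded}}^{3/2}}$$

For example, for the β-hairpin folding data (see below), we have $N_{\mathrm{total}} = 2700$, $N_{\mathrm{folded}} = 8$, and t = 14 ns, which results in $k = (2.1 \pm 0.74) \times 10^5\ \mathrm{s}^{-1}$ and $\tau = 4.7 \pm 1.7\ \mu\mathrm{s}$.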
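The β-hairpin numbers can be reproduced with a few lines of Python. This is an illustrative sketch of the Poisson error estimate described above; the function names are hypothetical, and the inputs are the values quoted in the text.

```python
import math


def rate_estimate(n_folded, n_total, t):
    """Return (k, sigma_k): k = N_folded / (N_total t) with Poisson error."""
    k = n_folded / (n_total * t)
    sigma_k = math.sqrt(n_folded) / (n_total * t)
    return k, sigma_k


def time_constant_estimate(n_folded, n_total, t):
    """Return (tau, sigma_tau): tau = 1/k with the error propagated from k."""
    tau = n_total * t / n_folded
    sigma_tau = n_total * t / n_folded ** 1.5
    return tau, sigma_tau


# Values quoted in the text: 2700 simulations, 8 folding events, 14 ns each.
k, sigma_k = rate_estimate(8, 2700, 14.0)             # per ns
tau, sigma_tau = time_constant_estimate(8, 2700, 14.0)  # ns

print(f"k   = {k * 1e9:.2e} +/- {sigma_k * 1e9:.2e} s^-1")
print(f"tau = {tau / 1e3:.1f} +/- {sigma_tau / 1e3:.1f} microseconds")
```

Note that the relative error is the same for k and τ, namely $N_{\mathrm{folded}}^{-1/2}$, or about 35% for eight folding events.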
We stress that as long as one can identify transitions, thus allowing an M times speed-up for all barrier crossings, the dynamics we simulate will faithfully follow the dynamics one would obtain from traditional MD, but simply M times faster. If there are off-pathway traps, our method will go to them; indeed, we will reach them M times faster. However, we will escape these traps M times faster as well. This method is not intended as a structure prediction algorithm, but rather a means to speed dynamics and study the mechanism of folding, which may include on-pathway intermediates or diversions to traps.
How Can One Identify Free Energy Barrier Crossings (Transitions)?
Of course, the utility of this method rests on our ability to identify transitions, i.e., to determine whether a simulation has crossed a free energy barrier. Voter's parallel replica method was intended to accelerate the dynamics of solid-state systems that have energy barriers, in which one can identify new states by performing energy minimization to see whether one has crossed an energy barrier [9]. However, in protein folding (as well as many other complex systems), the relevant barriers are free energy barriers, and thus an energy minimization technique is not applicable. In order for this method to be applied to a broad range of barrier crossing problems, one needs to use a more general way to identify free energy barrier crossings.
We suggest that, in analogy to first-order phase transitions, one could look for a large variance in energy, which can loosely be related to a momentary surge in the heat capacity (a common sign of a first-order phase transition). Such energy variance peaks have been seen to coincide with free energy barrier crossings in simple models [23] and all-atom (S. Perkins and V. Pande, unpublished results) models of protein folding. This technique has the significant advantage that it does not require any knowledge of the structure of the protein at the barrier [24]. Moreover, to the extent that energy variance peaks correctly identify transitions, these peaks would aid in the interpretation of the simulation results, since they would demarcate transitions to new free energy minima.
Of course, in the case of single-exponential kinetics, as is experimentally observed for almost all small proteins, there exists only one rate-determining free energy barrier, and thus recognizing the barrier is not essential to the technique. In fact, although we do not discuss it in this paper, in some cases ignoring the transitions can result in reaching the folded state faster than by recognizing all the barriers [8]. Finally, simulating completely independent trajectories is another appealing possibility for systems with single exponential kinetics; we discuss this possibility in the Discussion section below.
Simulation Details
For all of the molecules presented here, each simulation is started from a completely extended state. This is done to avoid any possibility of biasing the initial state toward the native state of the molecule. Clearly the extended state does not represent the structure of the unfolded state. Indeed, we find that rapidly (i.e., within 1–3 ns of MD simulation) this extended state relaxes to the unfolded state of the protein. While this practice utilizes more computational time than, for example, starting from some predicted unfolded state, it has the virtue of not making any assumptions about the unfolded ensemble and removes any possibility of biasing the system to the native state.
For each run, we have used M clone processors, each simulating folding in atomic detail with molecular or stochastic dynamics simulations (Figure 1). Once one of these clones makes a transition (identified by a spike in the energy variance: see below), we declare that the simulation has gone through a transition, copy the resulting configuration to all of the other processors, and recommence simulations from the new configuration. After restarting all simulations from the coordinates of the barrier-crossing simulation, one must ensure decorrelation of the next ensemble of trajectories in order to achieve an increase in computational speed. This process is performed many times, over several runs.
In our simulations, the spatial coordinates of the barrier-crossing simulation were copied and unique random number seeds (for Langevin dynamics random forces) were used to immediately differentiate the simulations. In a purely deterministic simulation, one would need to differentiate each simulation by restarting them with differing velocities, which may lead to potentially nonphysical discontinuities in the path; however, if the velocity decorrelation time is much shorter than the conformational decorrelation time (certainly true for dynamics in any water-like solvent), then the effects are likely to be minimal. In either case, the path obtained would correspond to a fast traversal of the potential landscape, but the total simulation time among all processors would be equivalent to the additional time waiting in minima that a representative serial simulation would take.
The Folding@Home distributed computing system (http://folding.stanford.edu) was used for the two most demanding calculations (the β-hairpin and villin simulations) presented here. The Folding@Home client software (which performs the scientific calculations) is based upon the Tinker molecular dynamics code [25], with numerous modifications performed by Michael Shirts, other members of the Pande group, Jed Pitera, and Bill Swope. We simulated folding and unfolding at 300 K and at pH 7 (unless noted otherwise), using the OPLS [26] parameter set and the GB/SA [27] implicit solvent model. Stochastic dynamics were used to simulate the viscous drag of water (γ = 91/ps), and a 2 fs integration time step was used with the RATTLE
algorithm [28] to maintain bond lengths. Long-range interactions were truncated using 16 Å cutoffs and 12 Å tapers.
We identified transitions by a heat capacity spike associated with crossing a free energy barrier. It has been previously shown that this is a means to identify transitions in all-atom [8] and simplified protein model [29] simulations. To monitor the heat capacity during the simulation, we calculate the energy variance, and use the thermodynamic relationship $C_v = (\langle E^2 \rangle - \langle E \rangle^2)/(k_B T^2)$, where E is the energy of the system, $k_B$ is Boltzmann's constant, and T is the temperature; note that since we are using an implicit solvent, our energy is often called an internal free energy (the total free energy except for protein conformational entropy). Each PC runs a 100 ps MD simulation (1 generation), calculates the energy variance within this time period, and then returns this data to the Folding@Home server. If the energy variance exceeds a preset threshold value, the server identifies this trajectory as having gone through a transition, and then resets all other processors to the newly reported coordinates. Since the heat capacity is extensive, we used a fixed value of this threshold per atom (0.8 kcal²/mol² per atom) for all molecules (which, for example, leads to a threshold of 300 kcal²/mol² for villin).
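The transition test described above can be sketched in Python. This is a simplified illustration, not the Folding@Home server code: the function names, the sample energies, and the atom count are hypothetical, and only the per-atom threshold of 0.8 kcal²/mol² is taken from the text.

```python
from statistics import pvariance

THRESHOLD_PER_ATOM = 0.8  # kcal^2/mol^2 per atom, value quoted in the text


def energy_variance(energies):
    """Energy variance <E^2> - <E>^2 over one generation of sampled energies."""
    return pvariance(energies)


def is_transition(energies, n_atoms, threshold_per_atom=THRESHOLD_PER_ATOM):
    """Flag a generation whose energy variance exceeds the size-scaled threshold."""
    return energy_variance(energies) > threshold_per_atom * n_atoms


# Hypothetical energies (kcal/mol) sampled during two 100 ps generations of a
# hypothetical 10-atom system, so the threshold is 0.8 * 10 = 8 kcal^2/mol^2.
quiet = [-500.0, -501.0, -499.0, -500.5, -499.5]
spike = [-500.0, -490.0, -510.0, -495.0, -505.0]

print(is_transition(quiet, n_atoms=10))
print(is_transition(spike, n_atoms=10))
```

Scaling the threshold with the number of atoms reflects the extensivity argument in the text: a single per-atom value can then be reused across molecules of different size.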
Since transitions occur relatively infrequently (see below), one need not run these simulations on massively parallel supercomputers (with high-speed communication); instead, these simulations are well suited for large, decentralized distributed computing clusters, such as the Folding@Home project. Not only is this a demonstration that such distributed computing clusters can be used to study long time-scale kinetics with molecular dynamics; we stress that this is likely the only way such calculations could have been practically performed, considering the great computational demands of these calculations.
RESULTS
Protein α-Helices
To test the methodology presented above, we have simulated the folding of two different α-helical peptides. One sequence we examined, the Fs peptide Ac–A5(A3RA)3A–NH2, has been shown experimentally to have biexponential kinetics, with characteristic times of 10 ns and 160 ± 60 ns [10]. Helices are believed to form via nucleation [30,31], which is influenced by the disorder in a system (either as a nucleation accelerator or blocker), analogous to a liquid with impurities. In our system, the arginine residues could be considered to be an analog of these impurities that blocks propagation, and it is interesting to consider the role of this disorder in the sequence above, i.e., whether the arginine residues affect the nucleation processes. To address this, we have also folded a pure poly-A chain, Ac–A20–NH2.
We have been able to fold both of these protein sequences and find rates comparable to experiment at a temperature of 10 °C. The initial configurations were completely elongated chains (φ = −135°, ψ = 135°). Qualitatively, both the poly-A helix and the Fs peptide folded by first undergoing nucleation followed by propagation toward the termini. We found propagation in both directions (N to C and C to N), although we do not have sufficient statistics to determine a bias in propagation direction [30]. Quantitatively, the Fs peptide folded (i.e., reached 15 helical residues, the value expected from experiment [10]) in 82 ± 60 ns in our simulations. Note that while 7 runs were used for this average, 2 of the 7 Fs peptide runs had not folded after 160 ns. Since these runs were included in the average as folding in 160 ns, the average is somewhat lower than it should be. However, since the experimental rate is 1/(160 ns), one would expect that on average 4.4/7 [63% = 1 − exp(−kt)] of the runs would fold within 160 ns, whereas we observed 5/7, which is well within the experimental bounds [11].
Moreover, our simulations capture some finer detail about the nature of folding. We see fast early events, as found experimentally. The alanine-rich N-terminal part of the Fs peptide folded very quickly, in 15 ± 10 ns, consistent with the observation of N-terminal fluorophore quenching in 10 ns by Eaton and co-workers [11] and the faster rate observed by Dyer and co-workers [10]. The poly-A helix folded considerably faster (18 ± 8 ns, out of 8 runs) and typically had more helical content (17.8 residues vs 15.1 for the Fs peptide).
Apparently, the arginine residues are responsible for the differences in these folding rates by acting as blockers of helix nucleation and propagation. Looking at the formation of secondary structure vs. time (see Figure 2, right), we see that helical propagation halts at the arginine residues (R) and often the completion of helix formation requires additional nucleation events. While our eight poly-A helix runs did show stalling of propagation (Figure 2), these events were not localized to any particular point in the chain. Why does arginine limit propagation? We suggest that the long Arg side chain significantly limits its mobility, and moving into a helical φ/ψ orientation thus occurs much more slowly.
Nonbiological Helices
How generally applicable and accurate is the coupled simulation method? To address this question, we have applied this method to study the folding of a nonbiological helix, a 12-mer of polyphenylacetylene (PPA) [12]. This polymer can be considered to be a nonbiological analog of polyalanine, since it is a homopolymer with a simple side chain that folds into a
helix [12]. We have previously shown [13] that this polymer folds to a helix on the tens of nanoseconds time scale, in accordance with previous experimental observations [14]. We find that our ensemble dynamics method works well for PPA. The mean folding time and folding time distribution are consistent with brute force, traditional simulations of PPA [13]. This is demonstrated by the agreement in mean folding times between the two methods and the similarity of the folding time distribution (see Figure 3 and Table I).
C-Terminal β-Hairpin of Protein G
α-Helices and β-hairpins together represent the most ubiquitous secondary structural elements in proteins. In a previous section, we discussed our simulation results for helices and now we concentrate on hairpins. We have recently reported a full-atom, implicit-solvent simulation of folding of the hairpin at a biologically relevant temperature [32], and here we briefly summarize those results. We have obtained a very large ensemble of conformations, which includes mostly partially folded structures, as well as eight complete, fully independent folding trajectories. These data sets allow us to determine the key trends characterizing the folding process and determine several average properties that have been measured or could, in principle, be measured experimentally.
Based on our results, we can estimate the folding rate of the hairpin in the following way: we have simulated 27 independent runs, each consisting of M = 100 clone simulations that, on average, completed approximately 14 ns of simulated time, bringing the total to approximately 38 μs of real time simulation. Out of 2700 simulations, we have detected eight complete folding events, which (if we assume single exponential folding kinetics) results in an estimated folding time of approximately 4.7 μs. This prediction is in excellent agreement with the experimentally measured time of 6 μs [16,32].
FIGURE 2 Folding simulation of α-helices. Shown above are trajectory data for simulations of the poly-A helix (left) and Fs peptide (right). Top: number of helical units vs time (dotted line) and energy variance vs time. We see that peaks are associated with nucleation events. Bottom: Secondary structure formation vs time: red, yellow, and blue denote helices, β-sheets, and turns, respectively. In both cases, we see nucleation events (corresponding to energy variance peaks). However, in the case of the Fs peptide, nucleation events did not occur at the arginine residues (R) and propagation typically was blocked at these residues (also seen in the other seven runs we performed, data not shown). We estimated the time by multiplying the directly simulated time t by the number of processors M (M = 24 and M = 128 for the left and right trajectories, respectively).

Our results offer the following picture of the folding mechanism (Figure 4). Folding from a fully extended conformation begins with a rapid collapse to a more compact structure. During this time, various temporary hydrogen bonds form, condensing the peptide and decreasing the costly loop entropy that hampers the formation of the hydrophobic core. These temporary hydrogen bonds form and break; their pattern, which varies from run to run, has no resemblance to the final hydrogen-bonding pattern of the hairpin. Next, an interaction between the hydrophobic core residues is established. This is clearly the central event in the folding process and most probably its rate-limiting step. Note that at this point the core is still not fully formed: the initial hydrophobic interaction most often involves just two hydrophobic residues on the opposite sides of the future hairpin. Full formation of the core typically appears simultaneously with the establishment of final hydrogen bonds.
This pathway was also suggested by several other simulation
methods. Pande and Rokhsar reported the results of high temperature
unfolding and refolding of the β-hairpin, in which a discrete
unfolding pathway was recognized to include a hydrophobically
stabilized intermediate (H state) with only the Val54 side chain
being released from the core and little hydrogen bonding
occurring.19 Karplus and co-workers used multicanonical Monte Carlo
simulations to look at folding of the hairpin with similar
results.33 Garcia and Sanbonmatsu35 and Berne and co-workers34 later
verified the existence of this intermediate through a
temperature-exchange Monte Carlo/molecular dynamics hybrid model of
unfolding in which the thermodynamics of the unfolding events are
well described.35 They note that these intermediates appear with, on
average, 2 fewer hydrogen bonds than the folded hairpin. This H
intermediate was then observed in mechanically driven unfolding
simulations performed by Bryant et al.,36 who describe it as
including a nearly assembled core and very little backbone hydrogen
bonding.

FIGURE 3 Quantitative validation of our method. We plot the
folding time distribution for a 12-mer PPA helix calculated from the
ensemble dynamics (gray) and traditional molecular dynamics
(black) methods. We find excellent agreement both with simulation
and with experiment (which finds a characteristic time of 10 ns). We
calculated the folding time as Mt, where t is the simulation time of
the individual trajectory and M = 20 processors were used. The
quantitative agreement here shows that one can indeed achieve a
linear speed-up using 20 processors for the 12-mer PPA folding
problem. Inset: a folded PPA 12-mer.

Table I Summary of Predicted vs Experimentally Measured Folding
Times (a)

Protein/Molecule                     Predicted    Experimental  Experiment
                                     time (ns)    time (ns)     reference
Polyphenylacetylene (PPA)            5.3 (b)      10            14
Fs peptide [Ac-A5(A3RA)3A-NH2]       127 (c)      160 ± 60      10, 11
C-terminal β-hairpin of protein G
  (Ac-GEWTYDDATKTFTVTE-NH2)          4700 ± 1700  6000          15, 16
Villin headpiece                     20,000 (d)   11,111        37, 38

(a) We see a very strong correlation between our prediction and
experiment. For a direct correlation, we find R² = 0.993, p value
= 0.008, and for a correlation of the log of these times, to match
Figure 8, we find R² = 0.993, p value = 0.000026.
(b) PPA folds with nonexponential behavior. These numbers report
the fast time in a double exponential fit.
(c) If we average the folding times of the runs that folded, we get
82 ± 60 ns. However, if we include the data for the runs that did
not fold, we see that 5/7 folded in 160 ns; therefore using
5/7 = 1 − exp(−kt) leads to a time of 1/k = 127 ns.
(d) This number represents an estimate based on one folding event,
and therefore has a large error and thus is likely reliable solely
as an order of magnitude prediction.
The picture of the folding process that emerges is, in essence, a
blend of the hydrogen-bond-centric and the hydrophobic-core-centric
views of hairpin folding: nonspecific hydrogen bonds are important
in the initial stages of folding, but the key event that stabilizes
the U-shaped precursor of the hairpin and guides the downstream
folding process is the formation of a hydrophobic interaction
between core residues. Final hydrogen bonds appear later, around the
same time the full formation of the hydrophobic core occurs, and
these continue to fluctuate even after folding is complete.
Villin Headpiece
We have also simulated the folding of a thermostable, fast
folding,37 36-residue α-helical subdomain (pdb-code 1VII) from the
villin headpiece38,6 (the C-terminal domain of the much larger
villin actin binding protein). Figure 5 details the nature of this
folding trajectory. We start from a completely elongated structure
and then see rapid relaxation into a random-walk unfolded state
(U). Next (Figure 5a), the C-terminal helix forms very quickly (at
the tenth generation, G10 = 250 ns = Mt = 250 × 10 generations × 0.1
ns/generation; see Methods for details). This time is consistent
with helical folding times found experimentally39 and in
simulation.8 The protein then collapses, driven by the attraction of
its hydrophobic groups. While many residues have native-like
secondary structure (see Figure 5b), there is a large degree of
non-native side-chain interaction, such as the contact of TRP24
and PHE36 with hydrophobic core residues, although they are
solvent exposed in the native structure. This intermediate collapsed
thermodynamic state (I) consists of an ensemble of many
conformations with partial native secondary structure, but
confounded by a lack of native side-chain packing.
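The generation bookkeeping used in this section can be sketched as a small helper. This is only an illustration of the Mt conversion quoted in the text and figure captions (0.1 ns per generation per processor); the function name and its signature are our own and are not part of the Folding@Home software.

```python
def equivalent_time_ns(generation, processors, ns_per_generation=0.1):
    """Ensemble-equivalent time M*t (in ns) reached at a given generation.

    generation        -- generation index G (each generation is one short run)
    processors        -- number of processors M running in parallel
    ns_per_generation -- time simulated by one processor per generation
                         (0.1 ns, the value quoted in the text)
    """
    if generation < 0 or processors <= 0:
        raise ValueError("generation must be >= 0 and processors must be > 0")
    simulated_ns = generation * ns_per_generation  # t: time on a single processor
    return processors * simulated_ns               # M * t


if __name__ == "__main__":
    # Villin headpiece, M = 250: G10 and G225 (values quoted in the text)
    print(equivalent_time_ns(10, 250), "ns")
    print(equivalent_time_ns(225, 250) / 1000.0, "microseconds")
    # Beta-hairpin, M = 100: one generation
    print(equivalent_time_ns(1, 100), "ns")
```

With M = 250, generation G10 corresponds to 250 ns and G225 to roughly 5.6 μs, in line with the approximate times quoted for the villin trajectory.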
The protein remains in this state for a very long time (the
equivalent of 3.2 μs) until a thermal fluctuation occurs which
breaks key non-native interactions that were preventing the
formation of the hydrophobic core. Once these non-native contacts
are broken, the protein rapidly folds to its native state (N).
Looking at this transition in more detail (Figure 6), we see that in
order to break non-native contacts (such as, but not limited to, the
interaction between PHE11 and PHE36 in G200), the protein expands,
breaking many contacts (G210), and then collapses into its native
fold (G225), as identified by a root mean square deviation (RMSD)
similar to that found by exploring the native state in our unfolding
simulation (i.e., 3–4 Å; see below). This event occurs after the
equivalent of 5.5 μs, which is within the time estimated
experimentally (on the order of 10 μs).37 Since PHE36 forms
non-native (misfolded) contacts in this intermediate state (as well
as the intermediate found in the Kollman simulation6), we predict
that removing this bulky hydrophobic side chain would likely
increase the folding rate.
We have performed four other coupled simulations that have also
each reached the 5 μs time scale (data not shown). All of these
trajectories have reached I (RMSD between 5 and 7 Å, radius of
gyration Rg between 7 and 10 Å), but none have reached the N state.
Statistically, this is not surprising and can be used to estimate
the folding rate (see Methods, above): if the mean folding time for
villin were 20 μs and it follows exponential kinetics, then one
would expect that 20% of runs would fold in 5 μs, in agreement with
our results.
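The arithmetic behind this estimate follows from single-exponential kinetics, where the fraction folded by time t is 1 − exp(−t/τ). The sketch below is only an illustration with hypothetical function names; it evaluates that expression for villin and inverts it for the Fs peptide case described in footnote c of Table I (5 of 7 runs folded within 160 ns).

```python
import math


def fraction_folded(t, mean_folding_time):
    """Expected fraction of runs folded by time t for single-exponential kinetics."""
    return 1.0 - math.exp(-t / mean_folding_time)


def mean_time_from_fraction(fraction, t):
    """Invert f = 1 - exp(-t/tau) to obtain tau from the fraction folded by time t."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must lie strictly between 0 and 1")
    return -t / math.log(1.0 - fraction)


if __name__ == "__main__":
    # Villin: mean folding time 20 microseconds, runs of 5 microseconds
    print(round(fraction_folded(5.0, 20.0), 3))
    # Fs peptide: 5 of 7 runs folded within 160 ns
    print(round(mean_time_from_fraction(5.0 / 7.0, 160.0), 1), "ns")
```

For villin this gives a folded fraction of about 0.22, and for the Fs peptide a folding time of about 128 ns, matching the 127 ns entry in Table I to within rounding.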
We have also used Folding@Home to study the native state of
villin, i.e., by starting simulations from the NMR structure.38 One
use of such simulations is the determination of the variability of
conformations in the native state ensemble. Moreover, since our
method allows us to simulate events that would occur on the
microsecond time scale, we should also be able to simulate villin
unfolding under experimental conditions (e.g., 300 K). We see
(Figure 7) that conformations within the native state typically
have a 3–4 Å RMSD from the NMR structure. Thus, we identify our N
state with the native state of this protein since our folding
simulation reaches an ensemble of conformations with 4 Å RMSD (our
conformation from the folding run, which was most similar to the
NMR structure, had a 3.3 Å RMSD). Moreover, our native state
simulation was run long enough to explore unfolding to the
intermediate state: the protein did not completely unfold during
the simulation time scale (1 μs). However, a transition to the
partially folded intermediate (I) was detected.
Finally, we compare our results to what one might expect from
protein folding theory. One of our primary results is that folding
appears to proceed through transitions between free energy minima:
starting in an unfolded state (U) to an intermediate (I) and then to
the native state (N) (e.g., see reviews Refs. 13 and 40, and
references therein). As previously discussed,13 the collapse to the
I state appears to be driven by hydrophobic interactions. However,
considering that villin is one of the fastest folding proteins
currently known, it is interesting to consider that many of these
interactions were non-native.
While there is little structure in U, the intermediate I is
collapsed, with some native tertiary structure, much non-native
structure, and little side-chain packing. Thus, I is very much
like the molten globule intermediates found in other proteins.41
Also, the intermediate state found by Duan and Kollman6 fits our I
state, since it is collapsed, partly native, but missing certain
native contacts, and satisfies our I state definition in terms of
the RMSD and radius of gyration. We see a somewhat cooperative
U → I transition (e.g., reflected in free energy barriers in Figure
7a) and a very cooperative I → N transition, as predicted
previously.42,43 Indeed, the cooperativity of the I → N transition
appears to result from side chain packing in our model, since many
non-native contacts must collectively break to allow the formation
of native side-chain packing. It seems highly unlikely that the
native structure could be reached so rapidly through piecewise
movements without this collective event.
It is clear that any potential set employed to model atomic
interactions will have its limitations. The relevant question to
ask is, "How good do they need to be and what would result from
errors in these potentials?" Since we do see folding to the native
state, it appears that the potentials we used were sufficient in
this case. However, we cannot rule out the possibility that in our
model, the I state is as stable as (or more stable than) the N
state. This could be the result of slight errors in the
potentials.44 Much like adding denaturant in a physical example,
adding errors to a potential reduces the energetic favorability of
the N state and thus makes the I state relatively more favorable due
to its entropic advantage.44
DISCUSSION
Scalability of the Algorithm
As more computers become accessible to distributed computing
methods, it is important to understand the limits of the
scalability of the method, i.e., the limits to the number of
processors one can use to achieve a speed increase. While this
method can yield significant speed improvements for simulating
complex systems (and scalability considerably beyond traditional
parallel MD), there are some important limitations to its
scalability we must consider. For example, simulating a process
where the mean time to fold tfold = 100 ns using M = 10^6
processors will not necessarily mean that one will achieve folding
events using only tfold/M = 100 fs trajectories. The scalability
will be inherently limited by the barrier crossing time tcross
(i.e., the time spent actually crossing the barrier, not including
the much longer time spent waiting in the free energy minima, which
dominates the folding time tfold). Since the speed increase from our
method is due to the elimination of the waiting time, we expect that
using more than M ≈ tfold/tcross processors will not give any
additional speed increase.8,9 Thus, the bounds of scalability for
this method are also related to an interesting physical question:
How much time is required to actually cross the free energy
barrier? This time can be quantified by using our method to look
for the limits of scalability within our technique.
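The scaling limit described above can be written down as a crude two-regime model: the per-trajectory simulation length falls as tfold/M until it reaches the barrier crossing time tcross, and stays there. The sketch below is only an illustration of that argument; the function names are hypothetical, and the value tcross = 0.5 ns is an assumed number chosen from the "hundreds of picoseconds to nanosecond" range, not a measured quantity.

```python
def max_useful_processors(t_fold, t_cross):
    """Processor count beyond which no further speed-up is expected (t_fold / t_cross)."""
    return t_fold / t_cross


def trajectory_length_needed(t_fold, t_cross, processors):
    """Per-processor simulation length needed to see a folding event.

    Waiting time is divided among the processors, but the barrier crossing
    itself must still be simulated in full on one of them.
    """
    return max(t_fold / processors, t_cross)


if __name__ == "__main__":
    t_fold = 100.0   # ns, mean folding time from the example in the text
    t_cross = 0.5    # ns, ASSUMED barrier crossing time (illustrative only)
    print("useful processors up to about", max_useful_processors(t_fold, t_cross))
    for m in (20, 200, 10**6):
        print(m, "processors ->", trajectory_length_needed(t_fold, t_cross, m), "ns each")
```

Under these assumed numbers, adding processors beyond about 200 leaves the required trajectory length fixed at tcross rather than shrinking it toward 100 fs.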
For the proteins we have examined, it is likely that this time is
on the hundreds of picoseconds to nanosecond time scale. It is
interesting to consider how this minimum time varies with the
folding time. Since there need not be any correlation between these
times, it is possible that slower folding proteins (e.g., those
which fold on the millisecond and longer time scale) could be
folded using our method with current microprocessors by simply
employing more of them. Indeed, computational resources on the
million-processor scale have been proposed, such as IBM's Blue
Gene, as have other distributed computing projects. With such
computational resources, it is possible that we could push our
simulations from the hundreds of microsecond timescale to fractions
of a second, allowing us to reach timescales relevant for slow
folders.

FIGURE 4 A detailed analysis of a folding trajectory of the
β-hairpin from the C-terminal segment of protein G. (a) Cartoon
representation of the folding trajectory; the backbone of the
peptide is represented as a gray trace; the core hydrophobic
residues (Trp43, Tyr45, Phe52, Val54) are shown in dot
representation; (b) RMSD from the 1GB1 structure of the hairpin
(residues 43–54), radius of gyration, and the number of
backbone–backbone hydrogen bonds; (c) distance between key hydrogen
bonding partners (green: Trp43–Val54; red: Tyr45–Phe52), and the
minimum distance between Trp43 and Phe52 (black). Note that the
minimum distance between Trp43 and Phe52 reaches its final value
before the key hydrogen bonds are established; (d) solvation energy
(Esolvation), charge–charge energy (Echarge), and total potential
energy vs time. The initial hydrophobic collapse of the unfolded
peptide correlates with a sharp decrease in Etotal, while the
attainment of the final structure correlates with Etotal reaching
its final value. A significant deviation around G160 of Echarge and
Esolvation from their final value is correlated with the temporary
breaking of the key Tyr45–Phe52 hydrogen bond; (e) a concise summary
of the key events along the folding trajectory (color code:
yellow = high; violet = low). HB-ij denotes the distance between the
hydrogen bonding partners i and j; min-kl denotes the minimum
distance between residues k and l. Note that the establishment of
the hydrophobic Trp43–Phe52 interaction is the earliest event of
significance along the trajectory. Time is reported in the number of
generations: roughly, 1 generation corresponds to 100 processors ×
0.1 ns/generation/processor = 10 ns/generation.
Limitations of Our Methods to Predict Folding Rates and Mechanisms
Below, we summarize the approximations involved in our rate
determination method above, and our justifications and reasoning
for these approximations. First, we assume that the barrier
crossing probability density is exponential. This does not mean
that the total kinetics is single exponential, but that the time to
cross an individual barrier is exponentially distributed. Second, we
assume that transitions are correctly identified, i.e., that there
are no false positives or false negatives. This second assumption is
of greater concern. While we cannot know for certain that we are
correctly identifying transitions, our ability to predict rates
suggests that incorrect transition prediction is not an issue.
(Perhaps this may be due to the fact that transitions were rarely
detected in the larger proteins studied and were typically found at
the beginning of the folding trajectories; see the section "Is
Transition Detection Really Necessary?" below.) However, we can
mathematically address the consequences of incorrect transition
detection and nonexponential kinetics, as we have done in a previous
work.8 Finally, in our error analysis above, we can calculate the
statistical uncertainty of our rates. Even with only tens of
successful folding trajectories, the statistical uncertainty is
negligible.
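The first assumption is what underlies the M-fold speed-up: if the waiting time to cross a barrier is exponentially distributed with mean τ, then the first crossing among M independent simulations is also exponentially distributed, with mean τ/M. The following Monte Carlo sketch checks this numerically. It is a toy illustration with hypothetical function names and arbitrary parameters, not the analysis code used for the results in this review.

```python
import random


def first_crossing_time(rng, mean_time, processors):
    """Earliest crossing time among `processors` independent exponential waits."""
    rate = 1.0 / mean_time
    return min(rng.expovariate(rate) for _ in range(processors))


def mean_first_crossing(mean_time, processors, trials=20000, seed=0):
    """Monte Carlo estimate of the mean first-crossing time of the ensemble."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    total = sum(first_crossing_time(rng, mean_time, processors)
                for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    tau, m = 100.0, 20  # arbitrary illustrative values (ns, processors)
    print("expected tau/M:", tau / m)
    print("simulated     :", round(mean_first_crossing(tau, m), 2))
```

The simulated mean should agree with τ/M to within sampling noise, which is the linear speed-up used to convert simulated time t into the equivalent time Mt.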
Accuracy of Implicit Solvent Models for Protein Folding
For all of the folding simulations presented here, we have used
the GB/SA method.27 While GB/SA makes connections to physical
arguments about the nature of interactions via internal vs external
dielectrics, it is an empirical theory. Nevertheless, GB/SA performs
very well at predicting the solvation free energy of small
molecules,27 and it is perhaps not surprising that it appears to be
sufficiently accurate in the prediction of the folding rate of
small proteins and peptides. Moreover, Caflisch and co-workers have
also had successful results using even simpler implicit solvation
models (distance-dependent dielectric with a surface area term
following Still et al.27) in folding simulations.31,45 However, it
is unclear whether similar accuracy (in either rate prediction or
even in reaching the native state) would be achieved using implicit
solvation models for folding simulations of larger proteins.

FIGURE 5 Anatomy of a folding trajectory of the villin
headpiece. (a) Significant representative conformations along the
trajectory. The protein is visualized as a backbone trace with the
aromatic residues (PHE7, PHE11, PHE18, TRP24, PHE36) space filled
and colored gray, red, cyan, yellow, and blue, respectively. (b)
Secondary structure from DSSP51 (black = helix, gray = turn,
white = no structure). (c) Native contact density for each residue
(blue = low, red = middle, yellow = high). (d) Radius of gyration
and RMSD from the native state (the native state is defined from the
average of a 10 ns traditional MD simulation at 300 K starting from
the NMR structure38) are plotted; we use only α-carbons in this
calculation and omit the first and last 2 residues in the RMS
calculation (as they are unstructured in the refined NMR
conformation). (e) Solvation free energy (Fsolvation),
charge/charge energy (Echarge), and total internal free energy
(Ftotal) vs time. While Ftotal gradually decreases over the whole
simulation, we see that Fsolvation has an initial decrease, but then
gradually increases over the simulation, whereas Echarge
consistently decreases. In fact, Echarge and Fsolvation are highly
correlated (R² = 0.92) during this trajectory. (f) Fraction of all
and native contacts vs time. In all frames, time is on the
horizontal axis. It is most natural to report time in terms of 100
ps generations (see Figure 1); roughly, one can approximate time8 as
250 processors × 0.1 ns/generation/processor = 25 ns/generation. We
label conformations by their generation (e.g., G225 in the upper
right).

FIGURE 6 Examination of the I to N transition of the villin
headpiece in detail. We see that in order to correctly fold, the
protein must first unfold and open its conformation in order for it
to form the missing native state interactions. Visualization is the
same as in Figure 5a. See text for more details. The final state
agrees reasonably well with the average refined NMR conformation
from the Protein Data Bank.38
Second, we stress that when employing any implicit solvent
model for the faithful reproduction of kinetics, one must take into
account the viscosity of the solvent (in addition to the dielectric
and hydrophobic aspects). We have done so using Allen's stochastic
integrator, as implemented in Tinker.25 This scheme is an extension
to Langevin dynamics, and includes both viscous drag and random
forces in the force equation, to match the viscosity of the solvent
and the random thermal fluctuations that the solvent would apply to
the solute. However, unlike pure Langevin dynamics, this method
scales the drag and the random force by the solvent-exposed area in
order to only apply these solvation effects to atoms that are
actually solvent exposed. Often, implicit solvation is run without
any viscosity model (or with a viscosity considerably lower than
that of water,45 i.e., a viscosity parameter γ of 90/ps), which
leads to differences in sampling and cannot lead to accurate rate
predictions. This is, in some cases, considered to be an advantage
of implicit solvation: one would expect that the speed of dynamics
is inversely proportional to viscosity. However, since the magnitude
of the random forces also increases with the viscosity, decreasing
the viscosity diminishes the strength of these random forces.
Counterintuitively, this may actually decrease the sampling as it is
these very random forces that enable the system to cross free
energy barriers.18 Since our goal is the faithful reproduction of
folding kinetics, we have chosen a viscosity damping parameter in
order to match that of water.46
Third, it is interesting to consider the possible differences
between implicit and explicit solvation models. While implicit
solvation models can capture many important properties of the
solvent, such as the dielectric effect, hydrophobicity, and
viscosity, there are effects that are missing. In particular, any
physical effect that arises from the discrete nature of water
molecules, such as proteins hydrogen bonding to water,
solvent-separated minima, or the "drying" effect, will be lost.
Again, it is important to keep in mind that all models are
approximations, and the relevant question is not whether a model is
"correct" (since all models are incorrect at some level), but
whether a given model is "correct enough" to capture the relevant
physics to faithfully reproduce and predict the physical effect of
interest. For the small proteins we have examined, it appears that
the model we have employed is indeed correct enough for predicting
rates (see Figure 8 and Table I). This implies that either the
discrete nature of water is not relevant for folding, that folding
rates are fairly robust to such inaccuracies of the model, or that
there is a convenient cancellation of errors. In order to discern
between these possibilities, one must resimulate these proteins
with explicit solvation models and compare the rate and mechanistic
predictions; if these predictions agree, then perhaps the potential
gain in accuracy of explicit solvent models would indeed not be
relevant for folding kinetics. Furthermore, we stress that explicit
solvent models make approximations as well,47 and there is no reason
why an arbitrary explicit solvation model would necessarily be
better than a well-designed implicit model.

FIGURE 7 Rough characterization of the underlying free energy
landscape for the villin headpiece. We plot the log of the
probability of finding conformations with a given Rg and RMSD in (a)
folding and (b) unfolding simulations. We find three distinct
probability maxima (which correspond to free energy minima): an
unfolded state, molten-globule-like intermediate, and the native
state. This landscape generated from kinetic data qualitatively
agrees with previous, more extensive thermodynamic calculations.2
Finally, it is important to consider that the question of the
validity of implicit solvation models goes beyond a simple debate
over the validity of a particular computational methodology, but
also impacts the way in which one thinks of protein structure in
general. If explicit solvation were critical to protein folding,
then it is likely that one should not think of protein structures
without the requisite cloud of water molecules they interact with,
as it is the very discrete and potentially structural aspects of the
water that play a large role in folding. However, if implicit
solvent models are sufficiently accurate, this suggests that a
structural picture of a protein alone (implicitly considering the
effects of water, such as hydrophobicity, etc., but not with a
discrete, structural form in mind) is indeed sufficient.
Alternative Methods to Simulate Dynamical Events on Long Time
Scales Using Low Viscosity Simulation
Water is a relatively viscous solvent. Indeed, in quantitative
terms, the damping constant of water is on the order of γ = 100/ps.
It is intriguing to consider whether one can simulate the effective
result of long time-scale events by simulating the effect of much
lower viscosity solvents, say γ = 1/ps, while keeping all of the
other properties of a water-implicit solvation model unchanged. This
is appealing since this is trivial to perform with implicit
solvation models and this ability to explicitly set the viscosity of
the solvent in the model may represent one of the great strengths of
using implicit solvent models.
FIGURE 8 Comparison of theoretical rate predictions from
Folding@Home and the corresponding experimental folding rate
determinations. We compare the folding rates for the proteins and
polymers described in this review: PPA, polyalanine-based helices,
the C-terminal β-hairpin from protein G, and the villin headpiece.
If our folding rate prediction were perfect, all points would lie on
the diagonal line. The agreement strongly suggests that our method
can accurately predict the absolute folding rate for small proteins,
peptides, and foldamers.
If one were to decrease the viscosity by 100 times, could one
simulate 10 ns and expect to get 1000 ns = 1 μs of sampling? This
question has been addressed in many models and systems. For example,
Klimov and Thirumalai48 have shown that (for a coarse-grained
protein model) the rate of folding increases with decreasing
viscosity down to a point (γ ≈ 1/ps), below which the rate decreases
with decreasing viscosity. This nonmonotonic dependence can be
understood in terms of the dual role of the solvent viscosity:
viscosity retards motion through the solvent, but also creates the
random forces that are needed to drive the system over energy and
free energy barriers. Thus, the rate should be optimal at
intermediate viscosity. If the peak in the rate vs viscosity curve
does occur at γ ≈ 1/ps, then one should expect an increase in
sampling at viscosities in between 1/ps and 100/ps, and for
thermodynamic properties, this increased sampling should be
beneficial. However, it is still unclear whether kinetic properties
would be unchanged by significant changes in the viscosity.
Alternative Methods to Simulate Dynamical Events on Long Time
Scales Using Large-Scale Distributed Computing
In this review, we have discussed protein folding simulations
using ensemble dynamics, our parallel-replica-like method intended
to handle free energy barrier crossing problems. The greatest
weakness of this method rests in the need for transitions and for
the accurate identification of these transitions. Our suggested
means to identify transitions, looking for energy variance spikes
during dynamics, has the benefit of being a purely thermodynamic
method and thus does not use any information about the protein
native state or any folding-related hypothesized reaction
coordinates. However, if transitions are incorrectly identified, the
validity of the resulting data is put into question. Considering
that great computational resources are needed to generate the
folding simulations presented here, this limitation could be very
expensive computationally: the failure to accurately identify
transitions may mean that the resulting data set is invalid.
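To make the idea of an energy-variance criterion concrete, the sketch below flags time points where the variance of the energy over a sliding window rises well above its typical value. This is only a schematic illustration on a synthetic, deterministic signal: the window length, the threshold factor, and the use of the median as a baseline are our own assumptions, not the detection criterion actually used in the ensemble dynamics runs.

```python
from statistics import median


def rolling_variance(values, window):
    """Population variance of each consecutive window of `window` samples."""
    variances = []
    for end in range(window, len(values) + 1):
        chunk = values[end - window:end]
        mean = sum(chunk) / window
        variances.append(sum((x - mean) ** 2 for x in chunk) / window)
    return variances


def variance_spikes(values, window, factor):
    """Indices (of the last sample in each window) where the rolling
    variance exceeds `factor` times the median rolling variance."""
    variances = rolling_variance(values, window)
    threshold = factor * median(variances)
    return [start + window - 1
            for start, var in enumerate(variances) if var > threshold]


if __name__ == "__main__":
    # Synthetic "energy" trace: small fluctuations, with a burst of large
    # fluctuations between samples 150 and 169 standing in for a transition.
    energies = [(1.0 if i % 2 == 0 else -1.0) * (10.0 if 150 <= i < 170 else 1.0)
                for i in range(300)]
    spikes = variance_spikes(energies, window=10, factor=5.0)
    print("first flagged sample:", spikes[0], " last flagged sample:", spikes[-1])
```

A real trajectory is far noisier than this toy signal, which is exactly why false positives and false negatives in transition detection are a concern.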
It is interesting to consider a simpler method, which does not
have the liabilities described above. Namely, instead of loosely
coupling simulations (i.e., restarting simulations after transitions
have been detected), one could simply run a large number M of
completely independent simulations. For single exponential kinetics,
we would still gain an M times speed-up (as described above).
However, even if the reaction under study did not have single
exponential kinetics, independent trajectories might still have
value. Indeed, in a sense, a set of thousands of simulations each on
the tens of nanosecond time scale is a data set that stands on its
own. For example, one could interpret the results for
single-exponential kinetics, by examining the fraction f(t) that
fold in time t and fitting a rate with the slope. However, one would
not be limited to this exponential kinetics analysis, and this data
could be reanalyzed a posteriori to test new hypotheses or kinetic
models. Considering the great computational cost of producing these
data sets, this more "pure" method for simulating kinetics has a
great appeal. Indeed, we have reexamined the folding of villin with
uncoupled trajectories49 in this manner (B. Zagrovic, et al. J Mol
Biol, 2002, in press) and it will be interesting to determine how
the uncoupled simulations differ (e.g., in rates and mechanism) from
those presented here.
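One way to extract a rate from a set of independent, fixed-length trajectories is sketched below. Instead of the slope fit to f(t) mentioned above, it uses the standard maximum-likelihood estimator for exponentially distributed times with right censoring (runs that end before folding): the mean folding time is the total simulated time divided by the number of folding events. The data here are synthetic, generated from an assumed 20 μs mean folding time with 5 μs runs, and the function names are hypothetical.

```python
import random


def simulate_first_passage(rng, mean_time, run_length):
    """Folding time of one independent run, or None if it did not fold in time."""
    t = rng.expovariate(1.0 / mean_time)
    return t if t <= run_length else None


def estimate_mean_folding_time(folding_times, run_length):
    """Censored-exponential maximum-likelihood estimate of the mean folding time.

    folding_times -- one entry per run: the folding time, or None if unfolded
    run_length    -- length of every run (same time units)
    """
    n_folded = sum(1 for t in folding_times if t is not None)
    if n_folded == 0:
        raise ValueError("no folding events: only a lower bound is available")
    total_time = sum(t if t is not None else run_length for t in folding_times)
    return total_time / n_folded


if __name__ == "__main__":
    rng = random.Random(2)            # fixed seed for reproducibility
    true_mean, run_length = 20.0, 5.0  # microseconds (synthetic example)
    runs = [simulate_first_passage(rng, true_mean, run_length)
            for _ in range(10000)]
    folded = sum(1 for t in runs if t is not None)
    print("fraction folded:", folded / len(runs))
    print("estimated mean folding time:",
          round(estimate_mean_folding_time(runs, run_length), 1))
```

Because the raw set of folding times is retained, the same data could later be refitted with a non-exponential model, which is the flexibility argued for above.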
Is Transition Detection Really Necessary?
The discussion above regarding the possibility of using
independent trajectories and still gaining a speed increase linear
with the number of processors raises the question, "Must one bother
with transition detection as used here?" Another way to examine
this question is to ask with what frequency transitions were
detected in the examples described here. For the helix folding
simulations, 2 or 3 transitions were detected before the simulation
reached the folded state. The first transition accompanied the first
formation of helical structure and the other transitions occurred
during propagation. For the larger molecules, transitions were even
more infrequent. For example, the β-hairpin and villin simulations
typically had a single transition that occurred early in the
folding process, accompanying the collapse of the protein chain.
Thus, we find that for the larger molecules, transition detection
was likely not necessary, since the transitions occurred early and
thus the simulations were essentially running independently (as
suggested in the subsection above). We suggest that the transitions
were not needed since these larger molecules fold with single
exponential kinetics, and thus have a single rate-limiting step. The
helices are potentially different: the rates of nucleation and
propagation of helices in our model are not highly separated (e.g.,
see the trajectories in Figure 2) and thus transition detection may
be needed in the helix case, but not for the β-hairpin or villin
molecules.
CONCLUSIONS
Comparison to Experiment
With a wide range of molecules under study, from the
nonbiological PPA helices to the 36-residue villin headpiece, we
have simulated a set of molecules with a range of folding times
spanning roughly four orders of magnitude, from nanoseconds to tens
of microseconds. Since the primary means of comparison to experiment
is the comparison of rates determined by simulation and experiment,
we concentrate on our prediction of rates. Figure 8 shows a striking
agreement between predicted and experimental rates (see Table I for
details). Of course, with just four molecules simulated, it is
unclear whether this agreement is simply fortuitous. In order to
more fully address this question, we plan to simulate the folding
kinetics of additional molecules, including larger and more slowly
folding proteins. Indeed, more recent work on a small ββα-fold (C.
Snow et al., Nature, 2002, in press) and villin (B. Zagrovic et al.,
J Mol Biol, 2002, in press) also results in strong agreement with
experimental rates.
With this quantitative agreement with experiment, it is also
interesting to ask how our results reflect upon the quality of
modern force fields. On the surface, one might conclude that our
agreement with experiment is evidence that force fields are
sufficiently accurate. We stress that the only question that can
truly be addressed by our work is whether force fields are
sufficiently accurate to reproduce experimental rates and
structures. Ignoring for the moment the possibility that the
agreement may be fortuitous, the agreement between our simulations
and experiments suggests that force fields are sufficiently accurate
to predict the folding rates of small proteins. Indeed, this
accuracy can be quantified in terms of the strong correlation
(R² = 0.996) and low p value (0.000026) of the logarithms of the
predicted to experimental rates. However, this statement should
definitely not be overgeneralized: it is unclear whether the
analogous rate prediction for large protein folding would be
similarly accurate or whether these results are fortuitous (such
that the simulation of additional proteins would weaken the
correlation). We are currently addressing this question by examining
the folding of different and larger proteins.
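The log-scale correlation can be recomputed directly from the four central values listed in Table I. The sketch below is a plain least-squares R² on log10 of the folding times, with a hypothetical helper name. It ignores the quoted uncertainties and does not attempt to reproduce the p value, so it should be read as a consistency check rather than as the original statistical analysis.

```python
import math

# Central values from Table I: (predicted time, experimental time) in ns
TABLE_I_NS = {
    "PPA": (5.3, 10.0),
    "Fs peptide": (127.0, 160.0),
    "beta-hairpin (protein G)": (4700.0, 6000.0),
    "villin headpiece": (20000.0, 11111.0),
}


def r_squared(xs, ys):
    """Squared Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


if __name__ == "__main__":
    log_pred = [math.log10(p) for p, _ in TABLE_I_NS.values()]
    log_expt = [math.log10(e) for _, e in TABLE_I_NS.values()]
    r2 = r_squared(log_pred, log_expt)
    print("R^2 of log10 times:", round(r2, 3))
    print("correlation coefficient r:", round(math.sqrt(r2), 3))
```

With these four points the log-scale R² comes out at about 0.99, so the strength of the correlation rests on a wide dynamic range covered by very few molecules.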
What Have We Learned About the Protein Folding Mechanism?
The question of how proteins fold has been asked for decades, and
remains a difficult problem due to the complexities and difficulties
of computational and experimental methods. However, the methods
presented here have allowed us to understand, for the first time,
the folding mechanism for some small fast folding proteins, in
atomistic detail with experimentally validated rates (Figure 8 and
Table I). We have been able to discern the mechanism of a few
particular proteins, but it is unclear whether we can expect these
to generalize to larger and more complex proteins. An understanding
of the mechanism of larger proteins will likely require further
direct simulation. However, considering the diversity of mechanistic
results found even in these small proteins, it seems reasonable to
consider that there may not be a single, universal folding
mechanism. Indeed, evolution may be mechanistically agnostic and may
have selected proteins for function, without concern for folding
mechanism. This could lead to a variety of protein folding
mechanisms (even for sequences which fold to the same structure),
and thus there may not be a single answer to the question of how
proteins fold.
Future Perspectives
The ensemble dynamics technique coupled with distributed computing
has allowed us to break fundamental computational barriers in the
dynamics of complex systems, such as protein and polymer folding.
However, one need not build a distributed computing infrastructure to
gain the benefits of our methods. Indeed, with a cluster accessible
to almost any group (e.g., a hundred PCs), one can simulate 100 ns in
a day (assuming 1 ns per processor per day). This is a significant
advance over the state of the art of traditional parallel molecular
dynamics.⁶ Of course, the combination of our method on top of
traditional parallel MD (i.e., using traditional MD to speed up
individual simulations to the maximum scalability of parallel MD and
then using our method to statistically sample runs) may lead to the
greatest advance, especially on massively parallel architectures with
millions of processors, such as IBM's proposed million-processor Blue
Gene supercomputer.
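
The throughput estimate above is simple arithmetic, sketched below under the stated assumption of 1 ns of simulated time per processor per day. The function names are hypothetical, and the figure is aggregate sampled time summed over independent runs, not the length of any single trajectory.

```python
def aggregate_ns_per_day(n_processors, ns_per_processor_per_day=1.0):
    """Total simulated time per day, summed over independently running processors."""
    return n_processors * ns_per_processor_per_day


def days_to_reach(total_ns, n_processors, ns_per_processor_per_day=1.0):
    """Wall-clock days needed to accumulate total_ns of aggregate simulation time."""
    return total_ns / aggregate_ns_per_day(n_processors, ns_per_processor_per_day)


if __name__ == "__main__":
    # The cluster example from the text: 100 PCs at an assumed 1 ns/processor/day
    print(aggregate_ns_per_day(100))   # aggregate ns per day
    # Hypothetical target: 1000 ns (1 microsecond) of aggregate sampling
    print(days_to_reach(1000, 100))    # days required
```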
Moreover, this technique should have broad applicability to any
dynamical system that progresses by crossing free energy barriers,
especially in the most intractable problems with high free energy
barriers. It could also serve to augment existing computational
methods, such as path sampling²⁰ (which requires simulating a fast
trajectory over the relevant free energy barriers) or the
determination of transition states using pfold analysis⁵⁰ (which is
currently hindered by simulations dwelling in transiently stable
intermediate states).
Finally, with the ability to reach time scales for protein folding in
all-atom simulations (i.e., hundreds of microseconds), it is natural
to ask whether the potential sets are adequate for folding. Indeed,
due to the great number of calculations involved, distributed
computing networks will most likely play an important part in
providing sufficient computational power to extensively test and
validate new potential sets. Considering the omnipresent role of
force fields in structural biology, ranging from simulations, to
informatics, to x-ray and NMR refinement, the ability to
quantitatively test force fields will likely play a critically
important role in structural biology and virtually all related
fields.
It is our honor to have this work be part of the memorial issue for
Peter Kollman. Peter was a great inspiration as a scientist and a
senior colleague, generous with time and with praise and with
numerous useful and encouraging suggestions. Indeed, in many ways our
work on folding dynamics was inspired by his work with Yong Duan on
the villin headpiece, as it opened our eyes to see just how close the
field was to simulating time scales relevant for folding and thus to
finally directly simulate protein folding.
We would also like to thank Kevin Plaxco for a critical reading of
the manuscript, Robert Baldwin and Susan Marqusee for their comments
about α-helices, Martin Gruebele and Jeff Moore for their comments
about PPA folding kinetics, Dan Raleigh and collaborators for their
unpublished results on the experimental folding time for the
36-residue villin fragment, Jay Ponder for allowing our use of the
Tinker MD code in the Folding@Home client, and Bill Swope and Jed
Pitera for their advice with modifying and extending Tinker. We would
also like to thank Scott Griffin and the other members of the Intel
distributed computing team for their help with the Folding@Home
infrastructure and support of our work.
Much of this method was developed, tested, originally implemented,
and run on the T3E at the National Energy Research Scientific
Computing (NERSC) center at Lawrence Berkeley National Labs. We also
thank the Advanced Biomedical Computing Center, National Cancer
Institute (NCI), Frederick, Maryland, for its assistance and use of
its SP3 and SV1. The helix and PPA production calculations were
performed at NERSC and NCI. The other production calculations (beta
hairpin and villin) were run on Folding@Home. We would especially
like to thank the tens of thousands of Folding@Home contributors,
without whom this work would not be possible (a complete list of
contributors can be found at http://Folding.Stanford.edu).
JC and EJS acknowledge support in the form of Computational Science
Graduate Fellowship (DOE). SL is a James Clark fellow and
acknowledges the support of a Stanford Graduate Fellowship. MS
acknowledges the support of a Fannie and John Hertz fellowship and a
Stanford Graduate Fellowship. YMR acknowledges the support of a
Stanford Graduate Fellowship. CS and BZ each acknowledge support from
a HHMI predoctoral fellowship. This work was supported by grants from
the ACS PRF (36028-AC4), NSF MRSEC CPIMA (DMR-9808677), NIH BISTI
(IP20 GM64782-01), ARO (41778-LS-RIP), and Stanford University
(Internet 2), as well as by gifts from the Intel and Google
corporations.
REFERENCES
1. Dill, K. A.; Chan, H. S. Nat Struct Biol 1997, 4, 10–19.
2. Brooks, C. L.; Gruebele, M.; Onuchic, J. N.; Wolynes, P. G. Proc Natl Acad Sci USA 1998, 95, 11037–11038.
3. Dobson, C. M.; Sali, A.; Karplus, M. Angew Chem Int Edit Engl 1998, 37, 868–893.
4. Prusiner, S. Proc Natl Acad Sci USA 1998, 95, 13363–13383.
5. Nelson, J. C.; Saven, J. G.; Moore, J. S.; Wolynes, P. G. Science 1997, 277, 1793–1796.
6. Duan, Y.; Kollman, P. A. Science 1998, 282, 740–744.
7. Shirts, M. S.; Pande, V. S. Science 2000, 290, 1903–1904.
8. Shirts, M. S.; Pande, V. S. Phys Rev Lett 2000.
9. Voter, A. F. Phys Rev B 1998, 57, 13985–13988.
10. Williams, S.; et al. Biochemistry 1996, 35, 691–697.
11. Thompson, P.; Eaton, W.; Hofrichter, J. Biochemistry 1997, 36, 9200–9210.
12. Nelson, J. C.; Saven, J. G.; Moore, J. S.; Wolynes, P. G. Science 1997, 277, 1793–1796.
13. Elmer, S.; Pande, V. S. J Phys Chem B 2001, 105, 482–485.
14. Yang, W. Y.; Prince, R. B.; Sabelko, J.; Moore, J. S.; Gruebele, M. J Am Chem Soc 2000, 122, 3248–3249.
15. Blanco, F. J.; Serrano, L. Eur J Biochem 1995, 230, 634–649.
16. Munoz, V.; Thompson, P. A.; Hofrichter, J.; Eaton, W. A. Nature 1997, 390, 196–199.
17. Bryngelson, J. D.; Onuchic, J. N.; Socci, N. D.; Wolynes, P. G. Proteins Struct Funct Genet 1995, 21, 167–195.
18. Chandler, D. J Chem Phys 1978, 68, 2959–2970.
19. Pande, V. S.; Rokhsar, D. S. Proc Natl Acad Sci 1999, 96, 9062–9067.
20. Dellago, C.; Bolhuis, P. G.; Csajka, F. S.; Chandler, D. J Chem Phys 1998, 108, 1964–1977.
21. Doniach, S.; Eastman, P. A. Curr Opin Struct Biol 1999, 9, 157–163.
22. Elber, R. Curr Opin Struct Biol 1996, 6, 232–235.
23. Pande, V. S.; Rokhsar, D. S. Proc Natl Acad Sci 1999, 96, 1273–1278.
24. Pande, V. S.; Grosberg, A. Y.; Tanaka, T.; Rokhsar, D. S. Curr Opin Struct Biol 1998, 8, 68–79.
25. Pappu, R. V.; Hart, R. K.; Ponder, J. W. J Phys Chem B 1998, 102, 9725–9742.
26. Jorgensen, W. L.; Tirado-Rives, J. J Am Chem Soc 1988, 110, 1666–1671.
27. Qiu, D.; Shenkin, P. S.; Hollinger, F. P.; Still, W. C. J Phys Chem A 1997, 101, 3005–3014.
28. Andersen, H. C. J Comput Phys 1983, 52, 24–34.
29. Pande, V. S.; Rokhsar, D. S. Proc Natl Acad Sci 1999, 96, 1273–1278.
30. Young, W. S.; Brooks, C. J Mol Biol 1996, 96, 560–572.
31. Ferrara, P.; Apostolakis, J.; Caflisch, A. J Phys Chem B 2000, 104, 5000–5010.
32. Zagrovic, B.; Sorin, E. J.; Pande, V. J Mol Biol 2001, 313, 151–169.
33. Dinner, A. R.; Lazaridis, T.; Karplus, M. Proc Natl Acad Sci USA 1999, 96, 9068–9073.
34. Zhou, R.; Berne, B.; Germain, R. Proc Natl Acad Sci USA 2001, 98, 14931–14936.
35. Garcia, A. E.; Sanbonmatsu, K. Y. Proteins 2001, 42, 345–354.
36. Bryant, Z.; Pande, V. S.; Rokhsar, D. S. Biophys J 2000, 78, 584–589.
37. Raleigh, D.; et al. Private communication.
38. McKnight, C. J.; Matsudaira, P. T.; Kim, P. S. Nat Struct Biol 1997, 4, 180–184.
39. Thompson, P. A.; Eaton, W. A.; Hofrichter, J. Biochemistry 1997, 36, 9200–9210.
40. Pande, V. S.; Grosberg, A. Y.; Tanaka, T.; Rokhsar, D. S. Curr Opin Struct Biol 1998, 8, 68–79.
41. Ptitsyn, O. B. Adv Protein Chem 1995, 47, 83–229.
42. Finkelstein, A. V.; Shakhnovich, E. I. Biopolymers 1989, 28, 1667–1680.
43. Pande, V. S.; Rokhsar, D. S. Proc Natl Acad Sci USA 1998, 95, 1490–1494.
44. Pande, V. S.; Grosberg, A. Y.; Tanaka, T. J Chem Phys 1995, 103, 9482–9491.
45. Ferrara, P.; Caflisch, A. Proc Natl Acad Sci 2000, 97, 10780–10785.
46. Cramer, C. J.; Truhlar, D. G. Chem Rev 1999, 99, 2161–2200.
47. Jorgensen, W. L. J Am Chem Soc 1981, 103, 335.
48. Klimov, D.; Thirumalai, D. Phys Rev Lett 1997, 79, 317–320.
49. Zagrovic, B.; Pande, V. S. 2002, in preparation.
50. Du, R.; Pande, V. S.; Grosberg, A. Y.; Tanaka, T.; Shakhnovich, E. I. J Chem Phys 1998, 108, 334–350.
51. Kabsch, W.; Sander, C. Biopolymers 1983, 22, 2577–2637.