Skjærven et al. BMC Bioinformatics 2014, 15:399
http://www.biomedcentral.com/1471-2105/15/399
SOFTWARE Open Access
Integrating protein structural dynamics and evolutionary analysis with Bio3D
Lars Skjærven1,2*, Xin-Qiu Yao3, Guido Scarabelli3 and Barry J Grant3*
Abstract
Background: Popular bioinformatics approaches for studying protein functional dynamics include comparisons of crystallographic structures, molecular dynamics simulations and normal mode analysis. However, determining how observed displacements and predicted motions from these traditionally separate analyses relate to each other, as well as to the evolution of sequence, structure and function within large protein families, remains a considerable challenge. This is in part due to the general lack of tools that integrate information of molecular structure, dynamics and evolution.
Results: Here, we describe the integration of new methodologies for evolutionary sequence, structure and simulation analysis into the Bio3D package. This major update includes unique high-throughput normal mode analysis for examining and contrasting the dynamics of related proteins with non-identical sequences and structures, as well as new methods for quantifying dynamical couplings and their residue-wise dissection from correlation network analysis. These new methodologies are integrated with major biomolecular databases as well as established methods for evolutionary sequence and comparative structural analysis. New functionality for directly comparing results derived from normal modes, molecular dynamics and principal component analysis of heterogeneous experimental structure distributions is also included. We demonstrate these integrated capabilities with example applications to dihydrofolate reductase and heterotrimeric G-protein families along with a discussion of the mechanistic insight provided in each case.
Conclusions: The integration of structural dynamics and evolutionary analysis in Bio3D enables researchers to go beyond a prediction of single protein dynamics to investigate dynamical features across large protein families. The Bio3D package is distributed with full source code and extensive documentation as a platform independent R package under a GPL2 license from http://thegrantlab.org/bio3d/.
Keywords: Protein structure, Protein dynamics, Allostery, Normal mode analysis, Molecular dynamics, Principal component analysis, Evolution
* Correspondence: [email protected]; [email protected]
1Department of Biomedicine, University of Bergen, Bergen, Norway
3Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA
Full list of author information is available at the end of the article
© 2014 Skjærven et al.; licensee BioMed Central. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Background
The internal motions and intrinsic dynamics of proteins have increasingly been recognized as essential for protein function and activity [1,2]. Notable examples include the dynamic rearrangements that facilitate many enzyme turnover events [3]; the force producing structural changes of motor proteins [4]; and the conformational and allosteric mechanisms that modulate protein associations in many signal transduction cascades [5,6]. Dissecting these functional motions typically relies on the accumulation and comparison of multiple high-resolution structures for a given protein. The rapidly increasing availability of such data is precipitating the need for new approaches that integrate knowledge of molecular structure, dynamics and evolution in functional analysis. In addition to multiple structure comparisons, computational methods including molecular dynamics (MD) and normal mode analysis (NMA) have emerged as popular approaches for characterizing protein dynamics and flexibility [7-9]. However, the general lack of tools that integrate these traditionally separate analyses with methods for sequence and structural analysis represents a practical bottleneck for the systematic
study of the evolution of functional motions in large protein families and superfamilies.
Current software solutions lack much of the flexibility needed for comparative studies of large heterogeneous structural datasets. For example, popular web servers for NMA typically operate on single structures and do not permit high-throughput calculations [10-12]. Software libraries such as the Molecular Modeling ToolKit (MMTK) [13] and the packages ProDy [14] and MAVEN [15] provide more advanced calculation options but generally lack direct functionality for the quantitative comparison of dynamic features of non-identical structures and sequences. These limitations complicate the assessment of functional motions in an evolutionary context. The Bio3D package [16] now provides these essential components thus greatly facilitating the study of evolutionarily related ensembles and their functional dynamics. Here, using selected case studies, we demonstrate the integration of versatile new ensemble NMA approaches and correlation network analysis facilities with enhanced interactive tools for extracting mechanistic information from molecular sequences, crystallographic structural ensembles and MD trajectories. This major update to the Bio3D package includes extensive functionality to analyze and visualize protein dynamics from both experiment and simulation, together with tools for systematic retrieval and analysis of publicly available sequence and structural data.
Package overview and architecture
Bio3D version 2.0 now provides extensive functionality for high-throughput NMA of an ensemble of protein structures facilitating the study of evolutionary and comparative protein dynamics across protein families. The NMA module couples to major protein structure and sequence databases (PDB, PFAM, UniProt and NR) and associated search tools (including BLAST [17] and HMMER [18]). This enables the automated identification and analysis of related protein structures. Efficient elastic network model (ENM) NMA is implemented with multicore functionality to enable rapid calculation of modes even for large structural ensembles. Results of the ensemble NMA algorithm include aligned eigenvectors and mode fluctuations for the different structures in the ensemble. These can readily be analyzed and compared with a variety of implemented methodologies. This facilitates the prediction and identification of distinct patterns of flexibility among protein families or between different conformational states of the same protein. The user can perform ensemble NMA by providing a set of either PDB structures or RCSB PDB codes. Alternatively a single protein sequence or structure can be used to search the PDB for similar structures to analyze.
A typical user workflow for the comparison of cross-species protein flexibility is depicted in Figure 1. In this example, we begin by fetching the protein sequence of a PDB structure with the get.seq() function. This sequence is then used in a BLAST or HMMER search of the full PDB database to identify related protein structures (functions blast() or hmmer()). Identified structures can then optionally be downloaded (with the function get.pdb()) and aligned using the function pdbaln(). The output will be a multiple sequence alignment together with aligned coordinate data and associated attributes. Ensemble NMA on all aligned structures can then be carried out with function nma(). The function provides an "eNMA" object containing aligned eigenvectors, mode fluctuations, and all pair-wise root mean squared inner product (RMSIP) values. These results are formatted to facilitate direct comparison of the flexibility patterns between protein structures, as well as clustering based on the pair-wise modes similarity. Also shown in Figure 1 is the typical application of principal component analysis (PCA) on the same experimental structures using the function pca(). This provides principal components of the same dimensions as the normal modes facilitating direct comparison of mode fluctuations, or alternatively mode vectors using functions such as rmsip() and overlap(). Indeed extensive new functions for the analysis of normal modes and principal components are now provided. These include cross-correlation, fluctuations, overlap, vector field, dynamic sub-domain clustering, correlation network analysis and movie generation along with integrated functions for plotting and visualization. Extensive multicore support is also included for a number of commonly used functions. This enables a significant speed-up for time-consuming tasks, such as ensemble NMA for large protein families, modes comparison, domain assignment, correlation analysis for multiple structures, and analysis for long-timescale MD simulations. Comprehensive tutorials integrating NMA with PCA, simulation data from MD, and additional sequence and structure analysis methods, including correlation network analysis, are available in Additional files 1, 2, 3 and 4.
Figure 1 Example workflow for ensemble NMA and PCA. In this example the user starts with a single protein identifier, performs a BLAST search to identify related structures, fetches and aligns the identified structures, performs PCA and calculates the normal modes for each structure to obtain aligned normal mode vectors. Result interpretation and comparison of mode subsets is made available through various methods for similarity assessment.
Implementation
Elastic network models
A unique collection of multiple ENM force fields is now provided within Bio3D. These include the popular anisotropic network model (ANM) [19], the associated parameter-free ANM [20], and a more sophisticated C-alpha force field derived from fitting to the Amber94 all-atom potential [21]. Also included is the REACH force field employing force constants derived from MD simulations [22], and a recent parameterization providing sequence-specific force constants obtained from an ensemble of 1500 NMR structures [23]. A convenient interface for the application of user-defined force fields is also provided enabling customized normal mode calculations, perturbation analysis, and other more advanced options as detailed online and in Additional file 1.
All implemented ENMs considered here employ a harmonic potential, where the potential energy between residues i and j is given by:
$$U_{ij}(r) = k\left(\|r^0_{ij}\|\right)\left(\|r_{ij}\| - \|r^0_{ij}\|\right)^2$$
where $r$ is the current protein conformation, $r^0$ represents the equilibrium conformation, and $\|r_{ij}\|$ the distance between residues i and j [24,25]. By default, the Bio3D package employs the C-alpha force field [21] derived from fitting to the Amber94 all-atom potential with pair force constants given by
$$k(r) = \begin{cases} 8.6 \cdot 10^{2} \cdot r - 2.39 \cdot 10^{3}, & \text{for } r < 4.0\ \text{Å} \\ 128 \cdot 10^{4} \cdot r^{-6}, & \text{for } r \geq 4.0\ \text{Å} \end{cases}$$
with units of k(r) given in kJ mol−1 Å−2. The selection of different force fields is described in detail both online and in Additional file 1.
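The piecewise force constant and the harmonic pair energy defined above can be evaluated directly. The following is a minimal plain-Python sketch written for illustration only; it is not the Bio3D implementation, and the function names calpha_force_constant and pair_energy are hypothetical. Distances are assumed to be in Å and the numerical constants are taken as printed in the text.

```python
def calpha_force_constant(r):
    """Pair force constant k(r) in kJ mol^-1 Å^-2 for a C-alpha distance r in Å.

    Hypothetical helper: constants copied from the piecewise definition in the text.
    """
    if r < 4.0:
        # Short-range branch: linear in r.
        return 8.6e2 * r - 2.39e3
    # Long-range branch: decays as r^-6.
    return 128e4 * r ** -6


def pair_energy(r, r0):
    """Harmonic energy U_ij = k(r0) * (r - r0)^2 for distances r and r0 in Å."""
    return calpha_force_constant(r0) * (r - r0) ** 2


if __name__ == "__main__":
    # Tabulate k(r0) for a bonded-like and two longer equilibrium distances.
    for r0 in (3.8, 4.0, 10.0):
        print(r0, calpha_force_constant(r0))
```

The force constant is evaluated at the equilibrium distance, so each residue pair keeps a fixed spring stiffness while the structure is displaced.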
Ensemble NMA
Integrated multiple sequence and structural alignment methods are utilized to facilitate the analysis of structures of unequal composition and length. From these alignments, equivalent atom positions across structure ensembles are identified and normal mode vectors determined by calculating the effective force-constant Hessian matrix $\hat{K}$ as
$$\hat{K} = K_{AA} - K_{AQ} K_{QQ}^{-1} K_{QA}$$
where $K_{AA}$ represents the sub-matrix of $K$ corresponding to the aligned C-alpha atoms, $K_{QQ}$ for the gapped regions, and $K_{AQ}$ and $K_{QA}$ are the sub-matrices relating the aligned and gapped sites [21,26]. The normal modes of the individual structure in the ensemble can then be obtained by solving the eigenvalue problem
$$V^{T} \hat{K} V = \lambda$$
where $V$ is the matrix of eigenvectors and $\lambda$ the associated eigenvalues.
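To make the effective Hessian concrete, the sketch below applies the same block formula to a toy one-dimensional chain of three beads joined by two unit springs, treating the middle bead as a gapped site. It is an illustrative plain-Python example under these assumptions, not Bio3D code; the helpers invert, matmul, submatrix and effective_hessian are hypothetical names, and a real calculation would operate on 3N x 3N matrices with a numerical linear algebra library.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [[float(x) for x in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def submatrix(k, rows, cols):
    """Extract the block of k given by the row and column index lists."""
    return [[k[i][j] for j in cols] for i in rows]


def effective_hessian(k, aligned, gapped):
    """Return K_AA - K_AQ K_QQ^-1 K_QA for the given index sets."""
    k_aa = submatrix(k, aligned, aligned)
    k_aq = submatrix(k, aligned, gapped)
    k_qq = submatrix(k, gapped, gapped)
    k_qa = submatrix(k, gapped, aligned)
    correction = matmul(matmul(k_aq, invert(k_qq)), k_qa)
    return [[k_aa[i][j] - correction[i][j] for j in range(len(aligned))]
            for i in range(len(aligned))]


# Toy Hessian: beads 0-1 and 1-2 are each joined by a spring of stiffness 1.
chain = [[1.0, -1.0, 0.0],
         [-1.0, 2.0, -1.0],
         [0.0, -1.0, 1.0]]
k_hat = effective_hessian(chain, aligned=[0, 2], gapped=[1])
```

For this toy chain the result is [[0.5, -0.5], [-0.5, 0.5]], that is, a single effective spring of stiffness 0.5 between the two aligned beads, as expected for two unit springs in series.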
Ensemble PCA
Principal component analysis can be performed on any structure dataset of equal or unequal sequence composition and length to capture and characterize inter-conformer relationships. The application of PCA to both distributions of experimental structures and MD trajectories, along with its ability to provide considerable insight into the nature of conformational differences in a range of protein families has been previously discussed [27-30]. Briefly, PCA is based on the diagonalization of the covariance matrix, C, with elements $C_{ij}$ calculated from the aligned and superimposed Cartesian coordinates, r, of equivalent Cα atoms:
$$C_{ij} = \left\langle \left(r_i - \langle r_i \rangle\right) \cdot \left(r_j - \langle r_j \rangle\right) \right\rangle$$
where i and j enumerate all 3N Cartesian coordinates (N is the number of atoms), and $\langle r \rangle$ denotes the ensemble average. Projection of the distribution onto the subspace defined by the PCs with the largest eigenvalues provides a low-dimensional representation of the structures facilitating inter-conformer analysis.
Similarity measures
Multiple similarity measures have been implemented to provide an enhanced framework for the assessment and comparison of ensemble NMA and PCA. These measures also facilitate clustering of proteins based on their predicted modes of motion:
Root mean square inner product (RMSIP) measures the cumulative overlap between all pairs of the l largest eigenvectors [31], and is defined as:
$$\mathrm{RMSIP} = \left( \frac{1}{l} \sum_{i=1}^{l} \sum_{j=1}^{l} \left( v_i^{A} \cdot v_j^{B} \right)^{2} \right)^{1/2}$$
where $v_i^{A}$ and $v_j^{B}$ represent the ith and jth eigenvectors obtained from protein A and B, respectively. l is the number of modes to consider which is commonly chosen to be 10. The RMSIP measure varies between 0 (orthogonal) and 1 (identical directionality).
Covariance overlap provides a measure of the correspondence between the eigenvectors ($v_i$) similar to the RMSIP measure, but also includes weighting by their associated eigenvalues ($\lambda_i$) [32]. It ranges from 0 (orthogonal) to 1 (identical), and is defined as:
$$\mathrm{CO} = 1 - \left( \frac{\sum_{i=1}^{l} \left( \lambda_i^{A} + \lambda_i^{B} \right) - 2 \sum_{i=1}^{l} \sum_{j=1}^{l} \sqrt{\lambda_i^{A} \lambda_j^{B}} \left( v_i^{A} \cdot v_j^{B} \right)^{2}}{\sum_{i=1}^{l} \left( \lambda_i^{A} + \lambda_i^{B} \right)} \right)^{1/2}$$
Bhattacharyya coefficient provides a means to compare two covariance matrices derived from NMA or an ensemble of conformers (e.g. simulation or X-ray conformers). For ENM normal modes the covariance matrix (C) can be calculated as the pseudo-inverse of the Hessian, expressed in terms of the mode eigenvectors and eigenvalues:
$$C = \sum_{i=1}^{3N-6} \frac{1}{\lambda_i} v_i v_i^{T}$$
where $v_i$ represents the ith eigenvector, $\lambda_i$ the corresponding eigenvalue, and N the number of C-alpha atoms in the protein structure (3N − 6 non-trivial modes). As formulated by Fuglebakk et al. [26,33], the Bhattacharyya coefficient can then be written as
$$\mathrm{BC} = \exp\left[ -\frac{1}{2q} \ln\left( \frac{|\Lambda|}{\left( |Q^{T} C_A Q| \, |Q^{T} C_B Q| \right)^{1/2}} \right) \right]$$
where Q is the matrix of the principal components of $(C_A + C_B)/2$, Λ is the diagonal matrix containing the corresponding eigenvalues, and q the number of modes needed to capture 90% of the variance of Q. The Bhattacharyya coefficient varies between 0 and 1, and equals 1 if the covariance matrices ($C_A$ and $C_B$) are identical.
Squared Inner Product (SIP) measures the linear correlation between two atomic fluctuation profiles [33,34]. It varies between 0 and 1 and is defined as
$$\mathrm{SIP} = \frac{\left( w_A^{T} w_B \right)^{2}}{\left( w_A^{T} w_A \right) \left( w_B^{T} w_B \right)}$$
where $w_A$ and $w_B$ are vectors of length N containing the fluctuation value (e.g. RMSF) for each atom in protein A and B, respectively.
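Three of the measures above (RMSIP, covariance overlap and SIP) reduce to a few sums over dot products. The sketch below is a plain-Python illustration of those definitions, not the Bio3D rmsip() or related functions; all function names are hypothetical. It assumes the mode vectors are orthonormal, that both proteins contribute the same number of modes, and that the eigenvalues passed to covariance_overlap are the ones intended by the formula.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rmsip(modes_a, modes_b):
    """Root mean square inner product of two equally sized mode sets."""
    n_modes = len(modes_a)
    total = sum(dot(va, vb) ** 2 for va in modes_a for vb in modes_b)
    return math.sqrt(total / n_modes)


def covariance_overlap(vals_a, modes_a, vals_b, modes_b):
    """Eigenvalue-weighted overlap of two mode sets."""
    trace_sum = sum(vals_a) + sum(vals_b)
    cross = sum(math.sqrt(la * lb) * dot(va, vb) ** 2
                for la, va in zip(vals_a, modes_a)
                for lb, vb in zip(vals_b, modes_b))
    # Clamp at zero to absorb round-off when the two sets coincide.
    return 1.0 - math.sqrt(max(0.0, trace_sum - 2.0 * cross) / trace_sum)


def sip(w_a, w_b):
    """Squared inner product of two fluctuation profiles."""
    return dot(w_a, w_b) ** 2 / (dot(w_a, w_a) * dot(w_b, w_b))


# Toy mode vectors along the Cartesian axes.
x, y, z = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
same_subspace = rmsip([x, y], [y, x])      # identical subspace
half_shared = rmsip([x, y], [x, z])        # one of two directions shared
identical_co = covariance_overlap([2.0, 1.0], [x, y], [2.0, 1.0], [x, y])
scaled_profile = sip([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Note that RMSIP depends only on the subspace spanned by each mode set, so reordering the modes within a set leaves it unchanged, whereas SIP is insensitive to a uniform rescaling of either fluctuation profile.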
PCA of cross-correlation and covariance matrices
New functionality facilitates PCA of residue-residue cross-correlations and covariance matrices derived from ensemble NMA. This analysis can be formulated as
$$B^{T} \Upsilon B = \Gamma$$
where Υ is a matrix containing the elements of the M correlation/covariance matrices (with one row per structure), B the eigenvectors and Γ the associated eigenvalues. Projection into the sub-space defined by the largest eigenvectors enables clustering of the structures based on the largest variance within the cross-correlation or covariance matrices.
All similarity measures described above can be utilized for clustering the ensemble of structures based on their normal modes. Various clustering algorithms are available, such as k-means clustering, as well as hierarchical clustering using Ward's minimum variance method, or single, complete and average linkage. The application and comparison of the described similarity measures is presented in Additional file 2.
Force constants variance weighting
We propose to incorporate knowledge on the accessible conformational ensemble (e.g. all available X-ray structures) to lift the dependency of the force constants on the single structure they were derived from. We weigh the force constants with the variance of the pairwise residue distances derived from the ensemble of structures. The weights ($W_{ij}$) and the modified force constants ($k'_{ij}(r)$) can then be calculated as
$$W_{ij} = 1 - \left( \frac{S_{ij}}{\hat{s}} \right)^{\varphi}$$
$$k'_{ij}(r) = W_{ij} \cdot k_{ij}(r)$$
where $S_{ij}$ (the elements of matrix S) represents the variance of the distance between residues i and j in the ensemble, $\hat{s}$ is the maximum of such variance for any pair of atoms, and φ is an optional scaling factor. The application of force constant weighting is presented in Additional file 1.
Identification of dynamic domains
Analysis and identification of dynamic domains, i.e. parts of the molecule that move as relatively rigid entities within a conformational ensemble, is made available through a new implementation of the GeoStaS algorithm previously presented as a standalone Java program [35]. The algorithm relies on the identification of the best pairwise superimposition of atomic trajectories based on rotation and translation in quaternion space. The resulting atomic movement similarity matrix provides a means for clustering the atoms in the system based on their respective similarity. This approach has the advantage of capturing the potential correlation in rotational motions of two atoms placed on opposite sites, which may otherwise be found to be anti-correlated in a standard cross-correlation analysis. The application of GeoStaS is demonstrated in Additional files 1 and 2 for both single structure and ensemble NMA, as well as for ensembles of PDB structures and MD trajectories.
Correlation network analysis
Correlation network analysis can be employed to identify protein segments with correlated motions. In this approach, a weighted graph is constructed where each residue represents a node and the weight of the connection between nodes, i and j, represents their respective cross-correlation value, $c_{ij}$, expressed by either the Pearson-like form [36], or the linear mutual information [37]. Here we propose an approach related to that introduced by Sethi et al. [38], but using multiple correlation matrices derived from the input ensemble instead of contact maps. Specifically, the correlation matrix (C) is calculated for each structure in the ensemble NMA. Then, edges are added for residue pairs with $c_{ij} \geq c_0$ across all experimental structures, where $c_0$ is a constant. In addition, edges are added for residues where $c_{ij} \geq c_0$ for at least one of the structures and the Cα-Cα distance, $d_{ij}$, satisfies $d_{ij} \leq 10$ Å for at least 75% of all conformations. Edge weights are then calculated with $-\log(\langle c_{ij} \rangle)$, where $\langle \cdot \rangle$ denotes the ensemble average. Girvan and Newman betweenness clustering [39] is then performed to generate aggregate nodal clusters, or communities, that are highly intra-connected but loosely inter-connected. Visualization of the resulting network and community structures in both 2D and 3D along with additional clustering and analysis options are also provided. See Additional file 4 for a complete example of the integration of ensemble NMA with correlation network analysis.
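The edge selection rules described above can be sketched in a few lines. This plain-Python example is illustrative only and is not the Bio3D cna() implementation; the function name build_edges and its parameter names are hypothetical. The text does not give a value for the threshold c0, so the default of 0.5 is an assumption, as is the guard that skips pairs whose average correlation is not positive (the logarithm would be undefined).

```python
import math


def build_edges(corr, dist, c0=0.5, d_cut=10.0, frac=0.75):
    """Select network edges and weights from ensemble correlation data.

    corr: list of correlation matrices, one per structure.
    dist: list of C-alpha distance matrices (Å), one per conformation.
    c0:   correlation threshold (assumed value, not given in the text).
    """
    n = len(corr[0])
    edges = {}
    for i in range(n):
        for j in range(i + 1, n):
            values = [c[i][j] for c in corr]
            # Rule 1: correlated in every structure.
            in_all = all(v >= c0 for v in values)
            # Rule 2: correlated in at least one structure and in contact
            # (d_ij <= d_cut) in at least `frac` of the conformations.
            in_any = any(v >= c0 for v in values)
            close = sum(d[i][j] <= d_cut for d in dist) >= frac * len(dist)
            mean_c = sum(values) / len(values)
            # Assumed guard: -log is only defined for a positive average.
            if (in_all or (in_any and close)) and mean_c > 0.0:
                edges[(i, j)] = -math.log(mean_c)
    return edges


# Toy input: three residues, two structures.
corr = [
    [[1.0, 0.8, 0.6], [0.8, 1.0, 0.1], [0.6, 0.1, 1.0]],
    [[1.0, 0.7, 0.3], [0.7, 1.0, 0.2], [0.3, 0.2, 1.0]],
]
dist = [
    [[0.0, 5.0, 9.0], [5.0, 0.0, 12.0], [9.0, 12.0, 0.0]],
    [[0.0, 6.0, 9.5], [6.0, 0.0, 13.0], [9.5, 13.0, 0.0]],
]
edges = build_edges(corr, dist)
```

In this toy input the pair (0, 1) passes the first rule, the pair (0, 2) passes only the second rule, and the pair (1, 2) receives no edge. Strongly correlated pairs obtain small weights, so they act as short paths in the subsequent betweenness clustering.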
Results and discussion
In this section we demonstrate the application of new Bio3D functionality for analyzing functional motions in two distinct protein systems. Further examples, along with executable code, are provided in Additional files 1, 2, 3 and 4.
Cross-species analysis of DHFR
Dihydrofolate reductase (DHFR) plays a critical role in promoting cell growth and proliferation in all organisms by catalyzing the reaction of dihydrofolate to tetrahydrofolate, an essential precursor for thymidylate synthesis [40]. DHFR is a major target for several antibiotics and has been the subject of extensive structural studies. There are currently more than 500 DHFR structures in the PDB including a multitude of liganded states from a number of species. Here we demonstrate the use of Bio3D to take full advantage of this large structural data set when performing NMA. We first focus on the E. coli DHFR structures before proceeding to a cross species analysis of all available DHFR structures.
Following the workflow described in Figure 1 (see the Package overview and architecture section), we collected all 90 E. coli DHFR structures from the PDB, performed a PCA to investigate the major conformational variation, and calculated the normal modes of each structure to probe for potential differences in structural flexibility. The PCA reveals that the ensemble can be divided into three major groups along their first two principal components (which collectively account for 59% of the total coordinate mean
square displacements, Figure 2A). These conformers display either a closed, occluded, or an open conformation of two active site loops (termed the Met20 loop: residues 9-24, and the F-G loop: residues 116-132). NMA reveals that structures adopting an open conformation show enhanced flexibility for the Met20 loop as compared to both the closed and occluded conformations (Figure 2B). Conversely, the F-G loop shows lower fluctuation values for the open conformation as compared to the occluded state (Additional file 2). These differences in mode fluctuations highlight the importance of considering multiple conformers in NMA, which is greatly facilitated by the Bio3D package. Additionally, domain analysis with the function geostas() reveals the presence of two dynamic sub-domains corresponding to the adenosine-binding sub-domain and the loop sub-domain, respectively (Figure 2C). These domains are divided by a hinge region corresponding to residues Thr35 and Gln108, in agreement with previous studies [41]. This example demonstrates how integrating PCA, NMA and dynamic domain analysis on E. coli DHFR structures can provide mechanistic insight into protein dynamics of functional relevance.
Figure 2 Results of ensemble PCA and NMA on E. coli DHFR. (A) Available PDB structures projected onto their first two principal components accounting for a total of 59% of the total variance. (B) Comparison of mode fluctuations calculated for open (black) and closed (red) conformations. The figure is generated by automated functions for plotting and the identification of areas of significant differences in residue fluctuations between groups of conformers (light blue boxes). The locations of major secondary structure elements are shown in the plot margins with β strands in gray and α helices in black. (C) Conformational ensemble obtained from interpolating along the first five modes of all collected E. coli structures. Domain analysis on the generated ensemble reveals the identification of two dynamic sub-domains colored red and blue, respectively. See Additional file 2 for full details and corresponding code for this analysis.
Beginning with the knowledge of only one DHFR PDB code, the complete PCA and NMA of the E. coli DHFR ensemble can be performed with only a few lines of code:
    library(bio3d)

    ## Search for related structures
    aa = get.seq("1rx2_A")
    blast = blast.pdb(aa)
    hits = plot.blast(blast, cutoff=225)

    ## Download and split by chain ID
    files = get.pdb(hits, split=TRUE)

    ## Structure/sequence alignment
    pdbs = pdbaln(files, fit=TRUE)

    ## PCA and ensemble NMA
    pc = pca(pdbs)
    modes = nma(pdbs)

    ## Plot results
    plot(pc)
    plot(modes)
To detect more distantly related DHFR homologues we built a hidden Markov model (HMM) from the PFAM multiple sequence alignment using the Bio3D interface to PFAM and HMMER (see the Package overview and architecture section). The resulting HMM was used in a new search of the PDB that identified a total of 33 species from bacteria, archaea, and eukaryotes, showing a pairwise sequence identity down to 21%. NMA was carried out on 197 of these structures. The resulting fluctuation profiles
are plotted for each species along with the sequence conservation in Figure 3A-B. The plot reveals an overall similar trend of residue fluctuations between the species despite their low sequence identity. While the functionally important Met20 loop displays a conserved flexibility trend for most of the species, the E. coli structures have enhanced fluctuations in this region (region I, Figure 3). This has previously been attributed to distinct functional mechanisms for ligand flux: while E. coli DHFR relies on loop flexibility for the opening of the active site, H. sapiens DHFR accomplishes this by subtle subdomain rotational hinge motions [41]. Other important differences include enhanced loop fluctuations in H. sapiens DHFR, which are not evident in the bacterial species (residues 43-50 and 126-131 for human DHFR; Figure 3). These fluctuations have been suggested to be important for facilitating the hinge motions in H. sapiens DHFR [41]. Interestingly, the flexibility pattern of the human DHFR 43-50 loop is shared with two fungal variants: C. albicans and C. glabrata (region II, Figure 3). A similar trend is apparent for residues 62-64 in human DHFR. This flexible loop is also shared with the bacterial M. tuberculosis species (region III), but is missing in the four other bacterial species. Finally, the two fungal species display an additional and flexible surface loop (residues 139-150 in C. albicans DHFR; region IV), while C. glabrata contains residues 164-178 specific for this species (region V). This example demonstrates how Bio3D version 2.0 can facilitate the investigation of common and divergent protein structural dynamics in large protein superfamilies.
Figure 3 Cross-species normal modes analysis of DHFR. (A) Sequence conservation of the collected DHFR species. (B) Aligned fluctuation profiles for selected species of DHFR. Shaded blue regions depict areas discussed in the text showing different fluctuation patterns between specific species. The region shaded in light red depicts the Met20 loop in E. coli DHFR and the corresponding loop in the remaining species. The locations of major secondary structure elements in E. coli DHFR are also shown in the plot margins with β strands in gray and α helices in black. (C) A visual comparison of mode fluctuations between DHFR from E. coli and H. sapiens. Fluctuation magnitude is represented by thin to thick tube colored blue (low fluctuations), white (moderate fluctuations) to red (large fluctuations). See Additional file 3 for full details and corresponding code for this analysis.
Heterotrimeric G-proteins
Applying ensemble NMA to heterotrimeric G-protein α-subunits (Gα) reveals nucleotide dependent structural dynamic features of functional relevance. Gα undergoes cycles of nucleotide-dependent conformational rearrangements to couple cell surface receptors to downstream effectors and signaling cascades that control diverse cellular processes. These processes range from movement and division to differentiation and neuronal activity. Interaction with activated receptor promotes the exchange of GDP for GTP on Gα and its separation from its βγ subunit partners (Gβγ). Both isolated Gα and Gβγ can then interact and activate downstream effectors. GTP hydrolysis deactivates Gα, which re-associates with Gβγ effectively completing the cycle.
Figure 4 Investigating functional dynamics in heterotrimeric G-proteins. (A) Prediction of large-scale opening motions. (B) Prediction of dynamically coupled sub-domains (colored regions) from correlation network analysis of NMA results. Inter-subdomain couplings are highlighted with thick black lines. (C) Characterization of distinct GTP-active and GDP-inactive states from a clustering of NMA RMSIP results. (D) Fluctuation analysis reveals structural regions with significantly distinct flexibilities (highlighted with a blue shaded background are sites with a p-value < 0.005) between the active (red) and inactive (green) states. Full details for the reproduction of this analysis along with PCA that distinguishes GDP and GTP states can be found in the Additional file 1.
In the current application, we collected 53 PDB structures of Gα (from application of the blast.pdb() function). These structures were aligned with the function pdbaln() and their modes of motion calculated with nma() (Figure 1 and Additional file 1). Results from RMSIP, fluctuation, and correlation analysis indicate that the structural dynamics are nucleotide state dependent (Figure 4). The modes of motion clearly distinguish the GTP (active) and GDP (inactive) states (Figure 4C). Predicted residue fluctuations reveal areas of conserved dynamics interspersed with areas of significantly distinct flexibilities in the active and inactive states (Figure 4D). Specifically, the P-loop and switch I, switch II and switch III regions are predicted to be significantly more flexible in the GDP than in the GTP state. These results are consistent with our previous structural and MD simulation studies, in which these regions were found to be strongly coupled only in the active GTP state [42]. The stabilized P-loop and switch regions are thus a potential prerequisite for GTP hydrolysis and the binding of effectors.
It has been suggested that the activation mechanism of Gα involves a large domain opening that facilitates GDP/GTP exchange [43,44]. Applying NMA to a predicted open form of Gα [42] highlights the large flexibility of the helical domain and captures this opening-closing motion (Figure 4A). Combining NMA results with correlation network analysis methods, as implemented in the cna() function, reveals dynamically coupled subdomains that may facilitate the allosteric coupling of receptor and nucleotide binding sites (Figure 4B and Additional file 4). In summary, this example demonstrates the potential of ensemble NMA for characterizing key structural dynamic mechanisms in G proteins and other biomolecular systems.
Related solutions and future developmentsAs noted in the
introduction, a number of previouslyimplemented software solutions
(including multipleweb-servers [10-12,45] and standalone software
packages[13-15,46]) offer single structure NMA or MD analysis.
Table 1 Related software for analysis of protein structural dynamics

| Feature | MMTK 2.7 | ProDy 1.5 | MAVEN 1.2 | WebNM@ 2.0 | Bio3D 2.0 |
|---|---|---|---|---|---|
| Dependencies | Python, NumPy, ScientificPython | Python, NumPy, Matplotlib | Matlab Component Runtime (MCR) | Web browser | R, Muscle |
| Reading and analysis of molecular sequences | No | Yes | No | No | Yes |
| Reading and analysis of multiple molecular structures | No | Yes | Yes | Yes | Yes |
| Reading and analysis of binary MD simulation trajectories | Yes | Yes | No | No | Yes |
| Biomolecular database integration | No | PDB, PFAM^a | No^b | No^b | PDB, PFAM, UNIPROT, NR^c |
| Energy minimization and MD | Yes | No | No | No | No |
| Standard NMA | Yes | Yes | Yes | Yes | Yes |
| Ensemble NMA across heterogeneous structures | No | No | No | Yes | Yes |
| Forcefields for NMA | C-alpha, ANM, Amber all-atom | GNM/ANM, Custom | GNM/ANM, pANM, STM, nnANM, mcgANM, Custom^d | C-alpha | C-alpha, ANM, pfANM, sdENM, REACH, Custom |
| Ensemble PCA across heterogeneous structures | No | Yes | Identical structures only | No | Yes |
| Correlation network analysis from NMA and MD | No | No | No | No | Yes |
| Dynamic domain analysis | No | No | No | No | Yes |
| Sequence alignment | No | No | No | No | Yes |
| Structure alignment | Yes | Yes | No | No | Yes |
| Advanced statistical analysis | No | No | No | No | Yes |
| Permits both interactive and batch analysis | Yes | Yes | No | Yes | Yes |
| Open source code available | Yes | Yes | Yes^e | No | Yes |
| Multicore compatibility | Yes | No | No | No | Yes |
| GUI | No | No^f | Yes | Webserver | No^g |

^a Read and search functionality.
^b Read-only functionality from the PDB.
^c Read, search, and annotation functionality, including enhanced search capabilities across multiple databases.
^d STM: Spring Tensor Model; pANM: power ANM; nnANM: nearest neighbor ANM; mcgANM: mixed coarse graining ANM.
^e Dependencies are not open source.
^f VMD plugin NMWiz available for single molecule NMA.
^g Web interface for ensemble PCA and NMA in development.

These however typically lack extensive coupling to different
biomolecular databases and methods for evolutionary and comparative
analysis of large sequence and structural datasets (see Table 1).
This lack of integrated functionality impedes efficient exploratory
analysis of sequence-structure-dynamics relationships. Bio3D version
2.0 now integrates functionality for searching and fetching data
from major sequence/structure databases, sequence/structure
alignment and conservation analysis, high-throughput ensemble NMA
and PCA of heterogeneous structures, protein structure network
analysis and many commonly used functions for simulation analysis.
The package also includes specifically tailored plotting and
visualization functionality as well as coupling to the
well-developed R environment for statistical computing and graphics.
Bio3D thus offers unparalleled capabilities for both exploratory
interactive and large-scale
batch analysis of structural dynamic mechanisms in biomolecular
systems.

Current and future development of Bio3D (see:
https://bitbucket.org/Grantlab/bio3d) includes implementation of
additional 3D visualization functionality, enhanced compatibility
with the AMBER package [47], and further parallelization and
optimization of structural alignment methods using graphical
processing units (GPUs). We also plan to develop a web-interface and
API for ensemble NMA and PCA to make this functionality more widely
accessible. Finally, we envisage the development of new tools for
structural dynamic mapping of clinical variants from next generation
sequencing data and integration with the Bioconductor project [48]
and tools for analysis of various omics data.
Conclusion
Bio3D version 2.0 provides a versatile integrated environment for
protein structural and evolutionary analysis with unique
capabilities including high-throughput ensemble NMA for examining
the dynamics of evolutionarily related protein structures; a
convenient interface for accessing multiple ENM force fields; and a
direct integration with a large number of functions for sequence,
structure and simulation analysis. The package is implemented in the
R environment and thus couples to extensive graphical and
statistical capabilities along with a powerful user-friendly
interactive programming environment that, together with Bio3D,
enables both exploratory structural bioinformatics analysis and
automated batch analysis of large datasets.
Availability and requirements
Project name: Bio3D
Project home page: http://thegrantlab.org/bio3d
Operating system(s): Platform independent
Programming language: R
Other requirements: R >= 3.0.0
License: GPL2
Any restrictions to use by non-academics: none
Additional files
Additional file 1: Comprehensive tutorials for traditional single
structure and new ensemble NMA on heterotrimeric G-proteins and
other systems.
Additional file 2: E. coli DHFR ensemble NMA and PCA, including a
comparison of implemented similarity measures.
Additional file 3: Species-wide NMA of the DHFR superfamily.
Additional file 4: Complete example of the integration of ensemble
NMA with correlation network analysis.
Abbreviations
CNA: Correlation network analysis; DHFR: Dihydrofolate reductase;
ENM: Elastic network model; MD: Molecular dynamics; NMA: Normal mode
analysis; PCA: Principal component analysis; RMSIP: Root mean square
inner product.
Competing interests
The authors declare that they have no competing interests.
Author contributions
Conceived and designed the study: LS, XY and BJG. Performed the
study: LS and XY. Implementation: LS and XY (NMA functionality); XY,
GS and BJG (CNA functionality). Analyzed and interpreted the data:
LS, XY and BJG. Wrote the paper and the attached vignettes: LS, XY
and BJG. All authors read and approved the final manuscript.
Acknowledgements
We thank Edvin Fuglebakk and Julia Romanowska (University of Bergen,
Norway) as well as the Bio3D user community for valuable discussions
and software testing. We acknowledge the University of Bergen (LS)
and University of Michigan (XY, GS and BJG) for funding.
Author details
1 Department of Biomedicine, University of Bergen, Bergen, Norway.
2 Structural and Computational Biology Unit, European Molecular
Biology Laboratory, Heidelberg, Germany.
3 Department of Computational Medicine and Bioinformatics,
University of Michigan, Ann Arbor, Michigan, USA.
Received: 7 October 2014 Accepted: 26 November 2014
References
1. Teilum K, Olsen JG, Kragelund BB: Functional aspects of protein flexibility. Cell Mol Life Sci 2009, 66:2231–2247.
2. Henzler-Wildman K, Kern D: Dynamic personalities of proteins. Nature 2007, 450:964–972.
3. Henzler-Wildman KA, Thai V, Lei M, Ott M, Wolf-Watz M, Fenn T, Pozharski E, Wilson MA, Petsko GA, Karplus M, Hübner CG, Kern D: Intrinsic motions along an enzymatic reaction trajectory. Nature 2007, 450:838–844.
4. Vale RD, Milligan RA: The way things move: looking under the hood of molecular motor proteins. Science 2000, 288:88–95.
5. Yébenes H, Mesa P, Muñoz IG, Montoya G, Valpuesta JM: Chaperonins: two rings for folding. Trends Biochem Sci 2011, 36:424–432.
6. Smock RG, Gierasch LM: Sending signals dynamically. Science 2009, 324:198–203.
7. Karplus M, McCammon JA: Molecular dynamics simulations of biomolecules. Nat Struct Biol 2002, 9:646–652.
8. Lee EH, Hsin J, Sotomayor M, Comellas G, Schulten K: Discovery through the computational microscope. Structure 2009, 17:1295–1306.
9. Skjaerven L, Hollup SM, Reuter N: Normal mode analysis for proteins. J Mol Struct (THEOCHEM) 2009, 898:42–48.
10. Suhre K, Sanejouand Y-H: ElNémo: a normal mode web server for protein movement analysis and the generation of templates for molecular replacement. Nucleic Acids Res 2004, 32:W610–W614.
11. Krüger DM, Ahmed A, Gohlke H: NMSim web server: integrated approach for normal mode-based geometric simulations of biologically relevant conformational transitions in proteins. Nucleic Acids Res 2012, 40:W310–W316.
12. Eyal E, Yang L-W, Bahar I: Anisotropic network model: systematic evaluation and a new web interface. Bioinformatics 2006, 22:2619–2627.
13. Hinsen K: The molecular modeling toolkit: a new approach to molecular simulations. J Comput Chem 2000, 21:79–85.
14. Bakan A, Meireles LM, Bahar I: ProDy: protein dynamics inferred from theory and experiments. Bioinformatics 2011, 27:1575–1577.
15. Zimmermann MT, Kloczkowski A, Jernigan RL: MAVENs: motion analysis and visualization of elastic networks and structural ensembles. BMC Bioinformatics 2011, 12:264.
16. Grant B, Rodrigues A, ElSawy KM, McCammon JA, Caves LSD: Bio3d: an R package for the comparative analysis of protein structures. Bioinformatics 2006, 22:2695–2696.
17. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215:403–410.
18. Finn RD, Clements J, Eddy SR: HMMER web server: interactive sequence similarity searching. Nucleic Acids Res 2011, 39:W29–W37.
19. Atilgan AR, Durell SR, Jernigan RL, Demirel MC, Keskin O, Bahar I: Anisotropy of fluctuation dynamics of proteins with an elastic network model. Biophys J 2001, 80:505–515.
20. Yang L, Song G, Jernigan RL: Protein elastic network models and the ranges of cooperativity. Proc Natl Acad Sci U S A 2009, 106:12347–12352.
21. Hinsen K, Petrescu AJ, Dellerue S, Bellissent-Funel MC, Kneller GR: Harmonicity in slow protein dynamics. Chem Phys 2000, 261:25–37.
22. Moritsugu K, Smith JC: Coarse-grained biomolecular simulation with REACH: realistic extension algorithm via covariance Hessian. Biophys J 2007, 93:3460–3469.
23. Dehouck Y, Mikhailov AS: Effective harmonic potentials: insights into the internal cooperativity and sequence-specificity of protein dynamics. PLoS Comput Biol 2013, 9:e1003209.
24. Tirion M: Large amplitude elastic motions in proteins from a single-parameter, atomic analysis. Phys Rev Lett 1996, 77:1905–1908.
25. Hinsen K: Analysis of domain motions by approximate normal mode calculations. Proteins 1998, 33:417–429.
26. Fuglebakk E, Echave J, Reuter N: Measuring and comparing structural fluctuation patterns in large protein datasets. Bioinformatics 2012, 28:2431–2440.
27. Caves LSD, Evanseck JD, Karplus M: Locally accessible conformations of proteins: multiple molecular dynamics simulations of crambin. Protein Sci 1998, 7:649–666.
28. Gorfe AA, Grant BJ, McCammon JA: Mapping the nucleotide and isoform-dependent structural and dynamical features of ras proteins. Structure 2008, 16:885–896.
29. Grant BJ, McCammon JA, Caves LSD, Cross RA: Multivariate analysis of conserved sequence-structure relationships in kinesins: coupling of the active site and a tubulin-binding sub-domain. J Mol Biol 2007, 368:1231–1248.
30. Van Aalten DMF, de Groot BL, Findlay JBC, Berendsen HJC, Amadei A: A comparison of techniques for calculating protein essential dynamics. J Comput Chem 1997, 18:169–181.
31. Amadei A, Ceruso MA, Di Nola A: On the convergence of the conformational coordinates basis set obtained by the essential dynamics analysis of proteins’ molecular dynamics simulations. Proteins 1999, 36:419–424.
32. Romo TD, Grossfield A: Validating and improving elastic network models with molecular dynamics simulations. Proteins 2011, 79:23–34.
33. Fuglebakk E, Reuter N, Hinsen K: Evaluation of protein elastic network models based on an analysis of collective motions. J Chem Theory Comput 2013, 9:5618–5628.
34. Kundu S, Melton JS, Sorensen DC, Phillips GN: Dynamics of proteins in crystals: comparison of experiment with simple models. Biophys J 2002, 83:723–732.
35. Romanowska J, Nowinski KS, Trylska J: Determining geometrically stable domains in molecular conformation sets. J Chem Theory Comput 2012, 8:2588–2599.
36. Ichiye T, Karplus M: Collective motions in proteins: a covariance analysis of atomic fluctuations in molecular dynamics and normal mode simulations. Proteins 1991, 11:205–217.
37. Lange OF, Grubmüller H: Generalized correlation for biomolecular dynamics. Proteins 2006, 62:1053–1061.
38. Sethi A, Eargle J, Black AA, Luthey-Schulten Z: Dynamical networks in tRNA:protein complexes. Proc Natl Acad Sci U S A 2009, 106:6620–6625.
39. Girvan M, Newman MEJ: Community structure in social and biological networks. Proc Natl Acad Sci U S A 2002, 99:7821–7826.
40. Schnell JR, Dyson HJ, Wright PE: Structure, dynamics, and catalytic function of dihydrofolate reductase. Annu Rev Biophys Biomol Struct 2004, 33:119–140.
41. Bhabha G, Ekiert DC, Jennewein M, Zmasek CM, Tuttle LM, Kroon G, Dyson HJ, Godzik A, Wilson IA, Wright PE: Divergent evolution of protein conformational dynamics in dihydrofolate reductase. Nat Struct Mol Biol 2013, 20:1243–1249.
42. Yao X-Q, Grant BJ: Domain-opening and dynamic coupling in the α-subunit of heterotrimeric G proteins. Biophys J 2013, 105:L08–L10.
43. Rasmussen SGF, DeVree BT, Zou Y, Kruse AC, Chung KY, Kobilka TS, Thian FS, Chae PS, Pardon E, Calinski D, Mathiesen JM, Shah STA, Lyons JA, Caffrey M, Gellman SH, Steyaert J, Skiniotis G, Weis WI, Sunahara RK, Kobilka BK: Crystal structure of the β2 adrenergic receptor-Gs protein complex. Nature 2011, 477:549–555.
44. Westfield GH, Rasmussen SGF, Su M, Dutta S, DeVree BT, Chung KY, Calinski D, Velez-Ruiz G, Oleskie AN, Pardon E, Chae PS, Liu T, Li S, Woods VL, Steyaert J, Kobilka BK, Sunahara RK, Skiniotis G: Structural flexibility of the G alpha s alpha-helical domain in the beta2-adrenoceptor Gs complex. Proc Natl Acad Sci U S A 2011, 108:16086–16091.
45. Hollup SM, Salensminde G, Reuter N: WEBnm@: a web application for normal mode analyses of proteins. BMC Bioinformatics 2005, 6:52.
46. Roe DR, Cheatham TE: PTRAJ and CPPTRAJ: Software for Processing and Analysis of Molecular Dynamics Trajectory Data. J Chem Theory Comput 2013, 9:3084–3095.
47. Salomon-Ferrer R, Case DA, Walker RC: An overview of the Amber biomolecular simulation package. WIREs Comput Mol Sci 2013, 3:198–210.
48. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5:R80.
doi:10.1186/s12859-014-0399-6
Cite this article as: Skjærven et al.: Integrating protein structural
dynamics and evolutionary analysis with Bio3D. BMC Bioinformatics
2014 15:399.