Quantifying Conformational Ensemble Changes in Proteins ...labs.cas.usf.edu/cbb/Papers/ISCB2016_poster_Mohsen.pdf · Mohsen Botlani, Ahnaf Siddiqui and Sameer Varma Department of

Quantifying Conformational Ensemble Changes in Proteins Using Inverse Machine Learning

Mohsen Botlani, Ahnaf Siddiqui and Sameer Varma

Department of Cell Biology, Microbiology and Molecular Biology, University of South Florida, FL-33620

Background: Protein activities are tightly regulated in biological environments. Understanding their regulatory mechanisms entails assessing their various states, including active and inactive states. For many proteins, states can be distinguished by their minimum-energy conformations, since the magnitudes of thermal fluctuations, or dynamics, are negligible compared to the differences between minimum-energy structures. This approximation, however, breaks down for several other proteins. The states of these proteins can be distinguished categorically from each other only when their finite-temperature conformational ensembles are considered alongside their minimum-energy structures. The list of such proteins has grown rapidly in the last decade and now includes GPCRs, PDZ domains, nuclear transcription factors, heat shock proteins, T-cell receptors and viral attachment proteins. Applying molecular simulations to understand mechanisms in this latter category of proteins requires new methods that can deal with high-dimensional conformational ensemble data.

Description: The traditional approach to comparing protein conformational ensembles is to compare their respective summary statistics. However, if a subset of the summary statistics from the two ensembles is found to be identical, it does not imply that the remaining summary statistics will also be identical. The general problem of finding and choosing a feature that appropriately distinguishes ensembles can be overcome by comparing ensembles directly against each other, prior to any dimensionality reduction. We have developed a method that does this, and it performs excellently for both Gaussian and non-Gaussian distributions. The difference between ensembles is computed by solving the inverse machine learning problem, and is expressed in terms of a metric that satisfies the conditions set forth by the zeroth law of thermodynamics.
Conclusions: Such a quantification permits the statistical analyses and quantitative data mining necessary for establishing causality in protein functional regulation. We have applied this method to (a) quantitatively understand the effect of ligand binding on the structure and dynamics of a viral protein whose function is controlled by dynamic allostery; (b) understand the role of water in the inception of allosteric signals; and (c) determine intersecting signaling pathways. This method is available under the standard GNU license on SimTk (https://simtk.org/projects/conf_ensembles).

1. Leighty RE and Varma S. J. Chem. Theory Comput., 9: 868-875, 2013.
2. Varma S, Botlani M and Leighty RE. Proteins, 82: 3241-3254, 2014.
3. Dutta P, Botlani M and Varma S. J. Phys. Chem. B, 118: 14795-14807, 2014.
4. Dutta P, Siddiqui A, Botlani M and Varma S. Biophys. J., 2016, Under revision.

Acknowledgments: All simulations were carried out at the Research Computing center of the University of South Florida.


Intersecting signaling pathways:

Conformational sampling over 3 collective variables

Traditionally, a support vector machine (SVM) is used for binary classification. It is first trained on a set of instances whose group identities are known, and then used for predicting the group identities of unclassified instances. In our approach, we train the SVM to recognize the difference between two n-particle conformational ensembles, but instead of using the trained SVM for predictive purposes, we utilize the mathematical properties of the underlying classification function to obtain a physically meaningful quantitative estimate for the difference between the ensembles. The method is trained on Gaussian distributions, and works excellently without need for any data fitting. From a theoretical standpoint, the method should also work for multi-Gaussian distributions, and by extension, for any distribution, because the overlap between two multi-Gaussian distributions is essentially a sum of overlaps between Gaussian distributions.
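The idea of reading an ensemble difference off a trained classifier, rather than using the classifier for prediction, can be illustrated with a toy sketch. This is not the estimator used in the method described here, whose construction of F(r) is given in the cited papers. It assumes one-dimensional "conformations", a linear soft-margin SVM fitted by stochastic subgradient descent, and the textbook relation that, for two equally sized ensembles, the smallest achievable misclassification rate equals half the overlap, taken here as the integral of the pointwise minimum of the two densities. All function names and parameter values are hypothetical.

```python
import random

def sample_ensemble(mean, sigma, m, rng):
    """Draw m one-dimensional toy 'conformations' from a Gaussian."""
    return [rng.gauss(mean, sigma) for _ in range(m)]

def train_linear_svm(xs, ys, lam=0.01, lr=0.01, epochs=20, seed=0):
    """Fit F(x) = w*x + b by stochastic subgradient descent on the
    L2-regularized hinge loss (a linear soft-margin SVM). Labels are +1/-1."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            violated = ys[i] * (w * xs[i] + b) < 1.0
            w -= lr * lam * w              # gradient of the regularizer
            if violated:                   # subgradient of the hinge term
                w += lr * ys[i] * xs[i]
                b += lr * ys[i]
    return w, b

def eta_proxy(ens_a, ens_b):
    """Crude estimate of eta = 1 - overlap from the training error rate.
    Assumes equally sized ensembles, so that error rate ~ overlap / 2."""
    xs = ens_a + ens_b
    ys = [-1.0] * len(ens_a) + [1.0] * len(ens_b)
    w, b = train_linear_svm(xs, ys)
    errors = sum(1 for x, y in zip(xs, ys) if y * (w * x + b) <= 0.0)
    return 1.0 - 2.0 * errors / len(xs)

rng = random.Random(42)
ens_a = sample_ensemble(-1.0, 1.0, 1000, rng)
ens_b = sample_ensemble(+1.0, 1.0, 1000, rng)
eta_est = eta_proxy(ens_a, ens_b)
print(f"eta (classifier proxy) = {eta_est:.2f}")
```

For these two unit-width Gaussians centered at -1 and +1, the analytical value of 1 − overlap under the same definition is about 0.68, so the printed estimate should land near that value, up to sampling noise.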

Residues that are close to the diagonal undergo shifts primarily in backbone positions. Residues that lie below the diagonal undergo changes in side chain orientations and/or conformational entropy. Residues that lie above the diagonal represent cases where backbone deviations are swamped by smaller changes in whole residue deviations.

Example: Effect of force field on ligand-induced conformational ensemble shifts. ηExplicit and ηImplicit are computed, respectively, from stochastic dynamics simulations in explicit and implicit solvent.

[Figure 3 plots: three panels with axis labels "Etas" and "1−Overlap", both axes from 0 to 1, with representative distribution pairs as insets. Panels: (a) Bimodal, (b) Trimodal, (c) Quadrimodal. Reported values: MAE = 4.2%, ρ = 0.99; MAE = 5.7%, ρ = 0.98; MAE = 5.8%, ρ = 0.97.]

Figure 3: Performance of $\eta$ estimated from $F(r)$ against its exact value ($1 - \lVert \mathbb{R} \cap \mathbb{R}' \rVert$). For each of the three types of multimodal distributions, (a) bimodal distributions ($\mathbb{R} = \sum_{i=1}^{2} c_i f_i$), (b) trimodal distributions ($\mathbb{R} = \sum_{i=1}^{3} c_i f_i$), and (c) quadrimodal distributions ($\mathbb{R} = \sum_{i=1}^{4} c_i f_i$), we generate 400 random pairs $(\mathbb{R}, \mathbb{R}')$ by modulating the weighting coefficients $c$ as well as the attributes of the Gaussian functions $f$. Representative distribution pairs are shown as insets, where the shaded portions indicate the overlap ($\lVert \mathbb{R} \cap \mathbb{R}' \rVert$) between the distributions. Performance is quantified using mean absolute errors (MAE) and Pearson correlation coefficients ($\rho$).
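The two performance measures named in the caption can be computed as in the minimal sketch below. The numbers are hypothetical placeholders, not data from the poster, and reporting the MAE as a percentage of the [0, 1] range of η is an assumption about how the figure's percentages were obtained.

```python
import math

def mean_absolute_error(estimated, exact):
    """Average absolute deviation between paired estimates and exact values."""
    return sum(abs(e - x) for e, x in zip(estimated, exact)) / len(exact)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical values, for illustration only (not data from the poster).
eta_exact = [0.10, 0.35, 0.60, 0.85]
eta_estimated = [0.12, 0.33, 0.65, 0.80]

mae = mean_absolute_error(eta_estimated, eta_exact)
rho = pearson(eta_estimated, eta_exact)
print(f"MAE = {100 * mae:.1f}%  rho = {rho:.3f}")
```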

We also note from Fig. 4a that while the two simulations of the ephrin bound state yield identical RBD-RBD orientations, the two simulations of the ephrin free state yield slightly different RBD-RBD orientations. To understand the latter, we visualize in Fig. 4b the RBD-RBD interfaces obtained from these simulations in the context of the position of the FAD. We note that the FAD will interact more extensively with the RBDs in the ephrin free state than in the ephrin bound state. Therefore, the two simulations of the ephrin free state may produce slightly different RBD-RBD interfaces because the RBD-FAD interface is absent from our simulations. Nevertheless, the primary outcome of these simulations is that ephrin binding induces a significant change in the RBD-RBD orientation.

[Figure 4 plots: (a) time series over 0 to 200 ns of $d_{\mathrm{CoM}}$ (nm), $\theta_{\mathrm{tilt}}$ (rad) and $\theta_{\mathrm{roll}}$ (rad) for the ephrin free and ephrin bound states; (b) structural snapshots labeled FAD, RBD-I, RBD-II and Ephrin, with the axes $\hat{a}$ and $\hat{a}'$ and the quantities $\theta_{\mathrm{tilt}}$, $\theta_{\mathrm{roll}}$ and $d_{\mathrm{CoM}}$ marked.]

Figure 4: (a) Time evolutions of collective variables that describe the interface between the two RBDs of a dimer. The two lines for each of the ephrin free and ephrin bound states indicate two separate MD simulations. $d_{\mathrm{CoM}}$ is the distance between the centers of mass (CoM) of the backbone atoms of the two RBDs. $\theta_{\mathrm{tilt}}$ is the angle between the central axes, $\hat{a}$ and $\hat{a}'$, of the two RBDs. $\theta_{\mathrm{roll}}$ is the angle of rotation of the RBD about its central axis. The geometrical definitions of $\theta_{\mathrm{tilt}}$ and $\theta_{\mathrm{roll}}$ are provided in Fig. S5 in the Supporting Material. (b) Final snapshots of the RBD-RBD interface in MD simulations. Note that two superimposed structures are shown for the ephrin free state, as the two simulations in the ephrin free state produced slightly different RBD-RBD geometries. The location of the FAD relative to the RBD-RBD dimer is depicted according to the structure of the full length ectodomain proposed by Broder and coworkers (5), which was homology modeled on the X-ray structures of the G analogs in the Newcastle Disease Virus and the parainfluenza virus (4, 11, 12).
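Two of the collective variables in the caption can be sketched directly from their verbal definitions: the distance between mass-weighted centers of two atom groups, and the angle between two axis vectors. The sketch assumes the central axes are already available as vectors, since their geometrical construction is given in Fig. S5 and not here; $\theta_{\mathrm{roll}}$ is omitted for the same reason. Coordinates, masses and function names are hypothetical toy values.

```python
import math

def center_of_mass(coords, masses):
    """Mass-weighted mean position of a group of atoms (3-D tuples)."""
    total = sum(masses)
    return tuple(
        sum(m * c[k] for c, m in zip(coords, masses)) / total for k in range(3)
    )

def distance(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def angle_between(u, v):
    """Angle in radians between two vectors, clamped against round-off."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos_t = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos_t)

# Hypothetical toy data: two 'backbone atoms' per domain, positions in nm.
rbd1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
rbd2 = [(0.0, 5.0, 0.0), (1.0, 5.0, 0.0)]
masses = [12.0, 14.0]

d_com = distance(center_of_mass(rbd1, masses), center_of_mass(rbd2, masses))

# Hypothetical central axes of the two domains.
axis_a = (0.0, 0.0, 1.0)
axis_a_prime = (0.0, 1.0, 1.0)
theta_tilt = angle_between(axis_a, axis_a_prime)

print(f"d_CoM = {d_com:.2f} nm, theta_tilt = {theta_tilt:.3f} rad")
```

Evaluating these functions on every frame of a trajectory would give time series of the kind plotted in panel (a).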

[Schematic labels: Inactive, Active, Ligand.]

[Scatter plot of ηExplicit against ηImplicit, both axes from 0 to 1.0; ρ = 0.28. Additional labels: G-B2, G-B3.]


Biophysical Journal Volume: 00 Month Year 0–0 1

Water dynamics at protein-protein interfaces: A molecular dynamics study of virus-host receptor complexes

Priyanka Dutta, Mohsen Botlani and Sameer Varma

Department of Cell Biology, Microbiology and Molecular Biology, University of South Florida, 4202 E. Fowler Ave., Tampa, FL-33620, United States of America

Abstract

The dynamical properties of water at biological interfaces are different from those in bulk water. Experiments as well as simulations indicate that water diffuses and orients at rates that depend on both the chemistry and the topology of the interface. Here we utilize molecular dynamics simulations to determine the nature and extent to which the dynamical properties of water are shifted from their bulk values when they occupy interstitial regions between two proteins. We consider two natural protein-protein complexes, one in which the Nipah virus G protein binds to cellular ephrin B2, and the other in which the same G protein binds to ephrin B3. These protein-protein interactions constitute the first step in Nipah infection. We find that despite the low sequence identity of 50% between ephrins B2 and B3, the dynamical properties of interstitial waters in the two complexes are similar. In both cases, we find that the interstitial waters diffuse ten times slower than bulk water. In addition, despite their resolution in crystal structures, more than 95% of the waters in the interstitial regions exchange with the bulk within 150 ns. The interstitial waters also exhibit dipole relaxation times and hydrogen bond lifetimes an order of magnitude longer than bulk water. These deviations from bulk values are generally much larger than those observed at protein-water interfaces. To gauge the functional relevance of the interstitial water, we examine quantitatively how implicit solvent models compare against explicit solvent models in producing ephrin-induced shifts in the G configurational density. Ephrin-induced shifts in the G configurational density are critical to the allosteric regulation of viral fusion. We find that the two methods yield strikingly different induced changes in the G configurational density, which suggests that the interstitial waters may also contribute to the allosteric signaling, and therefore, are functionally important.

Correspondence: svarma@usf.edu

Introduction

The dynamical properties of water at biological interfaces are different from those in bulk water (1–3, 6–13). How are they different?

In general, the fundamental trend observed from experiments and simulations is that water diffuses, relaxes and orients slower at protein-water and lipid-water interfaces than in the bulk.

1. First hydration shell of proteins is denser.

IMPORTANT FOR METHODS: crystal waters in B2 and not B3. So retaining the crystal waters has no effect on the overall properties.

Equations: ρ/ρ0; r (Å); r = 10 Å

[Figure, panels (a) and (b): interstitial water.]


Inverse machine learning

Abstract

Functional Regulation via small structural changes

(a) Comparison between two conformational ensembles

(b) Comparison between two conformational ensemble shifts

(c) Comparison between multiple conformational ensembles

References

an extended ensemble approach (38,39) and with a coupling constant of 1 ps. An extended ensemble approach is also used for maintaining pressure (40). Pressure is maintained at 1 bar using a coupling constant of 1 ps and a compressibility of 4.5 × 10⁻⁵ bar⁻¹. NaCl concentration is set at 150 mM, and there are extra Na⁺ ions compared to Cl⁻ ions to compensate for the charge on the protein. Electrostatic interactions are computed using the particle mesh Ewald scheme (41) with a Fourier grid spacing of 0.1 nm, a fourth-order interpolation, and a direct space cutoff of 10 Å. The van der Waals interactions are computed explicitly for interatomic distances ≤ 10 Å. The bonds in proteins and the geometries of water molecules are constrained (42,43), and consequently an integration time step of 2 fs is employed. The protein and ions are described using OPLS-AA parameters (44), and the water molecules are described using TIP4P parameters (45). We note that we do not model induced effects explicitly; however, such effects are generally more important for describing ionic interactions (46,47). Convergence is monitored by tracking time evolutions of conformational RMSDs, pressure, potential energies, and a set of collective variables that describe RBD-RBD interfaces.

Construction of RBD-RBD dimer models

While there are no experimental structures of the RBD-RBD dimer of Nipah G, there is sufficient experimental data to construct the initial dimer model for carrying out MD simulations. Firstly, x-ray structures are available for the isolated Nipah RBD as well as its complex with ephrin (25,26). Secondly, both the ephrin-free and ephrin-bound structures of Nipah RBD have been subjected to MD at physiological temperature, and have been found to be stable (28,29). Thirdly, Bowden et al. (12) have proposed an RBD-RBD interface for the G protein of the Hendra virus (PDB: 2X9M). This interface serves as a suitable template to construct the initial model of the RBD-RBD interface of Nipah G because 1) the G protein of Hendra is a closely related homolog of the Nipah G protein (89% sequence similarity; see Fig. S1 in the Supporting Material), and 2) x-ray structures of the ephrin-free and ephrin-bound states of Hendra's RBD closely match the respective x-ray structures of Nipah's RBD (Fig. S2).

The RBD-RBD interface of Hendra's G protein was proposed (12) by consolidating data concerning the 1) packing interactions within crystals, 2) conservation patterns within RBD-RBD interfaces of analogous receptor binding proteins of other paramyxoviruses, and 3) distribution of N-linked glycosylation sites on the RBD. In particular, the distribution of glycosylation sites on the RBDs of Nipah and Hendra is such that only one specific face of the RBD can dimerize with an adjacent RBD; the remaining faces of the RBDs contain protruding glycosyl chains that will produce steric clashes. Therefore, there is no ambiguity concerning the dimerization face of the RBD. However, as Bowden et al. (12) also point out, there is ambiguity concerning the relative orientation between the two RBDs. Nevertheless, the Nipah RBD-RBD model constructed using the Hendra template will serve as an excellent starting point for MD simulations, which we, as such, utilize to determine the relative orientations between RBDs.

To construct the initial model of the RBD-RBD interface in the ephrin-free state, we take the final snapshot (640 ns) from our earlier simulation of the monomeric form of Nipah's RBD (28), and geometrically fit two of its copies individually onto the two RBDs of Hendra's RBD-RBD dimer. Geometric fitting is conducted using the backbone Cα atoms. The two geometric fits produced identical least squared fit values, which is expected because the RBD-RBD interface is symmetric. We also consider the fits excellent (RMSD < 2 Å). The templated model is shown in Fig. S3. We use the same protocol to construct the initial model of the RBD-RBD dimer in the ephrin-bound state, but in this case we take the final snapshot (460 ns) of our simulation of Nipah's ephrin-bound RBD monomer (28) (Fig. S3). In this case too, we find that the geometric fits are excellent (RMSD < 2 Å). The structures of both the ephrin-free and ephrin-bound RBDs fit excellently onto the RBD of Hendra because, as we note in Fig. 2, the difference between the ephrin-free and ephrin-bound structures of the RBD is small (25,26,28,29). Note that after fitting the RBD of the ephrin-RBD complex to the RBD of the Hendra RBD-RBD template, we apply the resulting rotational matrix to ephrin. Note also that we retain the water molecules sandwiched between ephrin and the RBD and apply the rotational matrix to these water molecules. We have found these interstitial waters are critical not only to the structural integrity of the RBD-ephrin interface, but also to the inception of the ephrin binding signal at the RBD-ephrin interface (48). The two constructed RBD-RBD dimers are energy minimized, solvated separately in salt solutions, and then subjected to MD. The ephrin-free state comprises 356,770 particles, and the ephrin-bound state comprises 435,254 particles.

Comparison of conformational ensembles

The traditional approach to compare two conformational ensembles of proteins, $\mathbb{R} = \{r_1, r_2, \ldots, r_m\}$ and $\mathbb{R}' = \{r'_1, r'_2, \ldots, r'_m\}$, where $r$ denotes a $3n$-dimensional coordinate and $m$ denotes the number of conformations in the ensemble, is to compare their respective summary statistics, like centers-of-mass (COMs) and root mean square fluctuations. However, if a subset of the summary statistics of the two ensembles is found to be identical, it does not imply that all of the $3n - 6$ summary statistics of the two ensembles will also be identical. The general problem of finding and choosing a feature that appropriately distinguishes two ensembles can be overcome by comparing ensembles directly against each other, and before any dimensionality reduction (28,29,48). A further advantage of comparing ensembles directly against each other is that the resulting quantification naturally embodies differences in conformational fluctuations.
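A small numerical illustration of why matching summary statistics do not guarantee matching ensembles: the two toy one-dimensional ensembles below share the same mean and variance yet differ in their fourth moment. The ensembles are hypothetical and chosen only to make the point; they are not protein data.

```python
from statistics import fmean, pvariance

# Hypothetical toy ensembles along a single coordinate.
ens_a = [-1.0, 1.0] * 3            # two equally populated states
s = 1.5 ** 0.5
ens_b = [-s, 0.0, s] * 2           # three equally populated states

mean_a, mean_b = fmean(ens_a), fmean(ens_b)
var_a, var_b = pvariance(ens_a), pvariance(ens_b)

# A statistic outside the compared subset tells the ensembles apart.
m4_a = fmean(x ** 4 for x in ens_a)
m4_b = fmean(x ** 4 for x in ens_b)

print(f"means:          {mean_a:.3f} vs {mean_b:.3f}")
print(f"variances:      {var_a:.3f} vs {var_b:.3f}")
print(f"fourth moments: {m4_a:.3f} vs {m4_b:.3f}")
```

Comparing only the mean and variance would declare these ensembles identical, which is the failure mode that a direct ensemble comparison avoids.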

We compare ensembles directly against each other using a method we developed recently (29). It quantifies the difference between two ensembles in terms of a metric, $\eta$, which satisfies two conditions: (1) $\eta(\mathbb{R} \to \mathbb{R}') = \eta(\mathbb{R}' \to \mathbb{R})$, and (2) if $\eta(\mathbb{R} \to \mathbb{R}') = \eta(\mathbb{R}' \to \mathbb{R}'')$, then it does not necessarily imply that $\eta(\mathbb{R} \to \mathbb{R}') = \eta(\mathbb{R} \to \mathbb{R}'')$. This metric is also universal in that it is not bounded by system type/size, and can be used to examine differences in ensembles at any structural hierarchy (functional groups, amino acids, or secondary structures).

Mathematically, $\eta$ is a function of the geometrical overlap between conformational ensembles, $\mathbb{R}$ and $\mathbb{R}'$,

$$\eta = 1 - \lVert \mathbb{R} \cap \mathbb{R}' \rVert. \tag{1}$$

It is normalized, that is, $\eta \in [0, 1)$, and it takes up a value closer to unity as the difference between the ensembles increases. $\lVert \mathbb{R} \cap \mathbb{R}' \rVert$ is estimated by solving an inverse machine learning problem. In the traditional sense, machine learning is used for data classification (49–51): the classification function, or machine ($F(r)$), is first trained on a set of instances with known group identities, and then used for predicting the group identity of an unclassified instance. In principle, the conformational ensembles $\mathbb{R}$ and $\mathbb{R}'$ can also serve as training data to train a classification function, $F(r)$, which can, in turn, be used to predict whether an unseen conformation belongs to $\mathbb{R}$ or $\mathbb{R}'$. We have shown that if $F(r)$ is constructed and trained appropriately, then the overlap between $\mathbb{R}$ and $\mathbb{R}'$ can be extracted from $F(r)$ (28).
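When the underlying distributions are known analytically, the exact value of Eq. 1 can be evaluated by direct numerical integration, which is the kind of reference value an estimate from $F(r)$ would be checked against. The sketch below is one-dimensional and assumes the overlap is the integral of the pointwise minimum of the two normalized densities, an interpretation suggested by the shaded regions described for the figure insets but not stated explicitly in the text. Function names, integration limits and the example mixtures are hypothetical.

```python
import math

def gaussian(x, mu, sigma):
    """Normalized one-dimensional Gaussian density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, components):
    """components: list of (weight, mu, sigma); weights should sum to 1."""
    return sum(c * gaussian(x, mu, s) for c, mu, s in components)

def eta(comp_a, comp_b, lo=-20.0, hi=20.0, steps=40000):
    """eta = 1 - overlap, with overlap taken as the integral of min(p, q)
    (an assumed definition), evaluated with the midpoint rule."""
    dx = (hi - lo) / steps
    overlap = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        overlap += min(mixture_pdf(x, comp_a), mixture_pdf(x, comp_b)) * dx
    return 1.0 - overlap

# Two unit-width Gaussians centered at -1 and +1.
single_a = [(1.0, -1.0, 1.0)]
single_b = [(1.0, 1.0, 1.0)]
eta_num = eta(single_a, single_b)

# Closed form for this equal-width pair: overlap = erfc(1 / sqrt(2)).
eta_closed = 1.0 - math.erfc(1.0 / math.sqrt(2.0))

# A bimodal mixture compared against a single Gaussian.
bimodal = [(0.5, -2.0, 0.5), (0.5, 2.0, 0.5)]
eta_bimodal = eta(bimodal, [(1.0, 0.0, 1.0)])

print(f"numerical eta = {eta_num:.4f}, closed form = {eta_closed:.4f}")
print(f"bimodal vs unimodal eta = {eta_bimodal:.4f}")
```

The same `eta` function accepts mixtures with any number of components, so it can also serve as the exact reference for multimodal test distributions.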

We have also demonstrated that this method works excellently and without need for any prior data fitting, provided we assume that the underlying distributions are Gaussian: the mean absolute error (MAE) between computed and analytical overlaps is 3.2% (29). The Gaussianity in a distribution, which is a corollary to the central limit theorem, is, however, a valid assumption only in systems where particles do not interact with each other. Therefore, deviations can be expected for protein systems that evolve under the influence of many-body interactions. Nevertheless, the overlap between two multi-Gaussian distributions, $\mathbb{R} = \sum_i c_i f_i$ and $\mathbb{R}' = \sum_i c'_i f'_i$, where $f_i$ are Gaussians and $c_i$ are weighting coefficients, is essentially a sum of overlaps between Gaussian distributions, that is,

$$\eta = 1 - \Big\lVert \sum_{i=1}^{n} c_i f_i \cap \sum_{j=1}^{n} c'_j f'_j \Big\rVert = 1 - \Big\lVert \sum_{i,j=1}^{n} c_i f_i \cap c'_j f'_j \Big\rVert. \tag{2}$$

Allosteric Stimulation of Nipah Host Binding Protein

Biophysical Journal 111, 1621–1630, October 18, 2016 1623
