Accelerating inference for complex stochastic models using Approximate Bayesian Computation, with an application to protein folding

Umberto Picchini, Centre for Mathematical Sciences, Lund University
joint work with Julie Forman (Dept. Biostatistics, Copenhagen University)
Dept. Mathematics, Uppsala, 4 Sept. 2014
This self-assembly process is called protein folding. It is the last and crucial step in the transformation of genetic information, encoded in DNA, into functional protein molecules.

Protein folding is also associated with a wide range of human diseases. In many neurodegenerative diseases, such as Alzheimer's disease, proteins misfold into toxic protein structures.

Protein folding has been named "the Holy Grail of biochemistry and biophysics" (!).
Modelling the time dynamics is difficult (large number of atoms in a 3D space); atom coordinates are usually projected onto a single dimension called the reaction coordinate, see the figure below.
Figure: Data time-course projected on a single coordinate: 25,000 measurements of the L-reaction coordinate of the small Trp-zipper protein at sampling freq. ∆⁻¹ = 1/nsec.
Here the L-reaction coordinate was used, i.e. the total distance to a folded reference. Notice the random switching between folded and unfolded states.
A non-exhaustive list of the difficulties with exact Bayesian inference via MCMC and SMC in our application:

- our dataset is "large" (25,000 observations)... not terribly large in an absolute sense, but it is when dealing with diffusion processes...
- ...in fact, even when a proposed parameter value is in the bulk of the posterior distribution, generated trajectories might still be too distant from the data (⇒ high rejection rate!)
- a high rejection rate implies poor exploration of the posterior surface, poor inferential results and increased computational time
- some of these issues can be mitigated using bridging techniques (Beskos et al. '13): not trivial in our case (the transformation τ(x) is unknown in closed form).
Ideally we would like to use/sample from the posterior. We assume this is either theoretically difficult or computationally expensive (it is in our case!).
ABC gives a way to approximate a posterior distribution π(η|z)
Key to the success of ABC is the ability to bypass the explicit calculation of the likelihood p(z|η)... only forward simulation from the model is required!

ABC is in fact a likelihood-free method that works by simulating pseudo-data zsim from the model:

zsim ∼ p(z|η)

ABC has had incredible success in genetic studies since the mid '90s (Tavaré et al. '97, Pritchard et al. '99). Lots of hype in recent years: see Christian Robert's excellent blog.
The previous algorithm samples from the approximate posteriors π(η | ‖z − zsim‖ < δ) or π(η | ‖S(z) − S(zsim)‖ < δ).

It is useful to consider summary statistics S(·) when dealing with large datasets, to increase the probability of acceptance.
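To make the vanilla ABC idea concrete, here is a minimal rejection-ABC sketch in Python (a toy illustration only: the Gaussian model, the flat prior and the sample-mean statistic are placeholder assumptions, not the protein-folding model of this talk):

```python
import numpy as np

rng = np.random.default_rng(0)

def prior_sample():
    # placeholder prior for eta: flat on (0, 10)
    return rng.uniform(0.0, 10.0)

def simulate_model(eta, n):
    # placeholder forward simulation from p(z | eta):
    # i.i.d. Gaussian data with mean eta
    return rng.normal(eta, 1.0, size=n)

def S(z):
    # placeholder summary statistic
    return np.mean(z)

def abc_rejection(z_obs, n_draws, delta):
    """Keep eta whenever |S(z_sim) - S(z_obs)| < delta, i.e. sample
    from pi(eta | ||S(zsim) - S(z)|| < delta)."""
    s_obs, accepted = S(z_obs), []
    for _ in range(n_draws):
        eta = prior_sample()
        z_sim = simulate_model(eta, len(z_obs))
        if abs(S(z_sim) - s_obs) < delta:
            accepted.append(eta)
    return np.array(accepted)

z_obs = simulate_model(3.0, 200)          # synthetic "observed" data
post = abc_rejection(z_obs, 20_000, 0.1)  # approximate posterior draws
```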
The key result of ABC

When S(·) is "sufficient" for η and δ ≈ 0, sampling from the posterior is (almost) exact!

When S(·) is sufficient for the parameter ⇒ π(η | ‖S(zsim) − S(z)‖ < δ) ≡ π(η | ‖zsim − z‖ < δ). When δ = 0 and S is sufficient, we accept only parameter draws for which zsim ≡ z ⇒ π(η|z), the exact posterior.

This is all good and nice, but such conditions rarely hold.
A central problem is how to choose the statistics S(·): outside the exponential family we typically cannot derive sufficient statistics. [A key work for obtaining statistics "semi-automatically" is Fearnhead-Prangle '12 (discussion paper in JRSS-B; very much recommended).]

We substitute sufficiency with the loose concept of an informative (enough) statistic, and then choose a small (enough) threshold δ.
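To give a flavour of the Fearnhead-Prangle idea (a rough sketch only, not their full procedure): simulate pairs (η, z) from the prior and the model, regress η on features of the data, and use the fitted regression, an estimate of E[η | z], as the summary statistic. All model choices below (Gaussian model, flat prior, the three features) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# training pairs (eta_i, z_i) drawn from the prior and the model
n_train, n_obs = 2000, 100
etas = rng.uniform(0.0, 10.0, size=n_train)
zs = rng.normal(etas[:, None], 1.0, size=(n_train, n_obs))

def features(z):
    # hand-picked regressors, computed row-wise on a 2-D array of datasets
    return np.column_stack([z.mean(axis=1), z.std(axis=1),
                            np.median(z, axis=1)])

# least-squares fit: S(z) = b0 + b' f(z), approximating E[eta | z]
X = np.column_stack([np.ones(n_train), features(zs)])
beta, *_ = np.linalg.lstsq(X, etas, rcond=None)

def S_semi_auto(z):
    # the fitted value is used as a scalar "semi-automatic" statistic
    f = features(np.atleast_2d(z))
    return float(np.column_stack([np.ones(len(f)), f]) @ beta)
```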
We now go back to our model and (large) data. We propose some tricks to accelerate the inference.

We will use an ABC-within-MCMC approach (ABC-MCMC).
ABC-MCMC acceptance ratio: given the current parameter value η ≡ ηold, generate a Markov chain via Metropolis-Hastings:

Algorithm 1: a generic iteration of ABC-MCMC (fixed threshold δ)
At the r-th iteration:
1. generate η′ ∼ u(η′ | ηold), e.g. using a Gaussian random walk
2. generate zsim | η′ from the model (forward simulation)
3. generate ω ∼ U(0, 1)
4. accept η′ if ω < min(1, [π(η′) K(S(zsim), S(z)) u(ηold | η′)] / [π(ηold) K(S(zold), S(z)) u(η′ | ηold)]); then set ηr = η′, else ηr = ηold.

Samples are from π(η, zsim | ‖S(zsim) − S(z)‖ < δ).
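A minimal Python sketch of Algorithm 1, under two common simplifying assumptions (not stated on this slide): a symmetric Gaussian random walk for u, and a uniform kernel K(S(zsim), S(z)) = 1{‖S(zsim) − S(z)‖ < δ}, so that the MH ratio reduces to the prior ratio times an indicator. `prior_pdf`, `simulate_model` and `S` are user-supplied placeholders as before:

```python
import numpy as np

rng = np.random.default_rng(2)

def abc_mcmc(z_obs, prior_pdf, eta_init, simulate_model, S, n_iter,
             delta, rw_scale=0.5):
    """ABC-MCMC with fixed threshold delta and the uniform kernel
    K(S(zsim), S(z)) = 1{ ||S(zsim) - S(z)|| < delta }.
    eta_init should lie in the support of the prior."""
    s_obs = np.atleast_1d(S(z_obs))
    dist = lambda z: np.linalg.norm(np.atleast_1d(S(z)) - s_obs)
    eta_old = eta_init
    chain = np.empty(n_iter)
    for r in range(n_iter):
        # step 1: Gaussian random-walk proposal (symmetric, so u cancels)
        eta_new = eta_old + rw_scale * rng.standard_normal()
        # step 2: forward simulation of pseudo-data from the model
        z_sim = simulate_model(eta_new, len(z_obs))
        # steps 3-4: with a uniform K the MH ratio is the prior ratio
        # times the indicator that the simulated summaries are delta-close
        if dist(z_sim) < delta and \
           rng.uniform() < prior_pdf(eta_new) / prior_pdf(eta_old):
            eta_old = eta_new
        chain[r] = eta_old
    return chain  # draws from pi(eta | ||S(zsim) - S(z)|| < delta)
```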
Algorithm 2: a generic iteration of ABC-MCMC (random threshold δ)
At the r-th iteration:
1. generate η′ ∼ u(η′ | ηold) and δ′ ∼ v(δ′ | δold)
2. generate zsim | η′ from the model (forward simulation)
3. generate ω ∼ U(0, 1)
4. accept (η′, δ′) with a ratio analogous to Algorithm 1, where the kernel K now also depends on the proposed δ′ and the ratio includes the corresponding prior and proposal terms for δ; see the post-hoc selection of δ below.
Top row: full dataset of 25,000 observations. Bottom row: every 30th observation is reported.
[Figure: time course vs. observation index (left) and marginal histogram of the data (right), for the full dataset (top row) and for the subsampled dataset retaining every 30th observation (bottom row).]
The dataset is 30 times smaller, but the qualitative features are still there!
Strategy for large datasets (Picchini-Forman ’14)
We have a dataset z of about 25,000 observations. Prior to starting ABC-MCMC, construct S(z) to contain:

1. the 15th, 30th, ..., 90th percentiles of the marginal distribution of the full data → to identify the Gaussian-mixture parameters µ1, µ2, etc.
2. values of the autocorrelation function of the full data z at lags (60, 300, 600, ..., 2100) → to identify the dynamics-related parameters θ, γ, κ.

During ABC-MCMC we simulate shorter trajectories zsim of size 25000/30 ≈ 800. We take as summary statistics S(zsim) the 15th, 30th, ..., 90th percentiles of the simulated data and the autocorrelations at lags (2, 10, 20, ..., 70) (recall zsim is 30× shorter than z). We then compare S(zsim) with S(z) inside ABC-MCMC. This is fast: S(·) for the large data can be computed before ABC-MCMC starts.
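A sketch of how such summaries can be assembled (the intermediate percentiles and lags are elided on the slides and are filled in here only for illustration; the plain empirical ACF below is an implementation assumption):

```python
import numpy as np

def acf(x, lags):
    # empirical autocorrelation of the series x at the given positive lags
    x = np.asarray(x, dtype=float) - np.mean(x)
    var = np.dot(x, x) / len(x)
    return np.array([np.dot(x[:-k], x[k:]) / (len(x) * var) for k in lags])

PCTL = [15, 30, 45, 60, 75, 90]                          # illustrative grid
LAGS_FULL = [60, 300, 600, 900, 1200, 1500, 1800, 2100]  # for the full z
LAGS_SIM = [2, 10, 20, 30, 40, 50, 60, 70]               # for the short zsim

def S_full(z):
    # computed once, before ABC-MCMC starts
    return np.concatenate([np.percentile(z, PCTL), acf(z, LAGS_FULL)])

def S_sim(z_sim):
    # computed for every simulated (30x shorter) trajectory
    return np.concatenate([np.percentile(z_sim, PCTL), acf(z_sim, LAGS_SIM)])

# inside ABC-MCMC one compares S_sim(zsim) against the precomputed S_full(z)
```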
Figure: Data (top), process Yt (middle), process Zt (bottom). Here Zt = Yt + Ut, with the processes evaluated at η = posterior mean.
A simulation study (Picchini-Forman ’14)
Here we want to compare ABC against (computationally intensive) exact Bayesian inference (via particle MCMC, pMCMC).

In order to do so we consider a very small dataset of 360 simulated observations.

We use a parallel strategy for pMCMC devised in Drovandi '14 (4 chains run in parallel using 100 particles for each chain).

C. Drovandi (2014). Pseudo-marginal algorithms with multiple CPUs. Queensland University of Technology, http://eprints.qut.edu.au/61505/
U.P. and J. Forman (2014). Accelerating inference for diffusions observed with measurement error and large sample sizes using Approximate Bayesian Computation. arXiv:1310.0973.
As long as we manage to "compress" information into summary statistics, ABC is a useful inferential tool for complex models and large datasets.

1,000 ABC-MCMC iterations are performed in 6 seconds, versus about 20 minutes with exact Bayesian sampling (pMCMC).

...The problem is that ABC requires lots of tuning (choosing S(·), δ, K(·), ...). A MATLAB implementation is available at http://sourceforge.net/projects/abc-sde/ with a 50+ page manual.
U.P. (2014). Inference for SDE models via Approximate Bayesian Computation. J. Comp. Graph. Stat.

U.P. and J. Forman (2013). Accelerating inference for diffusions observed with measurement error and large sample sizes using Approximate Bayesian Computation. arXiv:1310.0973.

U.P. (2013). abc-sde: a MATLAB toolbox for approximate Bayesian computation (ABC) in stochastic differential equation models. http://sourceforge.net/projects/abc-sde/
The proof is straightforward. We know that a draw (η′, zsim) produced by the algorithm is such that (i) η′ ∼ π(η), and (ii) zsim = z, where zsim ∼ π(zsim | η′). Thus let f(η′) denote the (unknown) density of such an η′; then, because of (i) and (ii), f(η′) is exactly the posterior density, as the display below shows.
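Written out as a chain of proportionalities (a sketch; stated for discrete-valued z, so that Pr(zsim = z | η′) = π(z | η′)):

f(η′) ∝ π(η′) · Pr(zsim = z | η′) = π(η′) · π(z | η′) ∝ π(η′ | z),

i.e. the accepted η′ are draws from the exact posterior π(η | z).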
Suppose at a given iteration of Metropolis-Hastings we are in the (augmented) state (θ#, x#) and wonder whether to move (or not) to a new state (θ′, x′). The move is generated via a proposal distribution q((θ#, x#) → (θ′, x′)), e.g. q((θ#, x#) → (θ′, x′)) = u(θ′ | θ#) v(x′ | θ′).

The move (θ#, x#) → (θ′, x′) is accepted with probability

min(1, [π(θ′) π(x′ | θ′) K(S(x′), S(z)) u(θ# | θ′) v(x# | θ#)] / [π(θ#) π(x# | θ#) K(S(x#), S(z)) u(θ′ | θ#) v(x′ | θ′)]);

when the pseudo-data are proposed from the model itself, v(x | θ) = π(x | θ), the intractable model densities cancel against the v-terms, leaving exactly the ratio of Algorithm 1.
HOWTO: post-hoc selection of δ (the "precision" parameter) [Bortot et al. 2007]
During ABC-MCMC we let δ vary (according to a Metropolis random walk): at the r-th iteration, δr = δr−1 + ∆ with ∆ ∼ N(0, ν²). After the end of the MCMC we have a sequence {θr, δr}, r = 0, 1, 2, ..., and for each parameter {θj,r} we produce a plot of the parameter chain vs. δ:

- we filter out of the analysis those draws {θr} corresponding to "large" δ, for statistical precision, and retain only those {θr} corresponding to a low δ; in the example we retain {θr : δr < 1.5}
- PRO: this allows an ex-post selection of δ, i.e. we do not need to know a suitable value for δ in advance
- CON: by filtering out some of the draws, we need to run very long MCMC simulations in order to have enough "material" on which to base our posterior inference
- PRO: also notice that by letting δ vary we are almost considering a global optimization method (similar to simulated tempering).
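A sketch of the filtering step in Python (`chain_theta` and `chain_delta` stand for the MCMC outputs {θr} and {δr}; the cutoff 1.5 is the one used in the example above):

```python
import numpy as np

def posthoc_select(chain_theta, chain_delta, delta_max=1.5, burn_in=0):
    """Retain only the draws theta_r whose associated threshold delta_r
    is below delta_max (ex-post selection of delta, Bortot et al. 2007)."""
    theta = np.asarray(chain_theta)[burn_in:]
    delta = np.asarray(chain_delta)[burn_in:]
    return theta[delta < delta_max]

# usage: posterior summaries based only on the retained draws, e.g.
# theta_kept = posthoc_select(chain_theta, chain_delta, delta_max=1.5)
# print(theta_kept.mean(axis=0), theta_kept.std(axis=0))
```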