Non-parametric Approximate Bayesian Computation for Expensive Simulators

Steven Laan
6036031

Master's thesis
42 EC

Master's programme Artificial Intelligence
University of Amsterdam

Supervisor
Ted Meeds

A thesis submitted in conformity with the requirements for the degree of MSc. in Artificial Intelligence

August 26, 2014
Acknowledgements
I would like to express my gratitude to my supervisor Ted Meeds for his
patience and advice every time I stumbled into his office. Furthermore I
would like to thank Max Welling for his helpful suggestions and remarks
along the way.
I would like to thank my thesis committee, Henk, Max and Ted, for agreeing
on a defence date on rather short notice. I would like to thank Henk Zeevat
in particular, for chairing the committee just after his vacation.
Finally, I would like to thank my girlfriend and family for their endless
support.
Abstract
This study investigates new methods for Approximate Bayesian Computation (ABC). The main goal is to create ABC methods that are easy to use by researchers in other fields and perform well with expensive simulators.

Kernel ABC methods have been used to approximate non-Gaussian densities; however, these methods are inefficient in their use of simulations in two ways. First, they do not assess the required number of simulations: for some parameter settings only a small number of simulations is needed. Second, all known methods throw away the entire history of simulations, ignoring the valuable information it holds.

This thesis addresses these problems by introducing three new algorithms: Adaptive KDE Likelihood ABC (AKL-ABC), Projected Synthetic Surrogate ABC (PSS-ABC) and Projected KDE Surrogate ABC (PKS-ABC). The first one, AKL-ABC, performs a kernel density estimation at each parameter location and adaptively chooses the number of simulations at each location. The other two algorithms take advantage of the simulation history to estimate the simulator outcomes at the current parameter location. The difference between them is that PSS-ABC assumes a Gaussian conditional probability and hence is a parametric method, whereas PKS-ABC is non-parametric, performing a kernel density estimate.

Experiments demonstrate the advantages of these algorithms, in particular with respect to the number of simulations needed. Additionally, the flexibility of non-parametric methods is illustrated.
List of Abbreviations
ABC Approximate Bayesian Computation
MCMC Markov chain Monte Carlo
KDE Kernel density estimate
GP Gaussian process
MH Metropolis-Hastings
LOWESS Locally weighted linear regression
(A)SL (Adaptive) Synthetic Likelihood
(A)KL (Adaptive) KDE Likelihood
PSS Projected Synthetic Surrogate
PKS Projected KDE Surrogate
List of Symbols

θ                      Vector of parameters
y⋆                     Vector of observed values
E(·)                   Expected value of a random variable
U(a, b)                Uniform distribution on the interval [a, b]
N(µ, Σ)                Normal distribution with mean vector µ and covariance matrix Σ
Gam(α, β)              Gamma distribution with shape α and rate β
k_h(·)                 A kernel function with bandwidth h
π(x | y)               Probability of x given y
KDE(x⋆ | k, h, w, X)   The kernel density estimate at location x⋆, using kernel k with bandwidth h, weights w, and data points X
The pseudo-code for the resulting Synthetic Likelihood ABC algorithm is
given in appendix B.
2.2.3 Adaptive Synthetic Likelihood
One downside to Synthetic Likelihood ABC (SL-ABC) is that there is no
parameter to set the accuracy of the method. There is of course the number
of simulations at each location, but this number gives no guarantee in terms
of any error criterion. Moreover, the method may be more prone to making
acceptance/rejection errors at some parameter locations than at others.
Hence, sampling the same number of times at each location may not be optimal.
Therefore, instead of retrieving an equal number of samples at each location,
the idea is to initially call the simulator S0 times. Additional simulations
are only performed when needed. More formally: keep track of the probability
of making a Metropolis-Hastings error in accepting/rejecting the sample.
When this error is larger than some threshold value ξ, more simulations are
required.
The first example of an adaptive method is the Adaptive Synthetic Likelihood
algorithm (ASL-ABC) introduced by Meeds and Welling [25]. To be able
to estimate the probability of an acceptance error, an estimate of the
acceptance distribution is needed. This can be done by creating M sets of S
samples and calculating the acceptance probability for each set. To obtain
the samples, the simulator could be called S · (M − 1) extra times. However,
because the synthetic likelihood assumes a normal distribution for each slice,
the variance of the mean is proportional to 1/S. Thus, instead of calling the
simulator, sample means can be drawn directly from a normal distribution.
Each sample mean µm can be used to calculate one acceptance probability
αm using equation (10). With these probabilities αm the acceptance error
can be computed. There are two possibilities for making an error:
1. Reject, while we should accept.
2. Accept, while we should reject.
Hence, the total error, conditioned on the uniform sample u ∼ U(0, 1), becomes:

E_u(α) = (1/M) ∑_m [α_m < u]   if u ≤ τ,
E_u(α) = (1/M) ∑_m [α_m ≥ u]   if u > τ.    (11)
where τ is the decision threshold. The total unconditional error can then be
obtained by integrating out u:

E(α) = ∫ E_u(α) U(0, 1) du    (12)
This can be done by Monte Carlo integration or by using grid values for u.
Equation (12) is known as the mean absolute deviation of p(α), which is
minimized for τ = median(α) [25]. Hence, in the algorithm the decision
threshold is set to the median of the α samples. The pseudocode for the algorithm
is given in appendix B.
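The error estimate described above can be sketched in a few lines. A minimal sketch, assuming a set of acceptance probabilities α_m has already been computed from M sampled means; the function and variable names are illustrative, not from the thesis:

```python
import random
import statistics

def acceptance_error(alphas, tau, n_u=1000, rng=random):
    """Monte Carlo estimate of the unconditional MH error E(alpha)
    (equations (11)-(12)): average the conditional error E_u(alpha)
    over draws u ~ U(0, 1)."""
    m = len(alphas)
    total = 0.0
    for _ in range(n_u):
        u = rng.random()
        if u <= tau:
            # error case 1: reject while we should accept
            total += sum(a < u for a in alphas) / m
        else:
            # error case 2: accept while we should reject
            total += sum(a >= u for a in alphas) / m
    return total / n_u

# Hypothetical acceptance probabilities from M sampled means
alphas = [0.62, 0.55, 0.70, 0.58, 0.65]
tau = statistics.median(alphas)   # the median minimizes the error (MAD)
err = acceptance_error(alphas, tau)
```

If `err` exceeds the threshold ξ, more simulations would be requested, as in the adaptive loop described above.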
2.3 KDE Likelihood ABC
The SL-ABC and ASL-ABC algorithms assume that the likelihood function
at each parameter location is Gaussian, which may not be the case. If
the underlying density has, for example, multiple modes or a heavy tail,
the resulting Gaussian fit can be very poor.
Hence, instead of assuming a certain form for the likelihood, a non-parametric
estimate can be advantageous in certain scenarios [27, 41, 44]. Moreover,
Turner [41] shows that when using a kernel density estimate (KDE) there is
no need for sufficient statistics.
The acceptance probability when using a KDE becomes:

α = min(1, [∑_s k_h(y⋆ − x′_s) · π(θ′) q(θ | θ′)] / [∑_s k_h(y⋆ − x_s) · π(θ) q(θ′ | θ)])    (13)
where k_h(·) is a kernel function with bandwidth h. The resulting Metropolis-
Hastings step is shown in algorithm 3. Note that the bandwidth is re-estimated
at each parameter location. The algorithm performs a simulation at
both the current and the proposed location each iteration; hence this is the
marginal version of the algorithm.
Algorithm 3 The MH-step for KL-ABC.
1:  procedure KL-ABC MH-step(q, θ, π(θ), π(x | θ), y⋆, S, ∆S)
2:      θ′ ∼ q(θ′ | θ)
3:      for s ← 1 to S do
4:          x_s ∼ π(x | θ)        (simulator call)
5:          x′_s ∼ π(x | θ′)      (simulator call)
6:      end for
7:      Set α using equation (13)
8:      if U(0, 1) ≤ α then
9:          return θ′
10:     end if
11:     return θ
12: end procedure
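The acceptance probability of equation (13) can be computed directly from the two sets of simulated values. Below is a minimal sketch for scalar outputs with a Gaussian kernel; `kl_abc_alpha` and its arguments are illustrative names, and the prior and proposal ratios are assumed to be precomputed and folded into a single factor each:

```python
import math

def gauss_kernel(u, h):
    """Gaussian kernel k_h(u) with bandwidth h."""
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2 * math.pi))

def kl_abc_alpha(y_obs, x_cur, x_prop, h, prior_ratio=1.0, q_ratio=1.0):
    """Acceptance probability of equation (13) for scalar outputs:
    the ratio of KDE likelihoods at y_obs under the proposed vs the
    current samples, times the prior ratio pi(theta')/pi(theta) and
    proposal ratio q(theta|theta')/q(theta'|theta)."""
    num = sum(gauss_kernel(y_obs - xp, h) for xp in x_prop)
    den = sum(gauss_kernel(y_obs - xc, h) for xc in x_cur)
    if den == 0.0:
        return 1.0  # degenerate case: no density mass at the current state
    return min(1.0, (num / den) * prior_ratio * q_ratio)
```

Note that, as in the marginal algorithm above, both sums would be recomputed from fresh simulations each iteration.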
2.4 Population Methods
Instead of working with one sample of θ at a time, it is also possible to keep a
population of samples, sometimes called particles. Examples are the population
Monte Carlo (PMC) and sequential Monte Carlo (SMC) approaches, which have
the advantage that the resulting samples are mutually independent. This
is in contrast to single-chain ABC-MCMC methods, where thinning is often
used to obtain more independent samples.1
Another advantage of population methods is that they are quite easily
parallelized and therefore may result in speedups [13].
A disadvantage of population methods is that they require more simulation
calls: for each particle a simulation call is needed at each iteration.
Therefore, with N particles, population methods need on the order of N
additional simulation calls per iteration compared to the single-chain algorithms.
For costly simulators the number of simulation samples needs to be minimized.
For this reason, we will only focus on single-chain methods. It should
be noted that the ideas behind the proposed algorithms could also be viable
for PMC or SMC algorithms.
2.5 Surrogate Methods
For fast simulators, local methods such as SL-ABC and marginal-ABC are
perfectly fine. If, however, simulation is costly, it is desirable to incorporate
the information gained by simulations at all previous locations.
Therefore the question becomes: how can the information of the entire
simulation history be used in the estimate for the current location?
The idea is to aggregate all information of the samples in a surrogate
function. This can be done in two ways: either you model the simulator or
you model the likelihood function with the surrogate. Both approaches will
be described in the next sections.
Surrogate methods have been implemented in different research areas
before [20, 21, 32], but are relatively new in the field of ABC [25, 45].
1 Thinning is the process of only keeping every Nth sample. However, to keep a reasonable number of samples the chain needs to be run for a longer time.
2.5.1 Simulator Surrogate ABC
In the first case, the surrogate function tries to model the simulator, but it
should be computationally cheaper to run. Instead of calling the simulator
for the current θ location directly, first the surrogate function is queried
about its estimate. If the surrogate has small uncertainty about the queried
value, the value returned by the surrogate is treated like it was from the real
simulator. Otherwise, if the uncertainty is too high, additional samples are
obtained from the simulator. After enough samples, the surrogate should be
certain enough in all locations and hence not require any additional simulator
calls.
The first surrogate method of this kind is the Gaussian Process Surrogate
(GPS-ABC) by Meeds and Welling [25]. As the name suggests, they fitted
a Gaussian process to the acquired samples as a surrogate. They showed
that a huge reduction in simulator calls is possible, while retaining similar
performance.
A nice property of modelling the simulator is that previously acquired
samples can be used to train the model. Moreover, the results of different
runs of the algorithm can be merged to obtain a better approximation.
While Gaussian processes are a potent candidate for surrogate functions,
a couple of problems surface in practice. The computational
complexity of GPS-ABC is high. Moreover, to be able to run the algorithm,
the hyperparameters of the Gaussian process need to be set, and these can be
difficult to elicit beforehand.
A downside of GPS-ABC is that for simulators with J output dimensions
the algorithm models J independent GPs instead of one J-dimensional output.
Hence the different outputs are assumed to be independent, which may
not be the case.
For multimodal and other non-Gaussian conditional probabilities, the
Gaussian assumption of GPS-ABC is incorrect, which leads to incorrect
posterior distributions.
2.5.2 Likelihood Surrogate ABC
Instead of modelling the simulator, the (log-)likelihood surface can be
modelled. This is much easier than modelling a high-dimensional simulator,
as the likelihood is a one-dimensional function of θ.
When a decent model of the likelihood surface has been obtained, regions
of unlikely space can be ruled out. That is, no more samples should be
accepted in regions of low likelihood.
Wilkinson first proposed this approach using Gaussian Processes [45].
To discard areas of low likelihood he employs a technique called sequential
history matching [9]. This is a hierarchical approach to rule out regions of
space. At each step a new model is built, modelling only the region that
was labelled plausible by the previous models. This way, each model gives a
more detailed view of the log-likelihood surface.
An advantage of this approach over the GPS-ABC algorithm, which models
the simulator, is that only a single Gaussian process is fitted, whereas
GPS-ABC has a GP for each dimension of the output. On the other hand,
at each sequential history-matching step a new GP is fit and hence needs to be tuned.
A downside of this approach is that the resulting likelihood surface is
tightly connected to the parameter settings of the experiment. For example,
changing the error criterion ε changes the entire likelihood surface, and hence
the results of one experiment are not directly reusable in another.
3 Kernel ABC Methods
In this section we propose three new algorithms. First, however, we give
short descriptions of both the existing algorithms and the newly proposed
ones, together with a table that provides an overview of the properties of
the various algorithms.
3.1 Overview of ABC methods
A short description of each algorithm follows; the ones marked with an
asterisk (*) are proposed later in this thesis:
Synthetic Likelihood assumes a Gaussian conditional distribution, which
is estimated using a fixed number of local samples.

Adaptive Synthetic Likelihood assumes a Gaussian conditional distribution,
which is estimated using an adaptively chosen number of local samples.

KDE Likelihood approximates the conditional distribution using a KDE
based on a fixed number of local samples.

Adaptive KDE Likelihood* approximates the conditional distribution using
a KDE based on an adaptively chosen number of local samples.

Projected Synthetic Surrogate* assumes a Gaussian conditional distribution,
estimated using a weighted estimate of mean and variance based
on all samples from the simulation history, which are projected
onto the current parameter location.

Projected KDE Surrogate* approximates the conditional distribution using
a weighted KDE of projected samples from the simulation history.
There are different properties that ABC algorithms can have. We list the
existing algorithms as well as the ones proposed later in table 1.
For reading convenience the abbreviations are stated in the "Abbreviation"
column. The column labelled "Local" states whether the algorithm
is local or global in nature: local algorithms only use samples at the current
parameter location, whereas global methods incorporate the samples from other
locations as well. The "Parametric" column states whether the algorithm
makes assumptions about the underlying structure. Finally, whether the
algorithm adapts the number of samples to the parameter location is stated
in the last column, labelled "Adaptive".
Algorithm Abbreviation Local Parametric Adaptive
Synthetic Likelihood SL Yes Yes No
Adaptive Synthetic Likelihood ASL Yes Yes Yes
KDE Likelihood KL Yes No No
Adaptive KDE Likelihood AKL Yes No Yes
Projected Synthetic Surrogate PSS No Yes Yes
Projected KDE Surrogate PKS No No Yes
Table 1: Different ABC algorithms and their properties.
3.2 Adaptive KDE Likelihood ABC
In the same way the adaptive synthetic likelihood differs from the original
synthetic likelihood, we built an adaptive version of the KDE Likelihood
algorithm.
As before, a distribution over acceptance probabilities is needed. Hence
we need the variance of the estimator.
One solution for this problem is to just get additional sets of S simula-
tions, but this is too costly and other methods are preferred.
The first solution is to use well-known asymptotic results for the variance
of a KDE. The approximation for the variance is [14]:

σ(x) = √( π(x) / (N h) · ∫ k(u)² du )    (14)

where π(x) is the KDE at location x, N is the number of training points,
h is the bandwidth, and the integral is known as the roughness of the kernel,
which can be computed analytically for the commonly used kernels.
When the variance has been computed, several samples from the normal
distribution N (µ(x), σ(x)) can be drawn to create an acceptance distribu-
tion, where µ(x) = π(x) is the KDE at location x.
There are, however, two problems with this approach. The asymptotic
theory from which equation (14) is derived assumes that N → ∞, whereas we
are in a regime where we specifically want to limit the number of simulation
calls. Hence equation (14) is only valid for large values of N.
The other problem is that we are now introducing a parametric estimate
into the non-parametric algorithm. The goal was to have as few assumptions
as possible, so any methods that do not assume a Gaussian likelihood are
preferred.
One such method is bootstrapping. A bootstrap sample of the simulated
values is taken and a new kernel density estimate is performed. This new
KDE yields another acceptance probability. Hence, this bootstrap process
can be repeated M times to obtain a distribution over acceptance probabilities.
This is the approach used in our new Adaptive KDE Likelihood
(AKL-ABC) algorithm.
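The bootstrap step can be sketched as follows; `alpha_fn` is a hypothetical callback that computes the acceptance probability of equation (13) from one pair of resampled simulation sets, and the names are illustrative:

```python
import random

def bootstrap_alphas(x_cur, x_prop, alpha_fn, M=25, rng=random):
    """Resample the simulated values with replacement M times and
    recompute the acceptance probability on each bootstrap replicate,
    yielding a distribution over acceptance probabilities."""
    S = len(x_cur)
    alphas = []
    for _ in range(M):
        xb = [rng.choice(x_cur) for _ in range(S)]    # bootstrap x_1..x_S
        xpb = [rng.choice(x_prop) for _ in range(S)]  # bootstrap x'_1..x'_S
        alphas.append(alpha_fn(xb, xpb))
    return alphas
```

The resulting list of α_m values would then feed the same error estimate E(α) and median threshold τ as in ASL-ABC.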
The pseudocode of this AKL-ABC is very similar to that of ASL-ABC.
The only difference is how the acceptance probabilities are estimated. The
pseudocode is shown in algorithm 4.
3.3 Kernel Surrogate Methods
We propose two new surrogate methods that aim to address the shortcomings
of the Gaussian process surrogate method described in section 2.5.1.
The reason we chose to model the simulator, instead of the likelihood
function, is that when modelling the simulator, the results for one parameter
setting of the algorithm can be reused for a different setting, as the
modelled simulator remains the same. This property is especially desirable when
working with expensive simulators.
When modelling the log-likelihood this is not necessarily the case, as the
likelihood surface is tightly connected to, for example, the ε parameter.
9:      end for
10:     S ← S + c
11:     c ← ∆S
12:     for m ← 1 to M do
13:         Obtain bootstrap samples of x_1, …, x_S and x′_1, …, x′_S
14:         Set α_m using equation (13)
15:     end for
16:     τ ← median(α)
17:     Set E(α) using equation (12)
18:  until E(α) ≤ ξ
19:  if U(0, 1) ≤ τ then
20:      return θ′
21:  end if
22:  return θ
23: end procedure
3.3.1 Projected Synthetic Surrogate
The GPS-ABC algorithm uses a Gaussian process as a surrogate function.
This GP provides for each parameter setting a mean and standard deviation.
The main idea behind Projected Synthetic Surrogate ABC (PSS-ABC) is
that we can compute these statistics using kernel regression.
The computation of the mean and standard deviation using kernel regression
is done using equations (15) and (16):

f(θ⋆) = µ(θ⋆) = [∑_{n=1}^{N} k_h(θ_n − θ⋆) · y_n] / [∑_{n=1}^{N} k_h(θ_n − θ⋆)]    (15)
[Figure: two panels plotting y against θ; curves: orthogonal, linear corrected, true density.]
Figure 1: Linear corrected projection versus orthogonal projection with different numbers of training points.
σ²(θ⋆) = [∑_{n=1}^{N} k_h(θ_n − θ⋆) · (y_n − µ(θ⋆))²] / [∑_{n=1}^{N} k_h(θ_n − θ⋆)]    (16)

where θ⋆ is the proposed parameter location and θ_n is the nth parameter
location from the simulation history, with corresponding simulator output y_n.
The derivation of equation (15) can be found in appendix A.4.
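Equations (15) and (16) amount to a kernel-weighted mean and variance. A minimal sketch for scalar θ and y, using the tricube kernel as the weight function; all names are illustrative:

```python
def tricube(u):
    """Tricube kernel, zero outside |u| < 1."""
    a = abs(u)
    return (1 - a ** 3) ** 3 if a < 1 else 0.0

def kernel_regression(theta_star, thetas, ys, h):
    """Nadaraya-Watson style estimate of equations (15) and (16):
    kernel-weighted mean and variance of the simulator outputs ys
    around the proposed location theta_star."""
    w = [tricube((t - theta_star) / h) for t in thetas]
    sw = sum(w)
    mu = sum(wi * yi for wi, yi in zip(w, ys)) / sw
    var = sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, ys)) / sw
    return mu, var
```

The bandwidth h plays the same role as in the KDE: it controls how far from θ⋆ history points still contribute weight.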
Kernel regression is a local constant estimator. This can be viewed as
an orthogonal projection of neighbouring points onto the vertical line at
the parameter location θ. A weighted average of the projected points is
calculated using the kernel weights. This is the kernel regression estimate
f(θ) for that location. However, projecting orthogonally can lead to overly
dispersed estimates, which is illustrated in figure 1.
Thus instead of always projecting orthogonally, we first perform a local
linear regression [7] and then project along this line, or hyperplane in higher
dimensions.
This idea of linear correction has been used in the ABC framework before
[4, 5]. The difference is that, up to now, it has been used as a post-processing
step, whereas here it is an integral part of the algorithm. Moreover, because
it is a post-processing step there, those methods project onto the θ axis,
whereas we project onto a line perpendicular to it.
The locally weighted linear regression (LOWESS) [7] assumes the following
regression model:

y = Θβ + ε    (17)
where ε is a vector of zero-mean Gaussian noise, Θ is the N by D design matrix
consisting of N data points θ_n with dimension D, and β is the D-dimensional
vector of regression coefficients. Note that we set the zeroth entry of each
data point to 1, hence β_0 is the intercept.
To compute the estimated value y for a proposed location θ⋆, locally
weighted regression first computes a kernel weight w_n for each data point θ_n:

w_n = k_h(θ_n − θ⋆)    (18)

Cleveland suggests using the tricube kernel for these weights [7]. Then the
following system of equations needs to be solved to obtain the regression
coefficients:
⎡ ∑w         ∑w Θ_1        ⋯   ∑w Θ_D       ⎤         ⎡ ∑w y       ⎤
⎢ ∑w Θ_1     ∑w Θ_1²       ⋯   ∑w Θ_1 Θ_D   ⎥ · β  =  ⎢ ∑w y Θ_1   ⎥    (19)
⎢ ⋮           ⋮             ⋱   ⋮            ⎥         ⎢ ⋮           ⎥
⎣ ∑w Θ_D     ∑w Θ_D Θ_1    ⋯   ∑w Θ_D²      ⎦         ⎣ ∑w y Θ_D   ⎦

where w is the vector of kernel weights for the N training points, the sums
run over the training points, and Θ_d denotes the dth column vector of Θ,
containing the dth entry of every training point, i.e. Θ_2 = [Θ_12, Θ_22, …, Θ_N2].
The resulting solution2 for β is the vector of regression coefficients of the
linear equation that describes the trend at location θ⋆. This β therefore
provides a hyperplane along which the data can be projected.
2 The system can easily be solved by most linear algebra packages.
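For a single parameter dimension (D = 1) the weighted normal equations (19) reduce to a 2 × 2 system with a closed-form solution. A minimal sketch, using the tricube kernel for the weights as suggested above; the names are illustrative:

```python
def tricube(u):
    """Tricube kernel, zero outside |u| < 1."""
    a = abs(u)
    return (1 - a ** 3) ** 3 if a < 1 else 0.0

def lowess_line(theta_star, thetas, ys, h):
    """Solve the weighted normal equations (19) for D = 1, giving the
    intercept b0 and slope b1 of the local trend at theta_star."""
    w = [tricube((t - theta_star) / h) for t in thetas]
    sw = sum(w)
    swt = sum(wi * t for wi, t in zip(w, thetas))
    swtt = sum(wi * t * t for wi, t in zip(w, thetas))
    swy = sum(wi * y for wi, y in zip(w, ys))
    swty = sum(wi * t * y for wi, t, y in zip(w, thetas, ys))
    # 2x2 system: [sw, swt; swt, swtt] [b0, b1]^T = [swy, swty]^T
    det = sw * swtt - swt * swt
    b1 = (sw * swty - swt * swy) / det
    b0 = (swy - b1 * swt) / sw
    return b0, b1
```

The returned slope defines the line along which neighbouring samples are projected onto the θ⋆ slice.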
When β is the zero vector, the projection is along a flat hyperplane, i.e.
orthogonally onto the θ⋆ slice.
There are border cases where the linear correction can become overconfident,
for example when there are few samples near the proposed location. In the
extreme case the regression is based on only two samples, so the fitted
line simply passes through these two points, which can be very different
from the actual trend.
To overcome this problem, the algorithm needs to recognise these situations,
or at least assess the uncertainty in them. An elegant solution is to
use smoothed bootstrapping [11]. Smoothed bootstrapping resamples from
a set of samples but, unlike ordinary bootstrapping, adds (Gaussian) noise
to the newly obtained samples. The variance of this noise depends on the
size of the set of original samples, usually σ = 1/√N. As a result there
is more variance in resampled values when there are few samples, and hence
more uncertainty in the local regression. Because of this noise, the computed
hyperplanes will vary much more when the pool of samples is small.
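A smoothed bootstrap replicate as described here can be sketched in a few lines; the σ = 1/√N noise scale follows the text, and the names are illustrative:

```python
import math
import random

def smoothed_bootstrap(samples, rng=random):
    """Resample with replacement and add Gaussian noise with standard
    deviation 1/sqrt(N), so replicates vary more when data are scarce."""
    n = len(samples)
    sigma = 1.0 / math.sqrt(n)
    return [rng.choice(samples) + rng.gauss(0.0, sigma) for _ in range(n)]
```

Each replicate would then drive one local regression, so the spread of the fitted hyperplanes reflects the scarcity of nearby samples.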
The pseudocode for PSS-ABC is shown in algorithm 5. Note that a diagonal
covariance matrix is computed, which is equivalent to assuming independence
between the output variables. This is because a full-rank covariance
matrix is much harder to estimate with few points.
The acquisition of new training points at line 21 happens at either the
current parameter location θ or the proposed parameter location θ′, each with
probability 0.5. More sophisticated methods could be implemented, but this
simple procedure worked fine in our experiments.
3.4 Projected Kernel Density Estimated Surrogate
Instead of assuming that the conditional distribution π(y | θ) is Gaussian,
this conditional distribution can also be approximated using a kernel density
estimate.
As with PSS-ABC, the points from the simulation history are projected
onto the θ⋆ slice. Then, instead of computing the weighted mean and variance
of a Gaussian, a weighted kernel density estimate is performed.
Algorithm 5 The MH-step for PSS-ABC.
1:  procedure PSS-ABC MH-step(q, θ, π(θ), π(x | θ), y⋆, ∆S, M, ξ)
2:      θ′ ∼ q(θ′ | θ)
3:      repeat
4:          Compute weights using equation (18)
5:          for m ← 1 to M do
6:              Get smoothed bootstrap samples of y
7:              for j ← 1 to J do
8:                  Compute β_j and β′_j using a local regression at θ and θ′
9:                  Project bootstrapped y_j along hyperplane β_j
10:                 Compute µ_j, σ_j using equations (15) and (16)
11:                 Project bootstrapped y_j along hyperplane β′_j
12:                 Compute µ′_j, σ′_j using equations (15) and (16)
⋮
16:             Set α_m using equation (10)
17:         end for
18:         τ ← median(α)
19:         Set E(α) using equation (12)
20:         if E(α) > ξ then
21:             Acquire ∆S new training points
22:         end if
23:     until E(α) ≤ ξ
24:     if U(0, 1) ≤ τ then
25:         return θ′
26:     end if
27:     return θ
28: end procedure
The resulting MH-step is shown in algorithm 6.
Algorithm 6 The MH-step for PKS-ABC.
1:  procedure PKS-ABC MH-step(q, θ, ∆S, M, ξ, y⋆)
2:      θ′ ∼ q(θ′ | θ)
3:      repeat
4:          Compute weights using equation (18)
5:          for m ← 1 to M do
6:              Get smoothed bootstrap samples of y
7:              p ← 1, p′ ← 1
8:              for j ← 1 to J do
9:                  Compute β_j and β′_j using a local regression at θ and θ′
10:                 Z_j ← projection of bootstrapped y_j along hyperplane β_j
11:                 Z′_j ← projection of bootstrapped y_j along hyperplane β′_j
12:                 p ← p · KDE(y⋆_j | k, h, Z_j)
13:                 p′ ← p′ · KDE(y⋆_j | k, h, Z′_j)
14:             end for
15:             α_m ← min(1, [q(θ | θ′) π(θ′) p′] / [q(θ′ | θ) π(θ) p])
16:         end for
17:         τ ← median(α)
18:         Set E(α) using equation (12)
19:         if E(α) > ξ then
20:             Acquire ∆S new training points
21:         end if
22:     until E(α) ≤ ξ
23:     if U(0, 1) ≤ τ then
24:         return θ′
25:     end if
26:     return θ
27: end procedure
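Lines 12-13 of algorithm 6 multiply one-dimensional KDE values across the output dimensions. A minimal sketch with a Gaussian kernel and a shared bandwidth h; the names are illustrative, and `projected` stands for the per-dimension projected sample sets Z_j:

```python
import math

def gauss_kde(x_star, data, h):
    """Kernel density estimate KDE(x* | k, h, X) with a Gaussian kernel."""
    return sum(math.exp(-0.5 * ((x_star - x) / h) ** 2)
               for x in data) / (len(data) * h * math.sqrt(2 * math.pi))

def pks_likelihood(y_obs, projected, h):
    """Product over output dimensions j of KDE(y*_j | k, h, Z_j),
    as in lines 12-13 of algorithm 6."""
    p = 1.0
    for y_j, z_j in zip(y_obs, projected):
        p *= gauss_kde(y_j, z_j, h)
    return p
```

The product across dimensions mirrors the independence assumption already made by the diagonal covariance in PSS-ABC.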
3.5 Ease of Use
The proposed methods aim to be as simple to use as possible. In contrast
to Gaussian processes, there is no difficult hyperparameter tuning task,
such as setting the length scales and covariance function. Moreover, GPs have to
be monitored for degeneracies over time, which means the algorithm
cannot be run as a black box.
The kernel methods described earlier do not have these problems; they
are nearly plug-and-play. The kernels and the bandwidth selection method
are the only parameters that need to be set. Some motivation for how to
set these parameters is given in appendix C. We suggest setting the
kernel in the y direction to a kernel with infinite support, such as the
Gaussian kernel. For bandwidth selection, it is known that Silverman's rule
of thumb overestimates the bandwidth for non-Gaussian densities. Hence
better performance may be achieved by consistently dividing the computed
bandwidth by a fixed value.
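As a concrete reference, one common form of Silverman's rule of thumb for a one-dimensional Gaussian-kernel KDE is h = 1.06 σ̂ N^(−1/5). The sketch below also includes the fixed division factor suggested above; the names are illustrative:

```python
import statistics

def silverman_bandwidth(data, shrink=1.0):
    """One common form of Silverman's rule of thumb for a 1-D
    Gaussian-kernel KDE, h = 1.06 * sigma * N**(-1/5), optionally
    divided by a fixed `shrink` factor for non-Gaussian densities."""
    n = len(data)
    sigma = statistics.stdev(data)  # sample standard deviation
    return 1.06 * sigma * n ** (-0.2) / shrink
```

Setting `shrink` > 1 implements the fixed-division heuristic mentioned above.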
4 Experiments
We perform three sets of experiments:
1. Exponential problem: a toy Bayesian inference problem.
2. Multimodal problem: a toy Bayesian inference problem, with multiple
modes at some parameter locations.
3. Blowfly problem: inference of parameters in a chaotic ecological system.
The first two are mathematical toy problems that show the correctness of our
proposed algorithms. The goal of the second problem is to illustrate the
differences between parametric and non-parametric ABC methods.
The last experiment is more challenging and allows a view of the
performance in a more realistic setting.
4.1 Exponential Toy Problem
The exponential problem is to estimate the posterior distribution of the rate
parameter of an exponential distribution. The simulator consists
of drawing N = 500 samples from an exponential distribution parametrized by the
rate θ. The only statistic is the mean of the N draws. The observed value
y⋆ = 9.42 was generated using 500 draws from an exponential distribution
with θ⋆ = 0.1 and a fixed random seed. The prior on θ is a Gamma distribution
with parameters α = β = 0.1; note that this is quite a broad prior.
We used a log-normal proposal distribution with σ = 0.1. The exponential
problem has also been used to test various other approaches [25, 42].
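The toy simulator itself is a one-liner. A sketch, assuming Python's standard `random.expovariate`, which is parametrized by the rate as in the problem description; the exact observed value y⋆ = 9.42 depends on the thesis' fixed seed and is not reproduced here:

```python
import random

def exp_simulator(theta, n=500, rng=random):
    """Toy simulator: the mean of n draws from an exponential
    distribution with rate theta (the only summary statistic)."""
    return sum(rng.expovariate(theta) for _ in range(n)) / n
```

At the true setting θ = 0.1 the statistic concentrates around the exponential mean 1/θ = 10.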
Figure 2 shows the convergence of KL-ABC compared to SL-ABC. The
vertical axis shows the total variation distance to the true posterior, i.e.
the integrated absolute difference between the two probability density
functions [22], computed as:
D(f, g) = (1/2) ∫_{−∞}^{+∞} |f(x) − g(x)| dx    (20)
We approximated the integral using a binned approach.
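The binned approximation of equation (20) can be sketched as follows; the names are illustrative, and both histograms are assumed to share the same bins:

```python
def total_variation(f_bins, g_bins, bin_width):
    """Binned approximation of equation (20): half the integrated
    absolute difference between two binned densities."""
    return 0.5 * bin_width * sum(abs(f - g) for f, g in zip(f_bins, g_bins))
```

For normalized densities the result lies in [0, 1], with 0 for identical densities and 1 for densities with disjoint support.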
[Figure: two panels, total variation distance against the number of samples (left) and against the number of simulation calls (right); legend: KL-ABC, SL-ABC.]
Figure 2: Convergence of KL-ABC vs SL-ABC on the exponential problem. For both algorithms S was set to 20. The ε of SL-ABC was set to zero. Bandwidth h was computed using Silverman's rule of thumb. Results are averaged over 10 runs. Shaded areas denote 2 standard deviations.
The results in figure 2 are averaged over 10 runs and show that both
approaches have similar performance. The marginal versions of the algorithms
were used, which means both numerator and denominator are re-estimated
each iteration. KL-ABC has a better approximation after fewer samples;
however, after 10K samples the SL-ABC algorithm has slightly lower bias.
The S parameter, which controls the number of simulations at each location,
was fixed to 20 for both algorithms. The right plot of figure 2 shows
that both algorithms used the same number of simulation calls.
MCMC runs were also performed for the adaptive variants of both algorithms,
ASL-ABC and AKL-ABC. The results are shown in figure 3. The
error-controlling parameter ξ was set to 0.1 for both algorithms. The initial
number of simulations S0 was set to 10 and ∆S to 2. Bandwidths were
estimated using Silverman's rule (equation (35) in appendix A).
After 10K samples AKL performs slightly better on average: D = 0.0408
versus D = 0.0460 for ASL, but the errors are quite close. Therefore it is
more informative to look at the numbers of simulations needed, which are
[Figure: two panels, total variation distance against the number of samples and against the number of simulation calls; legend: AKL, ASL.]
Figure 3: Convergence of AKL-ABC vs ASL-ABC on the exponential problem. For both algorithms S0 was set to 10. The ε of ASL-ABC was set to zero. The ξ was set to 0.1. Results are averaged over 10 runs.
shown in the left subplot of figure 3. It can be seen that AKL does need
more simulations to obtain the 10K samples: on average 272,263 as opposed to
216,897 for ASL, which is approximately 5 extra simulation calls per sample.
The results of the global algorithms PSS-ABC and PKS-ABC are shown
in figure 4. Recall that PSS-ABC is the parametric algorithm and the
global counterpart of ASL-ABC, whereas PKS-ABC is non-parametric and
the global version of AKL-ABC.
The ξ for both algorithms was set to 0.15. For the horizontal kernel, on
the θ axis, the Epanechnikov kernel (equation (30) in appendix A) was used
for both algorithms. PKS-ABC employs a Gaussian kernel in the y direction.
The initial number of simulations was set to 10 for both algorithms. Silverman's
rule was used to set bandwidths. For both algorithms we projected
orthogonally, i.e. no local regression was performed. This is because we want
to show the sensitivity of PSS to outliers; when performing a linear correction
there are fewer outliers and hence the effect is less pronounced.
It can be seen that the PSS algorithm performs quite poorly in terms of
total variation distance. The posterior distribution that it obtains is however
[Figure: two panels, total variation distance and number of simulation calls against the number of samples; legend: PSS orth, PKS orth.]
Figure 4: Convergence of PKS-ABC versus PSS-ABC on the exponential problem. For both algorithms S0 was set to 10, ∆S = 2, and ξ was set to 0.15. Results are averaged over 10 runs.
closer to the true posterior than the total variation distance might imply.
This is illustrated in figure 5. It can be seen that the PSS posterior is
overly dispersed and shifted. The reason for this is illustrated in figure 6:
the kernel regression consistently overestimates the true function. Since
the kernel regression estimate is used as the mean of the Gaussian in
PSS-ABC, it will always overestimate the mean, and as a result the obtained
posterior is shifted. The conditional distributions of both PSS-ABC and
PKS-ABC for θ = 0.1 are also shown. Note that the Gaussian distribution
of PSS-ABC is very flat. This is because there are some projected samples
from θ = 0.002 with y ≈ 500, which cause the computed standard deviation
to increase rapidly. A higher standard deviation leads to a wider posterior.
The PKS-ABC algorithm, which does not employ a Gaussian approxi-
mation, does not suffer from these problems. The conditional distribution
it infers has a heavy tail as can be seen in figure 6. The points projected
from locations close to zero, i.e. θ = 0.002, only cause small bumps in the
distribution. For example in the conditional distribution in figure 6 there is
Figure 5: The posterior distribution obtained by PSS-ABC versus the true posterior of the exponential problem. The vertical line is the true setting θ? = 0.1.
Figure 6: The failure of kernel regression to model the exponential mean function properly. The vertical line is the true setting θ? = 0.1, the horizontal line is the observed value y? = 9.42. The conditional distributions of both PSS and PKS are also shown. These are computed on the same data points and both share the same outliers.
a bump of 0.000113 at the location y = 513, but this does not influence the
mode at y?.3
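This contrast between the parametric Gaussian and the KDE can be illustrated with a small sketch (illustrative numbers only, not the thesis code): a single projected outlier inflates the fitted Gaussian's standard deviation, while the KDE merely gains a distant bump.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

# Simulated outputs near y* = 9.42, plus one projected outlier at y = 513.
y_star = 9.42
ys = np.array([8.9, 9.1, 9.5, 9.8, 10.2, 513.0])

# Parametric route (PSS-style): fit a single Gaussian to all outputs.
mu, sigma = ys.mean(), ys.std(ddof=1)
p_gauss = gaussian_pdf(y_star, mu, sigma)  # flat Gaussian -> tiny density at y*

# Non-parametric route (PKS-style): a Gaussian KDE with a fixed bandwidth.
# The outlier only contributes a small bump near y = 513.
h = 1.0
p_kde = np.mean(gaussian_pdf(y_star, ys, h))

print(sigma)            # standard deviation blown up by the single outlier
print(p_gauss, p_kde)
```

With these numbers the KDE keeps substantial mass at y?, whereas the fitted Gaussian is nearly flat there.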
A notable result is that the number of simulation calls keeps increasing.
We think this is caused by the recalculation of the bandwidth: in general,
the more training points, the smaller the optimal bandwidth.4 As additional
training samples are obtained, the optimal bandwidth will therefore shrink.
With a smaller bandwidth, fewer points are effectively included in an estimate,
and hence the variance of the estimate increases.
This increased variance leads to increased uncertainty and hence acquisition
of additional training points. It should be noted that the right plot in figure
4 is shown on a log-log scale; the rate at which the number of simulation
calls increases is itself decreasing.
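The claim that more training points shrink the bandwidth can be checked directly with Silverman's rule in its common form, h = 0.9 · min(sd, IQR/1.34) · n^(−1/5) (a sketch; the thesis's equation (35) may differ in constants):

```python
import numpy as np

def silverman_bandwidth(x):
    """Silverman's rule of thumb: h = 0.9 * min(sd, IQR/1.34) * n^(-1/5)."""
    n = len(x)
    sd = np.std(x, ddof=1)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    return 0.9 * min(sd, iqr / 1.34) * n ** (-0.2)

# Same underlying spread (a uniform grid on [0, 1]) with increasing n:
# the selected bandwidth shrinks roughly as n^(-1/5).
bandwidths = [silverman_bandwidth(np.linspace(0.0, 1.0, n)) for n in (10, 100, 1000)]
print(bandwidths)
```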
Note that both global methods require far fewer simulation calls than their
local counterparts: after 10K samples PSS has performed 1999 simulation
calls and PKS 485. Compared to the 272207 calls of AKL or the 216927 of
ASL, this is a big gain.
4.2 Multimodal Problem
Multimodal distributions are generally poorly modelled by a Gaussian dis-
tribution. A multimodal distribution will therefore be used to illustrate the
shortcomings of some of the algorithms.
For this experiment, the simulator consists of a mixture of two functions.
The resulting function can be described formally as:
Multimodal(θ) =
    3 + 2 · sin(θ) + ε   with probability ρ
    3 + 6 · sin(θ) + ε   with probability 1 − ρ        (21)

where ε is Gaussian noise with σ = 0.5 and ρ controls the level of multimodality. We used a value of 0.5 for ρ. The observed value y? was set to 7.
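A sketch of this simulator (an illustrative implementation of equation (21), not the thesis code):

```python
import numpy as np

def multimodal_simulator(theta, rho=0.5, sigma=0.5, rng=None):
    """Draw one output from the two-component mixture of equation (21)."""
    if rng is None:
        rng = np.random.default_rng()
    amplitude = 2.0 if rng.random() < rho else 6.0
    return 3.0 + amplitude * np.sin(theta) + rng.normal(0.0, sigma)

rng = np.random.default_rng(1)
samples = np.array([multimodal_simulator(np.pi / 2, rng=rng) for _ in range(5000)])
# At theta = pi/2 the two components sit at 3 + 2 = 5 and 3 + 6 = 9,
# so the output distribution is clearly bimodal with overall mean 7.
print(samples.mean())
```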
3Note that this is not shown in figure 6 because otherwise the figure would become too cluttered.
4This is also reflected in Silverman's rule (equation (35) in appendix A) in the division by the number of training points.
Figure 7: A plot of the multimodal simulator. The gray-shaded areas are 2 standard deviations of the Gaussian noise. The horizontal blue line is the observed value y?. At nearly all parameter locations the conditional distribution has multiple modes.
The prior distribution imposed on θ is a uniform distribution on the interval
[0, 2π]. Finally, the proposal distribution is a Gaussian with σ = 32. A
visualisation of this problem is shown in figure 7. It can be seen that at nearly
every parameter location the conditional distribution has two modes.
For the local methods we will only look at the performance of the adaptive
versions of the algorithms. This is because the results are very similar to the
non-adaptive versions and differ only in the number of simulation calls: if
the S parameter is set sufficiently high, the errors will be similar, because
at every location the non-adaptive algorithms will perform at least as well
as their adaptive counterparts.
Moreover, the adaptive algorithms have a parameter to control the MH
acceptance error, which can be used to show the bias: if the error param-
eters are set to the same value and the algorithms have the same bias, the
resulting error should be the same. If, however, the resulting errors differ,
this means that one algorithm is more biased than the other. The purpose
of this experiment is to show that the parametric methods are biased when
Figure 8: Convergence of ASL-ABC vs AKL-ABC on the multimodal problem, with S0 = 10, ∆S = 2 and ξ = 0.15. Results are averaged over 10 runs.
dealing with non-Gaussian densities.
In figure 8 the performances of ASL-ABC and AKL-ABC are shown. The
initial number of simulations S0 was set to 10, the error threshold ξ = 0.1
and ∆S = 2. The ε of ASL was set to zero. For AKL we used the Gaussian
kernel with Silverman's rule for bandwidth selection.
What immediately catches the eye is that AKL has a much lower final
error: 0.38, versus 0.58 for ASL. This can be explained by the fact that this scenario
is poorly modelled by a Gaussian, so the ASL estimate is biased.
The performance of AKL is better than that of ASL, but still roughly 40% of the
probability mass mismatches the true posterior. This is due to the bandwidth
selection method we employed: Silverman's rule is optimal for an underlying
Gaussian density. Because the density here is multimodal rather than Gaussian,
the estimated bandwidth is too high and leads to suboptimal performance.
In appendix C additional experiments are
performed involving different bandwidth selection methods. The conclusion
is that the Sheather-Jones (SJ) estimator performs significantly better than Silverman's
rule, since the SJ estimator does not assume that the underlying
density is Gaussian. Furthermore, the results using Silverman’s rule can be
enhanced by dividing the bandwidth by a fixed value.
The results of the global methods are shown in figure 10. The kernel for
the kernel weights is the Epanechnikov kernel (equation (30) in appendix A).
The number of initial simulations, S0, of both algorithms was set to 10 and
∆S to 2. The error controlling parameter was set to ξ = 0.15. For PKS we
set the kernel in the y-direction to the Gaussian kernel. As before, bandwidths
are estimated using Silverman's rule.
It can be seen that PKS-ABC outperforms PSS-ABC, which was to be
expected. If the performances of both algorithms are compared to the local
counterparts, it can be seen that they are almost on par with them. The
global methods do have a slightly higher variance, but need far fewer simulator
calls: after 10K samples PKS used 890 simulation calls on average and
PSS 159, compared to 173148 and 200449 for AKL and ASL, respectively.
The big difference between PKS and PSS is due to the KDE: because PKS
fits the multimodality better, it needs more samples to obtain the same MH
error than the biased estimate of PSS does. Put another way, if PSS were
to obtain the same total variation distance, it would require fewer simulation
calls than it does now. Still, both global algorithms show a big gain in
simulation efficiency. However, neither algorithm reaches a point where no
more simulations are required.
The differences in total variation distance are reflected in the inferred
posterior distributions, shown in figure 9. There it can be seen that PSS
misses the multimodality in the posterior, while PKS obtains a decent ap-
proximation.
4.3 Blowfly Problem
The last experiment considers simulation of blowfly populations over time.
The behaviour of such a system is modelled using (discretized) differential
equations. For some parameter settings these equations can lead to chaotic
fluctuations [46].
This dynamical system generates N1, . . . , NT for each time step using the
Figure 9: Posterior distributions of the multimodal problem inferred by PSS (left) and PKS (right).
Figure 10: Convergence of PKS-ABC vs PSS-ABC on the multimodal problem, with S0 = 10, ∆S = 2 and ξ = 0.15. Results are averaged over 10 runs.
Figure 11: Example time series generated by equation (22) using a θ obtained by PKS-ABC. The black line is the observed time series.
following update equation:
Nt+1 = P · Nt−τ · exp[−Nt−τ/N0] · et + Nt · exp[−δ · εt]        (22)

where τ is an integer and et ∼ Gam(σp^−2, σp^−2) and εt ∼ Gam(σd^−2, σd^−2)
are sources of noise. There are six parameters for this problem, i.e. θ =
{log P, log δ, log N0, log σd, log σp, τ}. An example of time series generated
using parameter settings inferred by PKS-ABC is shown in figure 11. For
all but one parameter a Gaussian prior distribution is used; the exception
is the τ parameter, for which a Poisson prior is employed. The proposal
for the parameters that have a Gaussian prior is also Gaussian. On τ
we impose a left/right incremental proposal distribution that proposes either
to keep the current location or to move to one of the direct neighbours of the current
integer. That is:
Left-Right-Proposal(u) =
    u       with probability 0.5
    u − 1   with probability 0.25
    u + 1   with probability 0.25        (23)
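The update equation (22) and the proposal (23) can be sketched as follows. The parameter values are illustrative only (chosen in the rough range reported in the blowfly literature), not the settings inferred in this thesis:

```python
import numpy as np

def blowfly_series(P, delta, N0, sigma_d, sigma_p, tau, T=180, rng=None):
    """Simulate T steps of the discretised blowfly dynamics of equation (22).

    e_t ~ Gam(sigma_p^-2, sigma_p^-2) and eps_t ~ Gam(sigma_d^-2, sigma_d^-2)
    (shape/rate parametrization, so both noise terms have mean one).
    """
    if rng is None:
        rng = np.random.default_rng()
    shape_p, shape_d = sigma_p ** -2, sigma_d ** -2
    N = list(rng.uniform(100.0, 200.0, size=tau + 1))  # arbitrary positive history
    for t in range(tau, tau + T):
        e = rng.gamma(shape_p, 1.0 / shape_p)      # scale = 1 / rate
        eps = rng.gamma(shape_d, 1.0 / shape_d)
        N.append(P * N[t - tau] * np.exp(-N[t - tau] / N0) * e
                 + N[t] * np.exp(-delta * eps))
    return np.array(N[tau + 1:])

def left_right_proposal(u, rng=None):
    """Integer proposal of equation (23): stay w.p. 0.5, step +/-1 w.p. 0.25 each."""
    if rng is None:
        rng = np.random.default_rng()
    return u + int(rng.choice([0, 0, -1, 1]))  # two zeros give probability 0.5

rng = np.random.default_rng(2)
series = blowfly_series(P=29.0, delta=0.18, N0=260.0, sigma_d=0.6,
                        sigma_p=1.1, tau=7, rng=rng)
print(len(series), series.min())
```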
The choice of statistics is very important for any ABC problem [1, 10].
In total 10 statistics are used: the log of the mean of the 25% quantiles
of N/1000 (4 statistics), the mean of all 25% quantiles of the first-order
differences of N/1000 (4 statistics) and the number of peaks in the smoothed
time series for 2 different thresholds (2 statistics). Note that these are the
same statistics as in [25].
Since there is no closed-form likelihood function, the total variation distance
cannot be computed. Therefore, instead of comparing with a true
likelihood, we look at the convergence to
the observed values. For every iteration n a parameter vector θn is added to
the list of samples after accepting or rejecting. For each θn a corresponding
simulator output yn is generated. Using these simulator outputs yn at every
iteration, the convergence to the observed value y? can be measured.
The convergence of the expected value of y to y? can be computed using
the normalized mean square error (NMSE). This error term can conveniently
be computed in an online fashion by using a running average:
NMSE = (1/N) ∑_{n=1}^{N} (yn − y?)² / (y?)²        (24)
where yn is the simulator output associated with the parameter θn at iteration
n. The NMSE per simulation of Rejection ABC, KL-ABC, SL-ABC, PSS-
ABC and PKS-ABC is shown in figure 12.
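The running-average form of equation (24) can be sketched as follows (y? = 10 is a made-up observed value for illustration):

```python
def running_nmse(y_star):
    """Return a closure that maintains the NMSE of equation (24) online."""
    state = {"n": 0, "total": 0.0}
    def update(y_n):
        state["n"] += 1
        state["total"] += (y_n - y_star) ** 2 / y_star ** 2
        return state["total"] / state["n"]
    return update

nmse = running_nmse(y_star=10.0)
errors = [nmse(y) for y in (12.0, 9.0, 10.5, 10.0)]
print(errors)  # running NMSE after each simulator output
```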
For Rejection ABC we set the epsilon to 5. The number of simulations
per parameter location of SL and KL was set to 50. For SL a diagonal
covariance matrix was estimated instead of a full covariance matrix. This
was done because, with 6 parameters, the full covariance matrix has 36 entries
to be estimated. When using 50 training samples to estimate 36 values, the
estimates are very noisy and lead to a poor approximation. The epsilon for SL-ABC
was set to 0.5, as this led to better mixing.
For the KL algorithm we used the parametrized Gaussian kernel, as mentioned
in appendix C, with σ = 0.5. Silverman's rule is employed to
compute bandwidths.
The global methods had ξ set to 0.15 and ∆S = 2. The kernel in the θ
direction to weight the samples in the simulation history is a parametrized
Gaussian kernel with σ = 0.5. The kernel for the local regression is the
tricube kernel, which is formulated as:
k(u) = (70/81)(1 − |u|³)³ 1(|u| ≤ 1)        (25)
where 1(·) is the indicator function.
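The tricube kernel of equation (25) can be implemented and sanity-checked directly; the 70/81 constant makes it integrate to one on [−1, 1]:

```python
import numpy as np

def tricube(u):
    """Tricube kernel of equation (25): (70/81)(1 - |u|^3)^3 on |u| <= 1."""
    u = np.asarray(u, dtype=float)
    return 70.0 / 81.0 * (1.0 - np.abs(u) ** 3) ** 3 * (np.abs(u) <= 1)

print(float(tricube(0.0)))   # peak value 70/81
print(float(tricube(1.5)))   # zero outside the support
# Numerical check that the kernel integrates to one:
grid = np.linspace(-1.0, 1.0, 20001)
print(np.sum(tricube(grid)) * (grid[1] - grid[0]))
```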
However, for the PSS algorithm we used the orthogonal version, so
LOWESS was not used. For the PKS algorithm, on the other hand, we do
perform a local regression. The kernel in the y-direction is an ordinary
Gaussian distribution.
It is shown in figure 12 that PKS-ABC converges to the observed values
orders of magnitude faster than the other algorithms in terms of simulation
calls. It is also interesting to compare the NMSE of the other algorithms at the final
number of simulations reached by PKS: for every statistic, PKS has the lowest NMSE
at that simulation budget.
There are some unexpected results for SL-ABC in particular. Its initial
NMSEs are orders of magnitude higher than those of the other algorithms.
Perhaps surprisingly, its final NMSEs are comparable.
Next to this convergence to the observed values, we can also compare
the posterior distributions inferred by the different algorithms. The inferred
posteriors of the first three statistics are shown in figure 13. The posteriors
shown are based on a single run of each algorithm.
The posterior distributions that rejection sampling produces are broader
than the others, but most posteriors generally share the same mode. Recall
that the global algorithms converged much faster in terms of simulation calls.
Figure 12: The NMSE convergence of different algorithms for different statistics. Each subplot shows the convergence as a function of the number of simulations.
9:  end for
10: S ← S + c
11: c ← ∆S
12: Estimate µ, Σ and µ′, Σ′ using equations (8) and (9)
13: for m ← 1 to M do
14:     µm ∼ N(µ, S−1Σ)
15:     µ′m ∼ N(µ′, S−1Σ′)
16:     Set αm using equation (10)
17: end for
18: τ ← median(α)
19: Set E(α) using equation (12)
20: until E(α) ≤ ξ
21: if U(0, 1) ≤ τ then
22:     return θ′
23: end if
24: return θ
25: end procedure
C Kernel choice and bandwidth selection
The performance of the different kernel methods proposed in this thesis
depends on the kernel and bandwidth selection. In this section a couple of
remarks and suggestions are made about the selection of both.
C.1 Kernel choice
For the kernel in the y-direction it is desirable to have a kernel function with
infinite support, such as the Gaussian kernel. To see why this is the case,
imagine a parameter location that yields simulator values that are very far from
the observed ones. In other words, at θ the simulated y's are far from y?.
Next, a new parameter location θ′ is proposed, which is still not in proximity
of the observed value. If we were to employ a kernel with finite support,
the probabilities π(y? | θ, x) and π(y? | θ′, x) would both be zero. Hence,
the decision whether to reject or accept a proposed parameter location will
depend exclusively on the prior.6 Consequently, the algorithm becomes more
like rejection ABC in terms of simulation costs.
Thus it is desirable to choose only kernels for the y-direction that have
infinite support, for example the Gaussian kernel.
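The argument can be made concrete with a small numeric sketch (all values made up). With a finite-support kernel, both parameter locations receive probability exactly zero, so the acceptance ratio degenerates to 0/0; the Gaussian kernel still ranks the two locations:

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def epanechnikov_kernel(u):
    return 0.75 * (1.0 - u ** 2) * (np.abs(u) <= 1)

y_star, h = 9.42, 20.0
ys_theta = np.array([80.0, 85.0, 90.0])        # simulations at theta, far from y*
ys_theta_prime = np.array([60.0, 65.0, 70.0])  # simulations at theta', also far

results = {}
for kernel in (epanechnikov_kernel, gaussian_kernel):
    p = np.mean(kernel((y_star - ys_theta) / h)) / h
    p_prime = np.mean(kernel((y_star - ys_theta_prime) / h)) / h
    results[kernel.__name__] = (p, p_prime)
    print(kernel.__name__, p, p_prime)
# Epanechnikov: both estimates are zero, so the MH ratio is undefined.
# Gaussian: both are tiny but nonzero, and theta' is correctly preferred.
```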
C.2 Bandwidth selection
In the optimal scenario, the bandwidth selection method deals perfectly with
the scale of the data. However, when using rules of thumb, this is seldom
the case. It should be noted that more sophisticated bandwidth selection
algorithms can be used, such as the Sheather-Jones estimator.
To illustrate the difference, the performance of the KL-ABC algorithm
using two different bandwidth selection methods was tested on two problems.
The problems are the two toy problems from section 4: the exponential
problem and the multimodal problem. For both problems the used kernel
6Provided a symmetric proposal is used, so the two proposal probabilities cancel against each other. If no special case is built for this 0/0 situation, the acceptance probability is undefined, which is even worse.
Figure 16: The total variation distance of KL-ABC using different bandwidth selection methods. Left: the exponential problem; right: the multimodal problem.
was the Gaussian kernel and S was set to 20.
The results are shown in figure 16. As Silverman's rule of thumb assumes
an underlying Gaussian density, it behaves poorly on the multimodal
problem. On the exponential problem, where the underlying density is
Gaussian, both bandwidth selectors have similar performance. It should be
noted, though, that the Sheather-Jones estimator is computationally more expensive:
O(n²), compared to O(n) for Silverman's rule, where n is the number
of training points.
Instead of using this costly estimator, we can also manually make the
Gaussian narrower and thus act as if the bandwidth were smaller than
the one selected by the rule of thumb. This gives the parametrized Gaussian kernel:
kσ(u) = 1/(√(2π) σ) exp[−u²/(2σ²)]        (47)
Note that this is equivalent to multiplying the computed bandwidth by the
factor σ.
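This equivalence is easy to verify numerically (a minimal sketch using the standard KDE form; the sample values are arbitrary):

```python
import numpy as np

def gaussian(u, sigma=1.0):
    """Parametrized Gaussian kernel of equation (47)."""
    return np.exp(-0.5 * (u / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

def kde(x, samples, h, sigma=1.0):
    """Kernel density estimate with kernel k_sigma and bandwidth h."""
    return np.mean(gaussian((x - samples) / h, sigma)) / h

samples = np.array([0.3, 1.2, 2.1, 2.2, 3.0])
x, h, sigma = 1.5, 0.8, 0.5

# Using k_sigma with bandwidth h equals the standard Gaussian kernel
# with the shrunk bandwidth h * sigma.
lhs = kde(x, samples, h, sigma)
rhs = kde(x, samples, h * sigma, 1.0)
print(lhs, rhs)
```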
The influence of different settings of σ on the total variation distance
Figure 17: The effect of smaller sigmas (AKL with σ² = 1, 1/2 and 1/4) when using Silverman's rule of thumb.
is shown in figure 17. For each experiment Silverman's rule of thumb was
used for bandwidth selection. This way, performance similar to that of the more
expensive estimator can be obtained while avoiding its cost. The optimal
value of σ² depends on the problem and can therefore not be set perfectly
beforehand, but we found that a value of 1/4 generally works well.