Non-parametric Approximate Bayesian Computation for Expensive Simulators

Steven Laan
6036031

Master's thesis
42 EC

Master's programme Artificial Intelligence
University of Amsterdam

Supervisor
Ted Meeds

A thesis submitted in conformity with the requirements for the degree of MSc. in Artificial Intelligence

August 26, 2014
Acknowledgements
I would like to express my gratitude to my supervisor Ted Meeds for his
patience and advice every time I stumbled into his office. Furthermore I
would like to thank Max Welling for his helpful suggestions and remarks
along the way.
I would like to thank my thesis committee, Henk, Max and Ted, for agreeing
on a defence date on rather short notice. I would like to thank Henk Zeevat
in particular, for chairing the committee just after his vacation.
Finally, I would like to thank my girlfriend and family for their endless
support.
Abstract
This study investigates new methods for Approximate Bayesian Computation (ABC). The main goal is to create ABC methods that are easy to use by researchers in other fields and perform well with expensive simulators.

Kernel ABC methods have been used to approximate non-Gaussian densities; however, these methods are inefficient in their use of simulations in two ways. First, they do not assess the required number of simulations: for some parameter settings only a small number of simulations is needed. Second, all known methods throw away the entire history of simulations, ignoring the valuable information it holds.

This thesis addresses these problems by introducing three new algorithms: Adaptive KDE Likelihood ABC (AKL-ABC), Projected Synthetic Surrogate ABC (PSS-ABC) and Projected KDE Surrogate ABC (PKS-ABC). The first one, AKL-ABC, performs a kernel density estimation at each parameter location and adaptively chooses the number of simulations at each location. The other two algorithms take advantage of the simulation history to estimate the simulator outcomes at the current parameter location. The difference between them is that PSS-ABC assumes a Gaussian conditional probability and hence is a parametric method, whereas PKS-ABC is non-parametric, performing a kernel density estimate.

Experiments demonstrate the advantages of these algorithms, in particular with respect to the number of simulations needed. Additionally, the flexibility of non-parametric methods is illustrated.
List of Abbreviations
ABC Approximate Bayesian Computation
MCMC Markov chain Monte Carlo
KDE Kernel density estimate
GP Gaussian process
MH Metropolis-Hastings
LOWESS Locally weighted linear regression
(A)SL (Adaptive) Synthetic Likelihood
(A)KL (Adaptive) KDE Likelihood
PSS Projected Synthetic Surrogate
PKS Projected KDE Surrogate
List of Symbols

θ                      Vector of parameters
y⋆                     Vector of observed values
E(·)                   Expected value of a random variable
U(a, b)                Uniform distribution on the interval [a, b]
N(µ, Σ)                Normal distribution with mean vector µ and covariance matrix Σ
Gam(α, β)              Gamma distribution with shape α and rate β
k_h(·)                 A kernel function with bandwidth h
π(x | y)               Probability of x given y
KDE(x⋆ | k, h, w, X)   The kernel density estimate at location x⋆, using kernel k with bandwidth h, weights w, and data points X
The pseudo-code for the resulting Synthetic Likelihood ABC algorithm is
given in appendix B.
2.2.3 Adaptive Synthetic Likelihood
One downside to Synthetic Likelihood ABC (SL-ABC) is that there is no
parameter to set the accuracy of the method. There is of course the number
of simulations at each location, but this number gives no guarantee in terms
of any error criterion. Moreover, the method may be more prone to making
acceptance/rejection errors at some parameter locations than at others.
Hence, sampling the same number of times at each location may not be optimal.
Therefore, instead of retrieving an equal number of samples at each location,
the idea is to initially call the simulator S0 times. Additional simulations
are only performed when needed. More formally: keep track of the probability
of making a Metropolis-Hastings error in accepting/rejecting the sample.
When this error is larger than some threshold value ξ, more simulations are
required.
The first example of an adaptive method is the Adaptive Synthetic Likelihood
algorithm (ASL-ABC) introduced by Meeds and Welling [25]. To be able
to estimate the probability of an acceptance error, an estimate of the
acceptance distribution is needed. This can be done by creating M sets of S
samples and calculating the acceptance probability for each set. To obtain
the samples, the simulator could be called S · (M − 1) extra times. However,
because the synthetic likelihood assumes a normal distribution for each slice,
the variance of the mean is proportional to 1/S. Thus, instead of calling the
simulator, sample means can be drawn directly from a normal distribution.
Each sample mean µm can be used to calculate one acceptance probability
αm using equation (10). With these probabilities αm the acceptance error
can be computed. There are two possibilities for making an error:
1. Reject, while we should accept.
2. Accept, while we should reject.
Hence, the total error, conditioned on the uniform sample u ∼ U(0, 1), becomes:

E_u(α) = (1/M) ∑_m [α_m < u]   if u ≤ τ,
E_u(α) = (1/M) ∑_m [α_m ≥ u]   if u > τ.    (11)
where τ is the decision threshold. The total unconditional error can then be
obtained by integrating out u:

E(α) = ∫ E_u(α) U(0, 1) du    (12)
This can be done by Monte Carlo integration or by using grid values for u.
Equation (12) is known as the mean absolute deviation of p(α), which is
minimized for τ = median(α) [25]. Hence, in the algorithm the decision
threshold is set to the median of the α samples. The pseudocode for the algorithm
is given in appendix B.
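The error estimate described above can be sketched in a few lines. A minimal sketch, assuming a set of acceptance probabilities α_m has already been computed from M sampled means; the function and variable names are illustrative, not from the thesis:

```python
import random
import statistics

def acceptance_error(alphas, tau, n_u=1000, rng=random):
    """Monte Carlo estimate of the unconditional MH error E(alpha)
    (equations (11)-(12)): average the conditional error E_u(alpha)
    over draws u ~ U(0, 1)."""
    m = len(alphas)
    total = 0.0
    for _ in range(n_u):
        u = rng.random()
        if u <= tau:
            # error case 1: reject while we should accept
            total += sum(a < u for a in alphas) / m
        else:
            # error case 2: accept while we should reject
            total += sum(a >= u for a in alphas) / m
    return total / n_u

# Hypothetical acceptance probabilities from M sampled means
alphas = [0.62, 0.55, 0.70, 0.58, 0.65]
tau = statistics.median(alphas)   # the median minimizes the error (MAD)
err = acceptance_error(alphas, tau)
```

If `err` exceeds the threshold ξ, more simulations would be requested, as in the adaptive loop described above.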
2.3 KDE Likelihood ABC
The SL-ABC and ASL-ABC algorithms assume that the likelihood function
at each parameter location is Gaussian, which may not be the case. If
the underlying density has, for example, multiple modes or a heavy tail,
the resulting Gaussian fit can be very poor.
Hence, instead of assuming a certain form for the likelihood, a non-parametric
estimate can be advantageous in certain scenarios [27, 41, 44]. Moreover,
Turner [41] shows that when using a kernel density estimate (KDE) there is
no need for sufficient statistics.
The acceptance probability when using a KDE becomes:

α = min(1, [∑_s k_h(y⋆ − x′_s) · π(θ′) q(θ | θ′)] / [∑_s k_h(y⋆ − x_s) · π(θ) q(θ′ | θ)])    (13)
where k_h(·) is a kernel function with bandwidth h. The resulting Metropolis-
Hastings step is shown in algorithm 3. Note that the bandwidth is re-estimated
at each parameter location. The algorithm performs a simulation at
both the current and the proposed location each iteration; hence this is the
marginal version of the algorithm.
Algorithm 3 The MH-step for KL-ABC.
1:  procedure KL-ABC MH-step(q, θ, π(θ), π(x | θ), y⋆, S, ∆S)
2:      θ′ ∼ q(θ′ | θ)
3:      for s ← 1 to S do
4:          x_s ∼ π(x | θ)        (simulator call)
5:          x′_s ∼ π(x | θ′)      (simulator call)
6:      end for
7:      Set α using equation (13)
8:      if U(0, 1) ≤ α then
9:          return θ′
10:     end if
11:     return θ
12: end procedure
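The acceptance probability of equation (13) can be computed directly from the two sets of simulated values. Below is a minimal sketch for scalar outputs with a Gaussian kernel; `kl_abc_alpha` and its arguments are illustrative names, and the prior and proposal ratios are assumed to be precomputed and folded into a single factor each:

```python
import math

def gauss_kernel(u, h):
    """Gaussian kernel k_h(u) with bandwidth h."""
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2 * math.pi))

def kl_abc_alpha(y_obs, x_cur, x_prop, h, prior_ratio=1.0, q_ratio=1.0):
    """Acceptance probability of equation (13) for scalar outputs:
    the ratio of KDE likelihoods at y_obs under the proposed vs the
    current samples, times the prior ratio pi(theta')/pi(theta) and
    proposal ratio q(theta|theta')/q(theta'|theta)."""
    num = sum(gauss_kernel(y_obs - xp, h) for xp in x_prop)
    den = sum(gauss_kernel(y_obs - xc, h) for xc in x_cur)
    if den == 0.0:
        return 1.0  # degenerate case: no density mass at the current state
    return min(1.0, (num / den) * prior_ratio * q_ratio)
```

Note that, as in the marginal algorithm above, both sums would be recomputed from fresh simulations each iteration.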
2.4 Population Methods
Instead of working with one sample of θ at a time, it is also possible to keep a
population of samples, sometimes called particles. Examples are the population
Monte Carlo (PMC) and sequential Monte Carlo (SMC) approaches, which have
the advantage that the resulting samples are mutually independent. This
is in contrast to single-chain ABC-MCMC methods, where thinning is often
used to obtain more independent samples.1
Another advantage of population methods is that they are quite easily
parallelized and therefore may result in speedups [13].
A disadvantage of population methods is that they require more simulation
calls: for each particle a simulation call is needed at each iteration.
Therefore, with N particles, population methods need on the order of N
additional simulation calls per iteration compared to the single-chain algorithms.
For costly simulators the number of simulation samples needs to be minimized.
For this reason, we will only focus on single-chain methods. It should
be noted that the ideas behind the proposed algorithms could also be viable
for PMC or SMC algorithms.
2.5 Surrogate Methods
For fast simulators, local methods such as SL-ABC and marginal-ABC are
perfectly fine. If, however, simulation is costly, it is desirable to incorporate
the information gained by simulations at all previous locations.
Therefore the question becomes: how can the information of the entire
simulation history be used in the estimate for the current location?
The idea is to aggregate all information of the samples in a surrogate
function. This can be done in two ways: either you model the simulator or
you model the likelihood function with the surrogate. Both approaches will
be described in the next sections.
Surrogate methods have been implemented in different research areas
before [20, 21, 32], but are relatively new in the field of ABC [25, 45].
1 Thinning is the process of only keeping every Nth sample. However, to keep a reasonable number of samples the chain needs to be run for a longer time.
2.5.1 Simulator Surrogate ABC
In the first case, the surrogate function tries to model the simulator, but it
should be computationally cheaper to run. Instead of calling the simulator
for the current θ location directly, first the surrogate function is queried
about its estimate. If the surrogate has small uncertainty about the queried
value, the value returned by the surrogate is treated like it was from the real
simulator. Otherwise, if the uncertainty is too high, additional samples are
obtained from the simulator. After enough samples, the surrogate should be
certain enough in all locations and hence not require any additional simulator
calls.
The first surrogate method of this kind is the Gaussian Process Surrogate
(GPS-ABC) by Meeds and Welling [25]. As the name suggests, they fitted
a Gaussian process to the acquired samples as a surrogate. They showed
that a huge reduction in simulator calls is possible, while retaining similar
performance.
A nice property of modelling the simulator is that previously acquired
samples can be used to train the model. Moreover, the results of different
runs of the algorithm can be merged to obtain a better approximation.
While Gaussian processes are a potent candidate for surrogate functions,
a couple of problems surface in practice. The computational
complexity of GPS-ABC is high. Moreover, to be able to run the algorithm,
the hyperparameters of the Gaussian process need to be set, and these can be
difficult to elicit beforehand.
A downside of GPS-ABC is that for simulators with J output dimensions
the algorithm models J independent GPs instead of one J-dimensional output.
Hence the different outputs are assumed to be independent, which may
not be the case.
For multimodal and other non-Gaussian conditional probabilities, the
Gaussian assumption of GPS-ABC is incorrect, which leads to incorrect
posterior distributions.
2.5.2 Likelihood Surrogate ABC
Instead of modelling the simulator, the (log-)likelihood surface can be
modelled. This is much easier than modelling a high-dimensional simulator,
as the likelihood is a one-dimensional function of θ.
When a decent model of the likelihood surface has been obtained, regions
of unlikely space can be ruled out. That is, no more samples should be
accepted in regions of low likelihood.
Wilkinson first proposed this approach using Gaussian Processes [45].
To discard areas of low likelihood he employs a technique called sequential
history matching [9]. This is a hierarchical approach to rule out regions of
space. At each step a new model is built, modelling only the region that
was labelled plausible by the previous models. This way, each model gives a
more detailed view of the log-likelihood surface.
An advantage of this approach over the GPS-ABC algorithm, which models
the simulator, is that only a single Gaussian process is fitted, whereas
GPS-ABC has a GP for each dimension of the output. On the other hand,
at each sequential history-matching step a new GP is fit and hence needs to be tuned.
A downside of this approach is that the resulting likelihood surface is
tightly connected to the parameter settings of the experiment. For example,
changing the error criterion ε changes the entire likelihood surface, and hence
the results of one experiment are not directly reusable in another.
3 Kernel ABC Methods
In this section we propose three new algorithms. First, however, we give
short descriptions of both the existing algorithms and the newly proposed
ones, together with a table that provides an overview of the properties of
the various algorithms.
3.1 Overview of ABC methods
A short description of each algorithm follows; the ones marked with an
asterisk (*) are proposed later in this thesis:
Synthetic Likelihood assumes a Gaussian conditional distribution, which
is estimated using a fixed number of local samples.

Adaptive Synthetic Likelihood assumes a Gaussian conditional distribution,
which is estimated using an adaptively chosen number of local samples.

KDE Likelihood approximates the conditional distribution using a KDE
based on a fixed number of local samples.

Adaptive KDE Likelihood* approximates the conditional distribution using
a KDE based on an adaptively chosen number of local samples.

Projected Synthetic Surrogate* assumes a Gaussian conditional distribution,
estimated using a weighted estimate of mean and variance based
on all samples from the simulation history, which are projected
onto the current parameter location.

Projected KDE Surrogate* approximates the conditional distribution using
a weighted KDE of projected samples from the simulation history.
There are different properties that ABC algorithms can have. We list the
existing algorithms as well as the ones proposed later in table 1.
For reading convenience the abbreviations are stated in the "Abbreviation"
column. The column labelled "Local" states whether the algorithm
is local or global in nature: local algorithms only use samples at the current
parameter location, whereas global methods incorporate the samples from other
locations as well. The "Parametric" column states whether the algorithm
makes assumptions about the underlying structure. Finally, whether the
algorithm adapts the number of samples to the parameter location is stated
in the last column, labelled "Adaptive".
Algorithm Abbreviation Local Parametric Adaptive
Synthetic Likelihood SL Yes Yes No
Adaptive Synthetic Likelihood ASL Yes Yes Yes
KDE Likelihood KL Yes No No
Adaptive KDE Likelihood AKL Yes No Yes
Projected Synthetic Surrogate PSS No Yes Yes
Projected KDE Surrogate PKS No No Yes
Table 1: Different ABC algorithms and their properties.
3.2 Adaptive KDE Likelihood ABC
In the same way the adaptive synthetic likelihood differs from the original
synthetic likelihood, we built an adaptive version of the KDE Likelihood
algorithm.
As before, a distribution over acceptance probabilities is needed. Hence
we need the variance of the estimator.
One solution for this problem is to just get additional sets of S simula-
tions, but this is too costly and other methods are preferred.
The first solution is to use well-known asymptotic results for the variance
of a KDE. The approximation for the variance is [14]:

σ(x) = √( π(x) / (N h) · ∫ k(u)² du )    (14)

where π(x) is the KDE at location x, N is the number of training points,
h is the bandwidth, and the integral is known as the roughness of the kernel,
which can be computed analytically for the commonly used kernels.
When the variance has been computed, several samples from the normal
distribution N (µ(x), σ(x)) can be drawn to create an acceptance distribu-
tion, where µ(x) = π(x) is the KDE at location x.
There are, however, two problems with this approach. The asymptotic
theory from which equation (14) is derived assumes that N → ∞, whereas we
are in a regime where we specifically want to limit the number of simulation
calls. Hence equation (14) is only valid for large values of N.
The other problem is that we are now introducing a parametric estimate
into the non-parametric algorithm. The goal was to have as few assumptions
as possible, so any methods that do not assume a Gaussian likelihood are
preferred.
One such method is bootstrapping. A bootstrap sample of the simulated
values is taken and a new kernel density estimate is performed. This new
KDE yields another acceptance probability. Hence, this bootstrap process
can be repeated M times to obtain a distribution over acceptance probabilities.
This is the approach used in our new Adaptive KDE Likelihood
(AKL-ABC) algorithm.
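The bootstrap step can be sketched as follows; `alpha_fn` is a hypothetical callback that computes the acceptance probability of equation (13) from one pair of resampled simulation sets, and the names are illustrative:

```python
import random

def bootstrap_alphas(x_cur, x_prop, alpha_fn, M=25, rng=random):
    """Resample the simulated values with replacement M times and
    recompute the acceptance probability on each bootstrap replicate,
    yielding a distribution over acceptance probabilities."""
    S = len(x_cur)
    alphas = []
    for _ in range(M):
        xb = [rng.choice(x_cur) for _ in range(S)]    # bootstrap x_1..x_S
        xpb = [rng.choice(x_prop) for _ in range(S)]  # bootstrap x'_1..x'_S
        alphas.append(alpha_fn(xb, xpb))
    return alphas
```

The resulting list of α_m values would then feed the same error estimate E(α) and median threshold τ as in ASL-ABC.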
The pseudocode of this AKL-ABC is very similar to that of ASL-ABC.
The only difference is how the acceptance probabilities are estimated. The
pseudocode is shown in algorithm 4.
3.3 Kernel Surrogate Methods
We propose two new surrogate methods that aim to address the shortcomings
of the Gaussian process surrogate method described in section 2.5.1.
The reason we chose to model the simulator, instead of the likelihood
function, is that when modelling the simulator, the results for one parameter
setting of the algorithm can be reused for a different setting, as the
modelled simulator remains the same. This property is especially desirable when
working with expensive simulators.
When modelling the log-likelihood this is not necessarily the case, as the
likelihood surface is tightly connected to, for example, the ε parameter.
9:      end for
10:     S ← S + c
11:     c ← ∆S
12:     for m ← 1 to M do
13:         Obtain bootstrap samples of x_1, …, x_S and x′_1, …, x′_S
14:         Set α_m using equation (13)
15:     end for
16:     τ ← median(α)
17:     Set E(α) using equation (12)
18:  until E(α) ≤ ξ
19:  if U(0, 1) ≤ τ then
20:      return θ′
21:  end if
22:  return θ
23: end procedure
3.3.1 Projected Synthetic Surrogate
The GPS-ABC algorithm uses a Gaussian process as a surrogate function.
This GP provides for each parameter setting a mean and standard deviation.
The main idea behind Projected Synthetic Surrogate ABC (PSS-ABC) is
that we can compute these statistics using kernel regression.
The computation of the mean and standard deviation using kernel regression
is done using equations (15) and (16):

f(θ⋆) = µ(θ⋆) = [∑_{n=1}^{N} k_h(θ_n − θ⋆) · y_n] / [∑_{n=1}^{N} k_h(θ_n − θ⋆)]    (15)
[Figure: two panels plotting y against θ; curves: orthogonal, linear corrected, true density.]
Figure 1: Linear corrected projection versus orthogonal projection with different numbers of training points.
σ²(θ⋆) = [∑_{n=1}^{N} k_h(θ_n − θ⋆) · (y_n − µ(θ⋆))²] / [∑_{n=1}^{N} k_h(θ_n − θ⋆)]    (16)

where θ⋆ is the proposed parameter location and θ_n is the nth parameter
location from the simulation history, with corresponding simulator output y_n.
The derivation of equation (15) can be found in appendix A.4.
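Equations (15) and (16) amount to a kernel-weighted mean and variance. A minimal sketch for scalar θ and y, using the tricube kernel as the weight function; all names are illustrative:

```python
def tricube(u):
    """Tricube kernel, zero outside |u| < 1."""
    a = abs(u)
    return (1 - a ** 3) ** 3 if a < 1 else 0.0

def kernel_regression(theta_star, thetas, ys, h):
    """Nadaraya-Watson style estimate of equations (15) and (16):
    kernel-weighted mean and variance of the simulator outputs ys
    around the proposed location theta_star."""
    w = [tricube((t - theta_star) / h) for t in thetas]
    sw = sum(w)
    mu = sum(wi * yi for wi, yi in zip(w, ys)) / sw
    var = sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, ys)) / sw
    return mu, var
```

The bandwidth h plays the same role as in the KDE: it controls how far from θ⋆ history points still contribute weight.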
Kernel regression is a local constant estimator. This can be viewed as
an orthogonal projection of neighbouring points onto the vertical line at
the parameter location θ. A weighted average of the projected points is
calculated using the kernel weights. This is the kernel regression estimate
f(θ) for that location. However, projecting orthogonally can lead to overly
dispersed estimates, which is illustrated in figure 1.
Thus instead of always projecting orthogonally, we first perform a local
linear regression [7] and then project along this line, or hyperplane in higher
dimensions.
This idea of linear correction has been used in the ABC framework before
[4, 5]. The difference is that, up to now, it has been used as a post-processing
step, whereas here it is an integral part of the algorithm. Moreover, because
it is a post-processing step there, those methods project onto the θ axis,
whereas we project onto a line perpendicular to it.
The locally weighted linear regression (LOWESS) [7] assumes the following
regression model:

y = Θβ + ε    (17)
where ε is a vector of zero-mean Gaussian noise, Θ is the N by D design matrix
consisting of N data points θ_n with dimension D, and β is the D-dimensional
vector of regression coefficients. Note that we set the zeroth entry of each
data point to 1, hence β_0 is the intercept.
To compute the estimated value y for a proposed location θ⋆, locally
weighted regression first computes a kernel weight w_n for each data point θ_n:

w_n = k_h(θ_n − θ⋆)    (18)

Cleveland suggests using the tricube kernel for these weights [7]. Then the
following system of equations needs to be solved to obtain the regression
coefficients:
⎡ ∑w         ∑w Θ_1        ⋯   ∑w Θ_D       ⎤         ⎡ ∑w y       ⎤
⎢ ∑w Θ_1     ∑w Θ_1²       ⋯   ∑w Θ_1 Θ_D   ⎥ · β  =  ⎢ ∑w y Θ_1   ⎥    (19)
⎢ ⋮           ⋮             ⋱   ⋮            ⎥         ⎢ ⋮           ⎥
⎣ ∑w Θ_D     ∑w Θ_D Θ_1    ⋯   ∑w Θ_D²      ⎦         ⎣ ∑w y Θ_D   ⎦

where w is the vector of kernel weights for the N training points, the sums
run over the training points, and Θ_d denotes the dth column vector of Θ,
containing the dth entry of every training point, i.e. Θ_2 = [Θ_12, Θ_22, …, Θ_N2].
The resulting solution2 for β is the vector of regression coefficients of the
linear equation that describes the trend at location θ⋆. This β therefore
provides a hyperplane along which the data can be projected.
2 The system can easily be solved by most linear algebra packages.
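For a single parameter dimension (D = 1) the weighted normal equations (19) reduce to a 2 × 2 system with a closed-form solution. A minimal sketch, using the tricube kernel for the weights as suggested above; the names are illustrative:

```python
def tricube(u):
    """Tricube kernel, zero outside |u| < 1."""
    a = abs(u)
    return (1 - a ** 3) ** 3 if a < 1 else 0.0

def lowess_line(theta_star, thetas, ys, h):
    """Solve the weighted normal equations (19) for D = 1, giving the
    intercept b0 and slope b1 of the local trend at theta_star."""
    w = [tricube((t - theta_star) / h) for t in thetas]
    sw = sum(w)
    swt = sum(wi * t for wi, t in zip(w, thetas))
    swtt = sum(wi * t * t for wi, t in zip(w, thetas))
    swy = sum(wi * y for wi, y in zip(w, ys))
    swty = sum(wi * t * y for wi, t, y in zip(w, thetas, ys))
    # 2x2 system: [sw, swt; swt, swtt] [b0, b1]^T = [swy, swty]^T
    det = sw * swtt - swt * swt
    b1 = (sw * swty - swt * swy) / det
    b0 = (swy - b1 * swt) / sw
    return b0, b1
```

The returned slope defines the line along which neighbouring samples are projected onto the θ⋆ slice.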
When β is the zero vector, the projection is along a flat hyperplane, i.e.
orthogonally onto the θ⋆ slice.
There are border cases where the linear correction can become overconfident,
for example when there are few samples near the proposed location. In the
extreme case the regression is based on only two samples, so the fitted
line simply passes through these two points, which can be very different
from the actual trend.
To overcome this problem, the algorithm needs to recognise these situations,
or at least assess the uncertainty in them. An elegant solution is to
use smoothed bootstrapping [11]. Smoothed bootstrapping resamples from
a set of samples but, unlike ordinary bootstrapping, adds (Gaussian) noise
to the newly obtained samples. The variance of this noise depends on the
size of the set of original samples, usually σ = 1/√N. As a result there
is more variance in resampled values when there are few samples, and hence
more uncertainty in the local regression. Because of this noise, the computed
hyperplanes will vary much more when the pool of samples is small.
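A smoothed bootstrap replicate as described here can be sketched in a few lines; the σ = 1/√N noise scale follows the text, and the names are illustrative:

```python
import math
import random

def smoothed_bootstrap(samples, rng=random):
    """Resample with replacement and add Gaussian noise with standard
    deviation 1/sqrt(N), so replicates vary more when data are scarce."""
    n = len(samples)
    sigma = 1.0 / math.sqrt(n)
    return [rng.choice(samples) + rng.gauss(0.0, sigma) for _ in range(n)]
```

Each replicate would then drive one local regression, so the spread of the fitted hyperplanes reflects the scarcity of nearby samples.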
The pseudocode for PSS-ABC is shown in algorithm 5. Note that a diagonal
covariance matrix is computed, which is equivalent to assuming independence
between the output variables. This is because a full-rank covariance
matrix is much harder to estimate with few points.
The acquisition of new training points at line 21 happens at either the
current parameter location θ or the proposed parameter location θ′, each with
probability 0.5. More sophisticated methods could be implemented, but this
simple procedure worked fine in our experiments.
3.4 Projected Kernel Density Estimated Surrogate
Instead of assuming that the conditional distribution π(y | θ) is Gaussian,
this conditional distribution can also be approximated using a kernel density
estimate.
As with PSS-ABC, the points from the simulation history are projected
onto the θ⋆ slice. Then, instead of computing the weighted mean and variance
of a Gaussian, a weighted kernel density estimate is performed.
Algorithm 5 The MH-step for PSS-ABC.
1:  procedure PSS-ABC MH-step(q, θ, π(θ), π(x | θ), y⋆, ∆S, M, ξ)
2:      θ′ ∼ q(θ′ | θ)
3:      repeat
4:          Compute weights using equation (18)
5:          for m ← 1 to M do
6:              Get smoothed bootstrap samples of y
7:              for j ← 1 to J do
8:                  Compute β_j and β′_j using a local regression at θ and θ′
9:                  Project bootstrapped y_j along hyperplane β_j
10:                 Compute µ_j, σ_j using equations (15) and (16)
11:                 Project bootstrapped y_j along hyperplane β′_j
12:                 Compute µ′_j, σ′_j using equations (15) and (16)
⋮
16:             Set α_m using equation (10)
17:         end for
18:         τ ← median(α)
19:         Set E(α) using equation (12)
20:         if E(α) > ξ then
21:             Acquire ∆S new training points
22:         end if
23:     until E(α) ≤ ξ
24:     if U(0, 1) ≤ τ then
25:         return θ′
26:     end if
27:     return θ
28: end procedure
The resulting MH-step is shown in algorithm 6.
Algorithm 6 The MH-step for PKS-ABC.
1:  procedure PKS-ABC MH-step(q, θ, ∆S, M, ξ, y⋆)
2:      θ′ ∼ q(θ′ | θ)
3:      repeat
4:          Compute weights using equation (18)
5:          for m ← 1 to M do
6:              Get smoothed bootstrap samples of y
7:              p ← 1, p′ ← 1
8:              for j ← 1 to J do
9:                  Compute β_j and β′_j using a local regression at θ and θ′
10:                 Z_j ← projection of bootstrapped y_j along hyperplane β_j
11:                 Z′_j ← projection of bootstrapped y_j along hyperplane β′_j
12:                 p ← p · KDE(y⋆_j | k, h, Z_j)
13:                 p′ ← p′ · KDE(y⋆_j | k, h, Z′_j)
14:             end for
15:             α_m ← min(1, [q(θ | θ′) π(θ′) p′] / [q(θ′ | θ) π(θ) p])
16:         end for
17:         τ ← median(α)
18:         Set E(α) using equation (12)
19:         if E(α) > ξ then
20:             Acquire ∆S new training points
21:         end if
22:     until E(α) ≤ ξ
23:     if U(0, 1) ≤ τ then
24:         return θ′
25:     end if
26:     return θ
27: end procedure
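Lines 12-13 of algorithm 6 multiply one-dimensional KDE values across the output dimensions. A minimal sketch with a Gaussian kernel and a shared bandwidth h; the names are illustrative, and `projected` stands for the per-dimension projected sample sets Z_j:

```python
import math

def gauss_kde(x_star, data, h):
    """Kernel density estimate KDE(x* | k, h, X) with a Gaussian kernel."""
    return sum(math.exp(-0.5 * ((x_star - x) / h) ** 2)
               for x in data) / (len(data) * h * math.sqrt(2 * math.pi))

def pks_likelihood(y_obs, projected, h):
    """Product over output dimensions j of KDE(y*_j | k, h, Z_j),
    as in lines 12-13 of algorithm 6."""
    p = 1.0
    for y_j, z_j in zip(y_obs, projected):
        p *= gauss_kde(y_j, z_j, h)
    return p
```

The product across dimensions mirrors the independence assumption already made by the diagonal covariance in PSS-ABC.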
3.5 Ease of Use
The proposed methods aim to be as simple to use as possible. In contrast
to Gaussian processes, there is no difficult hyperparameter tuning task,
such as setting the length scales and covariance function. Moreover, GPs have to
be monitored for degeneracies over time, which means the algorithm
cannot be run as a black box.
The kernel methods described earlier do not have these problems; they
are nearly plug-and-play. The kernels and the bandwidth selection method
are the only parameters that need to be set. Some motivation for how to
set these parameters is given in appendix C. We suggest setting the
kernel in the y direction to a kernel with infinite support, such as the
Gaussian kernel. For bandwidth selection, it is known that Silverman's rule
of thumb overestimates the bandwidth for non-Gaussian densities. Hence
better performance may be achieved by consistently dividing the computed
bandwidth by a fixed value.
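As a concrete reference, one common form of Silverman's rule of thumb for a one-dimensional Gaussian-kernel KDE is h = 1.06 σ̂ N^(−1/5). The sketch below also includes the fixed division factor suggested above; the names are illustrative:

```python
import statistics

def silverman_bandwidth(data, shrink=1.0):
    """One common form of Silverman's rule of thumb for a 1-D
    Gaussian-kernel KDE, h = 1.06 * sigma * N**(-1/5), optionally
    divided by a fixed `shrink` factor for non-Gaussian densities."""
    n = len(data)
    sigma = statistics.stdev(data)  # sample standard deviation
    return 1.06 * sigma * n ** (-0.2) / shrink
```

Setting `shrink` > 1 implements the fixed-division heuristic mentioned above.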
4 Experiments
We perform three sets of experiments:
1. Exponential problem: a toy Bayesian inference problem.
2. Multimodal problem: a toy Bayesian inference problem, with multiple
modes at some parameter locations.
3. Blowfly problem: inference of parameters in a chaotic ecological system.
The first two are mathematical toy problems that show the correctness of our
proposed algorithms. The goal of the second problem is to illustrate the
differences between parametric and non-parametric ABC methods.
The last experiment is more challenging and allows a view of the
performance in a more realistic setting.
4.1 Exponential Toy Problem
The exponential problem is to estimate the posterior distribution of the rate
parameter of an exponential distribution. The simulator consists
of drawing N = 500 samples from an exponential distribution parametrized by the
rate θ. The only statistic is the mean of the N draws. The observed value
y⋆ = 9.42 was generated using 500 draws from an exponential distribution
with θ⋆ = 0.1 and a fixed random seed. The prior on θ is a Gamma distribution
with parameters α = β = 0.1; note that this is quite a broad prior.
We used a log-normal proposal distribution with σ = 0.1. The exponential
problem has also been used to test various other approaches [25, 42].
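The toy simulator itself is a one-liner. A sketch, assuming Python's standard `random.expovariate`, which is parametrized by the rate as in the problem description; the exact observed value y⋆ = 9.42 depends on the thesis' fixed seed and is not reproduced here:

```python
import random

def exp_simulator(theta, n=500, rng=random):
    """Toy simulator: the mean of n draws from an exponential
    distribution with rate theta (the only summary statistic)."""
    return sum(rng.expovariate(theta) for _ in range(n)) / n
```

At the true setting θ = 0.1 the statistic concentrates around the exponential mean 1/θ = 10.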
Figure 2 shows the convergence of KL-ABC compared to SL-ABC. The
vertical axis shows the total variation distance to the true posterior, i.e.
the integrated absolute difference between the two probability density
functions [22], computed as:
D(f, g) = (1/2) ∫_{−∞}^{+∞} |f(x) − g(x)| dx    (20)
We approximated the integral using a binned approach.
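The binned approximation of equation (20) can be sketched as follows; the names are illustrative, and both histograms are assumed to share the same bins:

```python
def total_variation(f_bins, g_bins, bin_width):
    """Binned approximation of equation (20): half the integrated
    absolute difference between two binned densities."""
    return 0.5 * bin_width * sum(abs(f - g) for f, g in zip(f_bins, g_bins))
```

For normalized densities the result lies in [0, 1], with 0 for identical densities and 1 for densities with disjoint support.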
[Figure: two panels, total variation distance against the number of samples (left) and against the number of simulation calls (right); legend: KL-ABC, SL-ABC.]
Figure 2: Convergence of KL-ABC vs SL-ABC on the exponential problem. For both algorithms S was set to 20. The ε of SL-ABC was set to zero. Bandwidth h was computed using Silverman's rule of thumb. Results are averaged over 10 runs. Shaded areas denote 2 standard deviations.
The results in figure 2 are averaged over 10 runs and show that both
approaches have similar performance. The marginal versions of the algorithms
were used, which means both numerator and denominator are re-estimated
each iteration. KL-ABC has a better approximation after fewer samples;
however, after 10K samples the SL-ABC algorithm has slightly lower bias.
The S parameter, which controls the number of simulations at each location,
was fixed to 20 for both algorithms. The right plot of figure 2 shows
that both algorithms used the same number of simulation calls.
MCMC runs were also performed for the adaptive variants of both algorithms,
ASL-ABC and AKL-ABC. The results are shown in figure 3. The
error-controlling parameter ξ was set to 0.1 for both algorithms. The initial
number of simulations S0 was set to 10 and ∆S to 2. Bandwidths were
estimated using Silverman's rule (equation (35) in appendix A).
After 10K samples AKL performs slightly better on average: D = 0.0408
versus D = 0.0460 for ASL, but the errors are quite close. Therefore it is
more informative to look at the numbers of simulations needed, which are
[Figure: two panels, total variation distance against the number of samples and against the number of simulation calls; legend: AKL, ASL.]
Figure 3: Convergence of AKL-ABC vs ASL-ABC on the exponential problem. For both algorithms S0 was set to 10. The ε of ASL-ABC was set to zero. The ξ was set to 0.1. Results are averaged over 10 runs.
shown in the left subplot of figure 3. It can be seen that AKL does need
more simulations to obtain the 10K samples: on average 272,263 as opposed to
216,897 for ASL, which is approximately 5 extra simulation calls per sample.
The results of the global algorithms PSS-ABC and PKS-ABC are shown
in figure 4. Recall that PSS-ABC is the parametric algorithm and the
global counterpart of ASL-ABC, whereas PKS-ABC is non-parametric and
the global version of AKL-ABC.
The ξ for both algorithms was set to 0.15. For the horizontal kernel, on
the θ axis, the Epanechnikov kernel (equation (30) in appendix A) was used
for both algorithms. PKS-ABC employs a Gaussian kernel in the y direction.
The initial number of simulations was set to 10 for both algorithms. Silverman's
rule was used to set bandwidths. For both algorithms we projected
orthogonally, i.e. no local regression was performed. This is because we want
to show the sensitivity of PSS to outliers; when performing a linear correction
there are fewer outliers and hence the effect is less pronounced.
It can be seen that the PSS algorithm performs quite poorly in terms of
total variation distance. The posterior distribution that it obtains is however
[Figure: two panels, total variation distance and number of simulation calls against the number of samples; legend: PSS orth, PKS orth.]
Figure 4: Convergence of PKS-ABC versus PSS-ABC on the exponential problem. For both algorithms S0 was set to 10, ∆S = 2, and ξ was set to 0.15. Results are averaged over 10 runs.
closer to the true posterior than the total variation distance might imply.
This is illustrated in figure 5. It can be seen that the PSS posterior is
overly dispersed and shifted. The reason for this is illustrated in figure 6:
the kernel regression consistently overestimates the true function. Since
the kernel regression estimate is used as the mean of the Gaussian in
PSS-ABC, it will always overestimate the mean, and as a result the obtained
posterior is shifted. The conditional distributions of both PSS-ABC and
PKS-ABC for θ = 0.1 are also shown. Note that the Gaussian distribution
of PSS-ABC is very flat. This is because there are some projected samples
from θ = 0.002 with y ≈ 500, which cause the computed standard deviation
to increase rapidly. A higher standard deviation leads to a wider posterior.
The PKS-ABC algorithm, which does not employ a Gaussian approxi-
mation, does not suffer from these problems. The conditional distribution
it infers has a heavy tail as can be seen in figure 6. The points projected
from locations close to zero, i.e. θ = 0.002, only cause small bumps in the
distribution. For example in the conditional distribution in figure 6 there is
Figure 5: The posterior distribution obtained by PSS-ABC versus the true posterior of the exponential problem. The vertical line is the true setting θ? = 0.1.
Figure 6: The failure of kernel regression to model the exponential mean function properly. The vertical line is the true setting θ? = 0.1, the horizontal line is the observed value y? = 9.42. The conditional distributions of both PSS and PKS are also shown. These are computed on the same data points and both share the same outliers.
a bump of 0.000113 at the location y = 513, but this does not influence the
mode at y?.3
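This contrast between the parametric Gaussian and the KDE can be illustrated with a small sketch (illustrative numbers only, not the thesis code): a single projected outlier inflates the fitted Gaussian's standard deviation, while the KDE merely gains a distant bump.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

# Simulated outputs near y* = 9.42, plus one projected outlier at y = 513.
y_star = 9.42
ys = np.array([8.9, 9.1, 9.5, 9.8, 10.2, 513.0])

# Parametric route (PSS-style): fit a single Gaussian to all outputs.
mu, sigma = ys.mean(), ys.std(ddof=1)
p_gauss = gaussian_pdf(y_star, mu, sigma)  # flat Gaussian -> tiny density at y*

# Non-parametric route (PKS-style): a Gaussian KDE with a fixed bandwidth.
# The outlier only contributes a small bump near y = 513.
h = 1.0
p_kde = np.mean(gaussian_pdf(y_star, ys, h))

print(sigma)            # standard deviation blown up by the single outlier
print(p_gauss, p_kde)
```

With these numbers the KDE keeps substantial mass at y?, whereas the fitted Gaussian is nearly flat there.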
A notable result is that the number of simulation calls keeps increasing.
We think this is caused by the recalculation of the bandwidth: in general,
the more training points, the smaller the optimal bandwidth.4 As additional
training samples are obtained, the optimal bandwidth will therefore shrink.
With a smaller bandwidth, fewer points are effectively included in an estimate,
and hence the variance of the estimate increases.
This increased variance leads to increased uncertainty and hence acquisition
of additional training points. It should be noted that the right plot in figure
4 is shown on a log-log scale; the rate at which the number of simulation
calls increases is itself decreasing.
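The claim that more training points shrink the bandwidth can be checked directly with Silverman's rule in its common form, h = 0.9 · min(sd, IQR/1.34) · n^(−1/5) (a sketch; the thesis's equation (35) may differ in constants):

```python
import numpy as np

def silverman_bandwidth(x):
    """Silverman's rule of thumb: h = 0.9 * min(sd, IQR/1.34) * n^(-1/5)."""
    n = len(x)
    sd = np.std(x, ddof=1)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    return 0.9 * min(sd, iqr / 1.34) * n ** (-0.2)

# Same underlying spread (a uniform grid on [0, 1]) with increasing n:
# the selected bandwidth shrinks roughly as n^(-1/5).
bandwidths = [silverman_bandwidth(np.linspace(0.0, 1.0, n)) for n in (10, 100, 1000)]
print(bandwidths)
```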
Note that both global methods require far fewer simulation calls than their
local counterparts: after 10K samples PSS has performed 1999 simulation
calls and PKS 485. Compared to the 272207 calls of AKL or the 216927 of
ASL, this is a big gain.
4.2 Multimodal Problem
Multimodal distributions are generally poorly modelled by a Gaussian dis-
tribution. A multimodal distribution will therefore be used to illustrate the
shortcomings of some of the algorithms.
For this experiment, the simulator consists of a mixture of two functions.
The resulting function can be described formally as:
Multimodal(θ) =
    3 + 2 · sin(θ) + ε   with probability ρ
    3 + 6 · sin(θ) + ε   with probability 1 − ρ        (21)

where ε is Gaussian noise with σ = 0.5 and ρ controls the level of multimodality. We used a value of 0.5 for ρ. The observed value y? was set to 7.
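A sketch of this simulator (an illustrative implementation of equation (21), not the thesis code):

```python
import numpy as np

def multimodal_simulator(theta, rho=0.5, sigma=0.5, rng=None):
    """Draw one output from the two-component mixture of equation (21)."""
    if rng is None:
        rng = np.random.default_rng()
    amplitude = 2.0 if rng.random() < rho else 6.0
    return 3.0 + amplitude * np.sin(theta) + rng.normal(0.0, sigma)

rng = np.random.default_rng(1)
samples = np.array([multimodal_simulator(np.pi / 2, rng=rng) for _ in range(5000)])
# At theta = pi/2 the two components sit at 3 + 2 = 5 and 3 + 6 = 9,
# so the output distribution is clearly bimodal with overall mean 7.
print(samples.mean())
```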
3Note that this is not shown in figure 6 because otherwise the figure would become too cluttered.
4This is also reflected in Silverman's rule (equation (35) in appendix A) in the division by the number of training points.
Figure 7: A plot of the multimodal simulator. The gray-shaded areas are 2 standard deviations of the Gaussian noise. The horizontal blue line is the observed value y?. At nearly all parameter locations the conditional distribution has multiple modes.
The prior distribution imposed on θ is a uniform distribution on the interval
[0, 2π]. Finally, the proposal distribution is a Gaussian with σ = 32. A
visualisation of this problem is shown in figure 7. It can be seen that at nearly
every parameter location the conditional distribution has two modes.
For the local methods we will only look at the performance of the adaptive
versions of the algorithms. This is because the results are very similar to the
non-adaptive versions and differ only in the number of simulation calls: if
the S parameter is set sufficiently high, the errors will be similar, because
at every location the non-adaptive algorithms will perform at least as well
as their adaptive counterparts.
Moreover, the adaptive algorithms have a parameter to control the MH
acceptance error, which can be used to show the bias: if the error param-
eters are set to the same value and the algorithms have the same bias, the
resulting error should be the same. If, however, the resulting errors differ,
this means that one algorithm is more biased than the other. The purpose
of this experiment is to show that the parametric methods are biased when
Figure 8: Convergence of ASL-ABC vs AKL-ABC on the multimodal problem, with S0 = 10, ∆S = 2 and ξ = 0.15. Results are averaged over 10 runs.
dealing with non-Gaussian densities.
In figure 8 the performances of ASL-ABC and AKL-ABC are shown. The
initial number of simulations S0 was set to 10, the error threshold ξ = 0.1
and ∆S = 2. The ε of ASL was set to zero. For AKL we used the Gaussian
kernel with Silverman's rule for bandwidth selection.
What immediately catches the eye is that AKL has a much lower final
error: 0.38, versus 0.58 for ASL. This can be explained by the fact that this scenario
is poorly modelled by a Gaussian, so the ASL estimate is biased.
The performance of AKL is better than that of ASL, but still roughly 40% of the
probability mass mismatches the true posterior. This is due to the bandwidth
selection method we employed: Silverman's rule is optimal for an underlying
Gaussian density. Because the density here is multimodal rather than Gaussian,
the estimated bandwidth is too high and leads to suboptimal performance.
In appendix C additional experiments are
performed involving different bandwidth selection methods. The conclusion
is that the Sheather-Jones (SJ) estimator performs significantly better than Silverman's
rule, since the SJ estimator does not assume that the underlying
density is Gaussian. Furthermore, the results using Silverman’s rule can be
enhanced by dividing the bandwidth by a fixed value.
The results of the global methods are shown in figure 10. The kernel for
the kernel weights is the Epanechnikov kernel (equation (30) in appendix A).
The number of initial simulations, S0, of both algorithms was set to 10 and
∆S to 2. The error controlling parameter was set to ξ = 0.15. For PKS we
set the kernel in the y-direction to the Gaussian kernel. As before, bandwidths
are estimated using Silverman's rule.
It can be seen that PKS-ABC outperforms PSS-ABC, which was to be
expected. If the performances of both algorithms are compared to the local
counterparts, it can be seen that they are almost on par with them. The
global methods do have a slightly higher variance, but need far fewer simulator
calls: after 10K samples PKS used 890 simulation calls on average and
PSS 159, compared to 173148 and 200449 for AKL and ASL, respectively.
The big difference between PKS and PSS is due to the KDE: because PKS
fits the multimodality better, it needs more samples to obtain the same MH
error than the biased estimate of PSS does. Put another way, if PSS were
to obtain the same total variation distance, it would require fewer simulation
calls than it does now. Still, both global algorithms show a big gain in
simulation efficiency. However, neither algorithm reaches a point where no
more simulations are required.
The differences in total variation distance are reflected in the inferred
posterior distributions, shown in figure 9. There it can be seen that PSS
misses the multimodality in the posterior, while PKS obtains a decent ap-
proximation.
4.3 Blowfly Problem
The last experiment considers simulation of blowfly populations over time.
The behaviour of such a system is modelled using (discretized) differential
equations. For some parameter settings these equations can lead to chaotic
fluctuations [46].
This dynamical system generates N1, . . . , NT for each time step using the
Figure 9: Posterior distributions of the multimodal problem inferred by PSS (left) and PKS (right).
Figure 10: Convergence of PKS-ABC vs PSS-ABC on the multimodal problem, with S0 = 10, ∆S = 2 and ξ = 0.15. Results are averaged over 10 runs.
Figure 11: Example time series generated by equation (22) using a θ obtained by PKS-ABC. The black line is the observed time series.
following update equation:
Nt+1 = P · Nt−τ · exp[−Nt−τ/N0] · et + Nt · exp[−δ · εt]        (22)

where τ is an integer and et ∼ Gam(σp^−2, σp^−2) and εt ∼ Gam(σd^−2, σd^−2)
are sources of noise. There are six parameters for this problem, i.e. θ =
{log P, log δ, log N0, log σd, log σp, τ}. An example of time series generated
using parameter settings inferred by PKS-ABC is shown in figure 11. For
all but one parameter a Gaussian prior distribution is used; the exception
is the τ parameter, for which a Poisson prior is employed. The proposal
for the parameters that have a Gaussian prior is also Gaussian. On τ
we impose a left/right incremental proposal distribution that proposes either
to keep the current location or to move to one of the direct neighbours of the current
integer. That is:
Left-Right-Proposal(u) =
    u       with probability 0.5
    u − 1   with probability 0.25
    u + 1   with probability 0.25        (23)
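The update equation (22) and the proposal (23) can be sketched as follows. The parameter values are illustrative only (chosen in the rough range reported in the blowfly literature), not the settings inferred in this thesis:

```python
import numpy as np

def blowfly_series(P, delta, N0, sigma_d, sigma_p, tau, T=180, rng=None):
    """Simulate T steps of the discretised blowfly dynamics of equation (22).

    e_t ~ Gam(sigma_p^-2, sigma_p^-2) and eps_t ~ Gam(sigma_d^-2, sigma_d^-2)
    (shape/rate parametrization, so both noise terms have mean one).
    """
    if rng is None:
        rng = np.random.default_rng()
    shape_p, shape_d = sigma_p ** -2, sigma_d ** -2
    N = list(rng.uniform(100.0, 200.0, size=tau + 1))  # arbitrary positive history
    for t in range(tau, tau + T):
        e = rng.gamma(shape_p, 1.0 / shape_p)      # scale = 1 / rate
        eps = rng.gamma(shape_d, 1.0 / shape_d)
        N.append(P * N[t - tau] * np.exp(-N[t - tau] / N0) * e
                 + N[t] * np.exp(-delta * eps))
    return np.array(N[tau + 1:])

def left_right_proposal(u, rng=None):
    """Integer proposal of equation (23): stay w.p. 0.5, step +/-1 w.p. 0.25 each."""
    if rng is None:
        rng = np.random.default_rng()
    return u + int(rng.choice([0, 0, -1, 1]))  # two zeros give probability 0.5

rng = np.random.default_rng(2)
series = blowfly_series(P=29.0, delta=0.18, N0=260.0, sigma_d=0.6,
                        sigma_p=1.1, tau=7, rng=rng)
print(len(series), series.min())
```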
The choice of statistics is very important for any ABC problem [1, 10].
In total 10 statistics are used: the log of the mean of the 25% quantiles
of N/1000 (4 statistics), the mean of all 25% quantiles of the first-order
differences of N/1000 (4 statistics) and the number of peaks in the smoothed
time series for 2 different thresholds (2 statistics). Note that these are the
same statistics as in [25].
Since there is no closed-form likelihood function, the total variation distance
cannot be computed. Therefore, instead of comparing with a true
likelihood, we look at the convergence to
the observed values. For every iteration n a parameter vector θn is added to
the list of samples after accepting or rejecting. For each θn a corresponding
simulator output yn is generated. Using these simulator outputs yn at every
iteration, the convergence to the observed value y? can be measured.
The convergence of the expected value of y to y? can be computed using
the normalized mean square error (NMSE). This error term can conveniently
be computed in an online fashion by using a running average:
NMSE = (1/N) ∑_{n=1}^{N} (yn − y?)² / (y?)²        (24)
where yn is the simulator output associated with the parameter θn at iteration
n. The NMSE per simulation of Rejection ABC, KL-ABC, SL-ABC, PSS-
ABC and PKS-ABC is shown in figure 12.
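The running-average form of equation (24) can be sketched as follows (y? = 10 is a made-up observed value for illustration):

```python
def running_nmse(y_star):
    """Return a closure that maintains the NMSE of equation (24) online."""
    state = {"n": 0, "total": 0.0}
    def update(y_n):
        state["n"] += 1
        state["total"] += (y_n - y_star) ** 2 / y_star ** 2
        return state["total"] / state["n"]
    return update

nmse = running_nmse(y_star=10.0)
errors = [nmse(y) for y in (12.0, 9.0, 10.5, 10.0)]
print(errors)  # running NMSE after each simulator output
```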
For Rejection ABC we set the epsilon to 5. The number of simulations
per parameter location of SL and KL was set to 50. For SL a diagonal
covariance matrix was estimated instead of a full covariance matrix. This
was done because, with 6 parameters, the full covariance matrix has 36 entries
to be estimated. When using 50 training samples to estimate 36 values, the
estimates are very noisy and lead to a poor approximation. The epsilon for SL-ABC
was set to 0.5, as this led to better mixing.
For the KL algorithm we used the parametrized Gaussian kernel, as mentioned
in appendix C, with σ = 0.5. Silverman's rule is employed to
compute bandwidths.
The global methods had ξ set to 0.15 and ∆S = 2. The kernel in the θ
direction to weight the samples in the simulation history is a parametrized
Gaussian kernel with σ = 0.5. The kernel for the local regression is the
tricube kernel, which is formulated as:
k(u) = (70/81)(1 − |u|³)³ 1(|u| ≤ 1)        (25)
where 1(·) is the indicator function.
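The tricube kernel of equation (25) can be implemented and sanity-checked directly; the 70/81 constant makes it integrate to one on [−1, 1]:

```python
import numpy as np

def tricube(u):
    """Tricube kernel of equation (25): (70/81)(1 - |u|^3)^3 on |u| <= 1."""
    u = np.asarray(u, dtype=float)
    return 70.0 / 81.0 * (1.0 - np.abs(u) ** 3) ** 3 * (np.abs(u) <= 1)

print(float(tricube(0.0)))   # peak value 70/81
print(float(tricube(1.5)))   # zero outside the support
# Numerical check that the kernel integrates to one:
grid = np.linspace(-1.0, 1.0, 20001)
print(np.sum(tricube(grid)) * (grid[1] - grid[0]))
```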
However, for the PSS algorithm we used the orthogonal version, so
LOWESS was not used. For the PKS algorithm, on the other hand, we do
perform a local regression. The kernel in the y-direction is an ordinary
Gaussian distribution.
It is shown in figure 12 that PKS-ABC converges to the observed values
orders of magnitude faster than the other algorithms in terms of simulation
calls. It is also interesting to compare the NMSE of the other algorithms at the final
number of simulations reached by PKS: for every statistic, PKS has the lowest NMSE
at that simulation budget.
There are some unexpected results for SL-ABC in particular. Its initial
NMSEs are orders of magnitude higher than those of the other algorithms.
Perhaps surprisingly, its final NMSEs are comparable.
Next to this convergence to the observed values, we can also compare
the posterior distributions inferred by the different algorithms. The inferred
posteriors of the first three statistics are shown in figure 13. The posteriors
shown are based on a single run of each algorithm.
The posterior distributions that rejection sampling produces are broader
than the others, but most posteriors generally share the same mode. Recall
that the global algorithms converged much faster in terms of simulation calls.
Figure 12: The NMSE convergence of different algorithms for different statistics. Each subplot shows the convergence as a function of the number of simulations.
9:  end for
10: S ← S + c
11: c ← ∆S
12: Estimate µ, Σ and µ′, Σ′ using equations (8) and (9)
13: for m ← 1 to M do
14:     µm ∼ N(µ, S−1Σ)
15:     µ′m ∼ N(µ′, S−1Σ′)
16:     Set αm using equation (10)
17: end for
18: τ ← median(α)
19: Set E(α) using equation (12)
20: until E(α) ≤ ξ
21: if U(0, 1) ≤ τ then
22:     return θ′
23: end if
24: return θ
25: end procedure
C Kernel choice and bandwidth selection
The performance of the different kernel methods proposed in this thesis
depends on the kernel and bandwidth selection. In this section a couple of
remarks and suggestions are made about the selection of both.
C.1 Kernel choice
For the kernel in the y-direction it is desirable to have a kernel function with
infinite support, such as the Gaussian kernel. To see why this is the case,
imagine a parameter location that yields simulator values that are very far from
the observed ones. In other words, at θ the simulated y's are far from y?.
Next, a new parameter location θ′ is proposed, which is still not in proximity
of the observed value. If we were to employ a kernel with finite support,
the probabilities π(y? | θ, x) and π(y? | θ′, x) would both be zero. Hence,
the decision whether to reject or accept a proposed parameter location will
depend exclusively on the prior.6 Consequently, the algorithm becomes more
like rejection ABC in terms of simulation costs.
Thus it is desirable to choose only kernels for the y-direction that have
infinite support, for example the Gaussian kernel.
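The argument can be made concrete with a small numeric sketch (all values made up). With a finite-support kernel, both parameter locations receive probability exactly zero, so the acceptance ratio degenerates to 0/0; the Gaussian kernel still ranks the two locations:

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def epanechnikov_kernel(u):
    return 0.75 * (1.0 - u ** 2) * (np.abs(u) <= 1)

y_star, h = 9.42, 20.0
ys_theta = np.array([80.0, 85.0, 90.0])        # simulations at theta, far from y*
ys_theta_prime = np.array([60.0, 65.0, 70.0])  # simulations at theta', also far

results = {}
for kernel in (epanechnikov_kernel, gaussian_kernel):
    p = np.mean(kernel((y_star - ys_theta) / h)) / h
    p_prime = np.mean(kernel((y_star - ys_theta_prime) / h)) / h
    results[kernel.__name__] = (p, p_prime)
    print(kernel.__name__, p, p_prime)
# Epanechnikov: both estimates are zero, so the MH ratio is undefined.
# Gaussian: both are tiny but nonzero, and theta' is correctly preferred.
```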
C.2 Bandwidth selection
In the optimal scenario, the bandwidth selection method deals perfectly with
the scale of the data. However, when using rules of thumb, this is seldom
the case. It should be noted that more sophisticated bandwidth selection
algorithms can be used, such as the Sheather-Jones estimator.
To illustrate the difference, the performance of the KL-ABC algorithm
using two different bandwidth selection methods was tested on two problems.
The problems are the two toy problems from section 4: the exponential
problem and the multimodal problem. For both problems the used kernel
6Provided a symmetric proposal is used, so the two proposal probabilities cancel against each other. If no special case is built for this 0/0 situation, the acceptance probability is undefined, which is even worse.
Figure 16: The total variation distance of KL-ABC using different bandwidth selection methods. Left: the exponential problem; right: the multimodal problem.
was the Gaussian kernel and S was set to 20.
The results are shown in figure 16. As Silverman's rule of thumb assumes
an underlying Gaussian density, it behaves poorly on the multimodal
problem. On the exponential problem, where the underlying density is
Gaussian, both bandwidth selectors have similar performance. It should be
noted, though, that the Sheather-Jones estimator is computationally more expensive:
O(n²), compared to O(n) for Silverman's rule, where n is the number
of training points.
Instead of using this costly estimator, we can also manually make the
Gaussian narrower and thus act as if the bandwidth were smaller than
the one selected by the rule of thumb. This gives the parametrized Gaussian kernel:
kσ(u) = 1/(√(2π) σ) exp[−u²/(2σ²)]        (47)
Note that this is equivalent to multiplying the computed bandwidth by the
factor σ.
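This equivalence is easy to verify numerically (a minimal sketch using the standard KDE form; the sample values are arbitrary):

```python
import numpy as np

def gaussian(u, sigma=1.0):
    """Parametrized Gaussian kernel of equation (47)."""
    return np.exp(-0.5 * (u / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

def kde(x, samples, h, sigma=1.0):
    """Kernel density estimate with kernel k_sigma and bandwidth h."""
    return np.mean(gaussian((x - samples) / h, sigma)) / h

samples = np.array([0.3, 1.2, 2.1, 2.2, 3.0])
x, h, sigma = 1.5, 0.8, 0.5

# Using k_sigma with bandwidth h equals the standard Gaussian kernel
# with the shrunk bandwidth h * sigma.
lhs = kde(x, samples, h, sigma)
rhs = kde(x, samples, h * sigma, 1.0)
print(lhs, rhs)
```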
The influence of different settings of σ on the total variation distance
Figure 17: The effect of smaller sigmas (AKL with σ² = 1, 1/2 and 1/4) when using Silverman's rule of thumb.
is shown in figure 17. For each experiment Silverman's rule of thumb was
used for bandwidth selection. This way, performance similar to that of the more
expensive estimator can be obtained while avoiding its cost. The optimal
value of σ² depends on the problem and can therefore not be set perfectly
beforehand, but we found that a value of 1/4 generally works well.