Approximate Bayesian computation (ABC)
NIPS Tutorial
Richard Wilkinson ([email protected])
School of Mathematical Sciences, University of Nottingham
December 5, 2013
Computer experiments
Rohrlich (1991): Computer simulation is
‘a key milestone somewhat comparable to the milestone that started the empirical approach (Galileo) and the deterministic mathematical approach to dynamics (Newton and Laplace)’
Challenges for statistics: how do we make inferences about the world from a simulation of it?
how do we relate simulators to reality? (model error)
how do we estimate tunable parameters? (calibration)
how do we deal with computational constraints? (stat. comp.)
how do we make uncertainty statements about the world thatcombine models, data and their corresponding errors? (UQ)
Calibration
For most simulators we specify parameters θ and initial conditions, and the simulator, f(θ), generates output X.
We are interested in the inverse problem: we observe data D and want to estimate parameter values θ which explain this data.
For Bayesians, this is a question of finding the posterior distribution
π(θ|D) ∝ π(θ)π(D|θ)
posterior ∝prior× likelihood
Intractability
π(θ|D) = π(D|θ)π(θ) / π(D)
The usual intractability in Bayesian inference is not knowing π(D).
A problem is doubly intractable if π(D|θ) = c_θ p(D|θ) with c_θ unknown (cf. Murray, Ghahramani and MacKay 2006).
A problem is completely intractable if π(D|θ) is unknown and can't be evaluated ('unknown' is subjective), i.e., if the analytic distribution of the simulator, f(θ), run at θ is unknown.
Completely intractable models are where we need to resort to ABC methods.
Common example
Tanaka et al. 2006, Wilkinson et al. 2009, Neal and Huang 2013, etc.
Many models have unobserved branching processes that lead to the data, making calculation difficult. For example, the density of the cumulative process is unknown in general.
Approximate Bayesian Computation (ABC)
Given a complex simulator for which we can't calculate the likelihood function, how do we do inference?
If it's cheap to simulate, then ABC (approximate Bayesian computation) is one of the few approaches we can use.
ABC algorithms are a collection of Monte Carlo methods used for calibrating simulators
they do not require explicit knowledge of the likelihood function
inference is done using simulation from the model (they are 'likelihood-free').
Approximate Bayesian computation (ABC)
ABC methods are primarily popular in biological disciplines, particularly genetics and epidemiology, and their use looks set to continue growing.
Simple to implement
Intuitive
Embarrassingly parallelizable
Can usually be applied
ABC methods can be crude but they have an important role to play.
First ABC paper candidates
Beaumont et al. 2002
Tavare et al. 1997 or Pritchard et al. 1999
Or Diggle and Gratton 1984 or Rubin 1984
. . .
Tutorial Plan
Part I
i. Basics
ii. Efficient algorithms
iii. Links to other approaches
Part II
iv. Regression adjustments/ post-hoc corrections
v. Summary statistics
vi. Accelerating ABC using Gaussian processes
Basics
‘Likelihood-Free’ Inference
Rejection Algorithm
Draw θ from prior π(·)
Accept θ with probability π(D | θ)
Accepted θ are independent draws from the posterior distribution, π(θ | D).
If the likelihood, π(D|θ), is unknown:
‘Mechanical’ Rejection Algorithm
Draw θ from π(·)
Simulate X ∼ f(θ) from the computer model
Accept θ if D = X , i.e., if computer output equals observation
The acceptance rate is ∫ P(D|θ)π(θ) dθ = P(D).
The number of runs needed to get n observations is negative binomial, with mean n/P(D) ⇒ Bayes factors!
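For discrete data, exact matches X = D occur with positive probability, so the acceptance rate of the mechanical rejection algorithm is itself a Monte Carlo estimate of P(D) (and ratios of such rates across models give Bayes factors). A minimal sketch, with an assumed Binomial(10, θ) simulator and U[0, 1] prior chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
D = 3                      # observed count (discrete, so exact matches occur)
n_runs = 20000

def simulator(theta):
    # Illustrative discrete simulator: X ~ Binomial(10, theta)
    return rng.binomial(10, theta)

accepted = []
for _ in range(n_runs):
    theta = rng.uniform()              # draw theta from the U[0, 1] prior
    if simulator(theta) == D:          # accept iff X = D exactly
        accepted.append(theta)

p_D_hat = len(accepted) / n_runs       # acceptance rate estimates P(D)
```

Here the true P(D) is ∫ C(10,3) θ³(1−θ)⁷ dθ = 1/11 ≈ 0.091, so the acceptance rate should settle near that value.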
Rejection ABC
If P(D) is small (or D continuous), we will rarely accept any θ. Instead, there is an approximate version:
Uniform Rejection Algorithm
Draw θ from π(θ)
Simulate X ∼ f (θ)
Accept θ if ρ(D,X ) ≤ ε
This generates observations from π(θ | ρ(D, X) ≤ ε):
As ε → ∞, we get observations from the prior, π(θ).
If ε = 0, we generate observations from π(θ | D).
ε reflects the tension between computability and accuracy.
For reasons that will become clear later, we call this uniform-ABC.
ε = 10
[Figure: scatter of θ against simulator output D showing the acceptance band D ± ε, with the resulting ABC posterior density for θ plotted against the true posterior.]
θ ∼ U[−10, 10], X ∼ N(2(θ + 2)θ(θ − 2), 0.1 + θ2)
ρ(D,X ) = |D − X |, D = 2
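The toy model above is simple enough to run uniform rejection ABC directly. A minimal sketch (assuming the second argument of N(·, ·) denotes a variance):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulator(theta):
    # One draw of X ~ N(2(theta+2)theta(theta-2), 0.1 + theta^2)
    mean = 2 * (theta + 2) * theta * (theta - 2)
    var = 0.1 + theta ** 2
    return rng.normal(mean, np.sqrt(var))

def uniform_rejection_abc(D, eps, n_samples):
    accepted = []
    while len(accepted) < n_samples:
        theta = rng.uniform(-10, 10)   # draw theta from the U[-10, 10] prior
        X = simulator(theta)           # simulate X ~ f(theta)
        if abs(D - X) <= eps:          # accept if rho(D, X) = |D - X| <= eps
            accepted.append(theta)
    return np.array(accepted)

samples = uniform_rejection_abc(D=2.0, eps=1.0, n_samples=500)
```

Shrinking eps towards 0 sharpens the approximation at the cost of a lower acceptance rate, which is exactly the computability/accuracy tension shown in the figures.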
ε = 7.5
[Figure: θ vs D scatter with acceptance band D ± ε, and ABC posterior density against the true posterior, for ε = 7.5.]
ε = 5
[Figure: θ vs D scatter with acceptance band D ± ε, and ABC posterior density against the true posterior, for ε = 5.]
ε = 2.5
[Figure: θ vs D scatter with acceptance band D ± ε, and ABC posterior density against the true posterior, for ε = 2.5.]
ε = 1
[Figure: θ vs D scatter with acceptance band D ± ε, and ABC posterior density against the true posterior, for ε = 1.]
Rejection ABC
If the data are too high dimensional we never observe simulations that are 'close' to the field data (curse of dimensionality).
Reduce the dimension using summary statistics, S(D).
Approximate Rejection Algorithm With Summaries
Draw θ from π(θ)
Simulate X ∼ f (θ)
Accept θ if ρ(S(D), S(X )) < ε
If S is sufficient this is equivalent to the previous algorithm.
Simple → Popular with non-statisticians
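A sketch of the summaries version. The simulator (100 iid N(θ, 1) draws), the U[−5, 5] prior, the choice S(x) = (mean, sd), and the tolerance are all illustrative assumptions, not part of the tutorial:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulator(theta, n=100):
    # Illustrative simulator: n iid N(theta, 1) draws
    return rng.normal(theta, 1.0, size=n)

def summary(x):
    # S(x): reduce a dataset to (mean, standard deviation)
    return np.array([x.mean(), x.std()])

D = simulator(0.5)          # stand-in for field data
sD = summary(D)

accepted = []
while len(accepted) < 200:
    theta = rng.uniform(-5, 5)              # prior draw
    sX = summary(simulator(theta))          # summarise simulator output
    if np.linalg.norm(sD - sX) < 0.3:       # accept if rho(S(D), S(X)) < eps
        accepted.append(theta)
accepted = np.array(accepted)
```

Comparing two-dimensional summaries rather than 100-dimensional datasets keeps the acceptance rate workable; the cost is the approximation incurred when S is not sufficient.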
Two ways of thinking
We think about linear regression in two ways:
Algorithmic: find the straight line that minimizes the sum of squared errors
Probabilistic: a linear model with Gaussian errors, fit using MAP estimates.
Kalman filter:
Algorithmic: linear quadratic estimation: find the best guess at the trajectory using linear dynamics and a quadratic penalty function
Probabilistic: the (Bayesian) solution to the linear Gaussian filtering problem.
The same dichotomy exists for ABC.
Algorithmic: find a good metric, tolerance, summary, etc.
Probabilistic: what model does ABC correspond to, and how should this inform our choices?
Modelling interpretation - Calibration framework
Wilkinson 2008/2013
We can show that ABC is "exact", but for a different model to that intended. πABC(D|θ) is not just the simulator likelihood function:

πABC(D|θ) = ∫ πε(D|x) π(x|θ) dx

πε(D|x) is a pdf relating the simulator output to reality; call it the acceptance kernel.
π(x|θ) is the likelihood function of the simulator (i.e., not relating to reality).
Common way of thinking (Kennedy and O'Hagan 2001):
Relate the best simulator run (X = f(θ)) to reality ζ.
Relate reality ζ to the observations D.

θ → f(θ) → ζ → D
(sim error) (meas error)
Calibration framework
The posterior is

πABC(θ|D) = (1/Z) ∫ πε(D|x) π(x|θ) dx · π(θ)

where Z = ∫∫ πε(D|x) π(x|θ) π(θ) dx dθ.

To simplify matters, we can work in the joint (θ, x) space:

πABC(θ, x|D) = πε(D|x) π(x|θ) π(θ) / Z

NB: we can allow πε(D|x) to depend on θ.
How does ABC relate to calibration?
Consider how this relates to ABC:

πABC(θ, x) := π(θ, x|D) = πε(D|x) π(x|θ) π(θ) / Z

Let's sample from this using the rejection algorithm with instrumental distribution

g(θ, x) = π(x|θ) π(θ)

Note: supp(πABC) ⊆ supp(g), and there exists a constant M = max_x πε(D|x) / Z such that

πABC(θ, x) ≤ M g(θ, x) ∀ (θ, x)
Generalized ABC (GABC)
Wilkinson 2008, Fearnhead and Prangle 2012
The rejection algorithm then becomes

Generalized rejection ABC (Rej-GABC)
1. θ ∼ π(θ) and X ∼ π(x|θ) (i.e., (θ, X) ∼ g(·))
2. Accept (θ, X) if

U ∼ U[0, 1] ≤ πABC(θ, x) / (M g(θ, x)) = πε(D|X) / max_x πε(D|x)

In uniform ABC we take

πε(D|X) = 1 if ρ(D, X) ≤ ε, and 0 otherwise

which reduces the algorithm to
2′. Accept θ if ρ(D, X) ≤ ε
i.e., we recover the uniform ABC algorithm.
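A sketch of Rej-GABC with a Gaussian acceptance kernel πε(D|x) ∝ exp(−(D − x)²/2ε²), for which the ratio to its maximum has a closed form. The simulator (X ∼ N(θ, 1)) and the N(0, 3²) prior are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
D, eps = 2.0, 0.5

def simulator(theta):
    # Illustrative simulator: X ~ N(theta, 1)
    return rng.normal(theta, 1.0)

def accept_prob(x):
    # pi_eps(D|x) / max_x pi_eps(D|x) for a Gaussian acceptance kernel
    return np.exp(-(D - x) ** 2 / (2 * eps ** 2))

accepted = []
while len(accepted) < 300:
    theta = rng.normal(0.0, 3.0)            # prior draw, N(0, 3^2)
    x = simulator(theta)
    if rng.uniform() <= accept_prob(x):     # Rej-GABC step 2
        accepted.append(theta)
accepted = np.array(accepted)
```

Replacing accept_prob with the indicator of |D − x| ≤ eps recovers the uniform rejection algorithm, exactly as in step 2′.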
Uniform ABC algorithm
This allows us to interpret uniform ABC. Suppose X ,D ∈ R
Proposition
Accepted θ from the uniform ABC algorithm (with ρ(D, X) = |D − X|) are samples from the posterior distribution of θ given D, where we assume D = f(θ) + e and that
e ∼ U[−ε, ε]
In general, uniform ABC assumes that
D|x ∼ U{d : ρ(d , x) ≤ ε}
i.e., D is generated by adding noise uniformly chosen from a ball of radius ε around the best simulator output f(θ).
ABC gives ‘exact’ inference under a different model!
Acceptance Kernel - π(D|x)
Kennedy and O'Hagan 2001, Goldstein and Rougier 2009
How do we relate the simulator to reality?
1. Measurement error: D = ζ + e; let πε(D|X) be the distribution of e.
2. Model error: ζ = f(θ) + δ; let πε(D|X) be the distribution of δ.
Or both: πε(D|x) is a convolution of the two distributions.
3. Sampling a hidden space: often the data D are noisy observations of some latent feature (call it X), which is generated by a stochastic process. By removing the stochastic sampling from the simulator we can let π(D|x) do the sampling for us (Rao-Blackwellisation).
Kernel Smoothing
Blum 2010, Fearnhead and Prangle 2012
Viewing ABC as an extension of modelling isn't commonly done, but it
allows us to do the inference we want (and to interpret it)
makes explicit the relationship between simulator and observations
allows for the possibility of more efficient ABC algorithms
A different but equivalent view of ABC is as kernel smoothing:

πABC(θ|D) ∝ ∫ Kε(D − x) π(x|θ) π(θ) dx

where Kε(x) = (1/ε) K(x/ε), K is a standard kernel, and ε is the bandwidth.
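The kernel-smoothing view suggests a weighted (importance-sampling) variant: rather than accepting or rejecting, weight each prior draw by Kε(D − x). A sketch with a Gaussian K; the simulator and prior are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
D, eps = 2.0, 0.5

def simulator(theta):
    # Illustrative simulator: X ~ N(theta, 1)
    return rng.normal(theta, 1.0)

thetas = rng.normal(0.0, 3.0, size=5000)         # prior draws, N(0, 3^2)
xs = np.array([simulator(t) for t in thetas])    # simulator outputs
# K_eps(D - x) = (1/eps) K((D - x)/eps) with Gaussian K
weights = np.exp(-0.5 * ((D - xs) / eps) ** 2) / eps
weights /= weights.sum()                         # self-normalise
post_mean = float(np.sum(weights * thetas))      # weighted estimate of E[theta|D]
```

Every simulation contributes, so nothing is thrown away; the price is weight degeneracy when eps is very small, the smooth analogue of a vanishing acceptance rate.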
Efficient Algorithms
References:
Marjoram et al. 2003
Sisson et al. 2007
Beaumont et al. 2008
Toni et al. 2009
Del Moral et al. 2011
Drovandi et al. 2011
ABCifying Monte Carlo methods
Rejection ABC is the basic ABC algorithm.
Inefficient as it repeatedly samples from prior
A large number of papers have been published turning other Monte Carlo algorithms into ABC-type algorithms for when we don't know the likelihood: IS, MCMC, SMC, EM, EP, etc.
Focus on MCMC and SMC
presented for GABC with acceptance kernels, but most of the algorithms were written down for uniform ABC, i.e.,

πε(D|X) = I{ρ(D, X) ≤ ε}

and we can make this choice in most cases if desired.
MCMC-ABC
Marjoram et al. 2003
We are targeting the joint distribution

πABC(θ, x|D) ∝ πε(D|x) π(x|θ) π(θ)

To explore the (θ, x) space, proposals of the form

Q((θ, x), (θ′, x′)) = q(θ, θ′) π(x′|θ′)

seem to be inevitable (q arbitrary).
The Metropolis-Hastings (MH) acceptance probability is then

r = πABC(θ′, x′|D) Q((θ′, x′), (θ, x)) / [πABC(θ, x|D) Q((θ, x), (θ′, x′))]
  = πε(D|x′) π(x′|θ′) π(θ′) q(θ′, θ) π(x|θ) / [πε(D|x) π(x|θ) π(θ) q(θ, θ′) π(x′|θ′)]
  = πε(D|x′) q(θ′, θ) π(θ′) / [πε(D|x) q(θ, θ′) π(θ)]
This gives the following MCMC kernel

MH-ABC - PMarj(θ0, ·)
1. Propose a move from zt = (θ, x) to (θ′, x′) using the proposal Q above.
2. Accept the move with probability

r((θ, x), (θ′, x′)) = min(1, πε(D|x′) q(θ′, θ) π(θ′) / [πε(D|x) q(θ, θ′) π(θ)]),

otherwise set zt+1 = zt.

In practice, we find this algorithm often gets stuck at a given θ, as the probability of generating x′ near D can be tiny if ε is small.
Note that this is a special case of a "pseudo-marginal" Metropolis-Hastings algorithm, and can be modified to use multiple simulations at each θ, i.e.

r = min(1, [Σ_{i=1}^N πε(D|x′_i)] q(θ′, θ) π(θ′) / ([Σ_{i=1}^N πε(D|x_i)] q(θ, θ′) π(θ)))

to better approximate the likelihood.
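A sketch of MH-ABC with the uniform acceptance kernel, where πε(D|x′) is the indicator of ρ(D, x′) ≤ ε, so the MH ratio reduces to the prior (and proposal) ratio whenever the fresh simulation lands in the ball. The simulator, prior, and tuning constants are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
D, eps = 2.0, 0.5

def simulator(theta):
    # Illustrative simulator: X ~ N(theta, 1)
    return rng.normal(theta, 1.0)

def log_prior(theta):
    # N(0, 3^2) prior, up to an additive constant
    return -0.5 * theta ** 2 / 9.0

theta = 2.0                    # initial state
chain = [theta]
for _ in range(5000):
    theta_prop = theta + rng.normal(0.0, 1.0)   # symmetric proposal q
    x_prop = simulator(theta_prop)              # x' ~ pi(x|theta')
    # Uniform kernel: the move can only be accepted if the simulation hits
    if abs(D - x_prop) <= eps:
        log_r = log_prior(theta_prop) - log_prior(theta)
        if np.log(rng.uniform()) < log_r:
            theta = theta_prop
    chain.append(theta)
chain = np.array(chain)
```

The sticking behaviour described on the slide is visible here: as eps shrinks, the hit probability |D − x′| ≤ eps collapses and the chain rarely moves.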
Recent developments - Lee 2012
1-hit MCMC kernel - P1hit(θ0, ·)
1. Propose θ′ ∼ q(θt, ·)
2. With probability 1 − min(1, q(θ′, θt) π(θ′) / (q(θt, θ′) π(θt))) set θt+1 = θt and stop
3. Otherwise, sample x′ ∼ π(·|θ′) and x ∼ π(·|θt) until ρ(x′, D) ≤ ε or ρ(x, D) ≤ ε
4. If ρ(x′, D) ≤ ε set θt+1 = θ′, otherwise set θt+1 = θt
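A sketch of the 1-hit kernel with a symmetric proposal (so q cancels in step 2) and an illustrative N(θ, 1) simulator and N(0, 1) prior; none of these choices come from the slide:

```python
import numpy as np

rng = np.random.default_rng(5)
D, eps = 1.0, 0.5

def simulator(theta):
    # Illustrative simulator: X ~ N(theta, 1)
    return rng.normal(theta, 1.0)

def prior_ratio(theta_new, theta_old):
    # N(0, 1) prior; the symmetric proposal q cancels in the ratio
    return np.exp(-0.5 * (theta_new ** 2 - theta_old ** 2))

theta = 1.0
chain = [theta]
for _ in range(1000):
    theta_prop = theta + rng.normal(0.0, 1.0)            # step 1: propose
    if rng.uniform() < min(1.0, prior_ratio(theta_prop, theta)):
        # steps 3-4: race simulations at theta' and theta until one hits
        while True:
            if abs(D - simulator(theta_prop)) <= eps:    # theta' hits first
                theta = theta_prop
                break
            if abs(D - simulator(theta)) <= eps:         # current theta hits
                break
    chain.append(theta)
chain = np.array(chain)
```

Each iteration may consume many simulations in the race, which is the extra per-iteration cost noted below; the pay-off is that the chain cannot get permanently stuck the way PMarj can.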
Recent developments
Lee et al. 2013 showed PMarj is neither
variance bounding: let Ê[h(θ)] = (1/m) Σ h(θi); a Markov kernel P is variance bounding if VarP(Ê[h(θ)]) is "reasonably small"
nor geometrically ergodic (GE), i.e., ||P^m(θ0, ·) − πABC(·)||_TV ≤ C ρ^m with ρ < 1; Markov kernels that are not GE may converge extremely slowly,
whereas P1hit is (subject to conditions).
[Figure 6 from Lee et al. 2013: density estimates of the marginal posteriors for the Lotka-Volterra model. Figure 7: estimates of the posterior mean of θ3 by iteration for kernels P1,1, P1,15, P2,15 and P3, with horizontal lines marking the rejection-sampler estimate plus and minus two estimated standard deviations.]
Note that P1hit requires significantly more computation per iteration (but this may be worth it).
Recent developmentsLee et al. 2013 showed PMarj is neither
variance boundingI Let Eh(θ) = 1
m
∑h(θi ) - Markov kernel P is variance bounding if
VarP(Eh(θ)) is ”reasonably small”
nor geometrically ergodic (GE) i.e ||Pm(θ0, ·)− πABC (·)||TV ≤ Cρm
where ρ < 1. Markov kernels that are not GE may convergenceextremely slowly.
whereas P1hit is (subject to conditions).
(a) ✓1 (b) ✓2 (c) ✓3
Figure 6: Density estimates of the marginal posteriors for the Lotka-Volterra model.
[Trace plots by iteration; panels (a) P1,1, (b) P1,15, (c) P2,15, (d) P3; vertical axis range 0.79–0.83.]
Figure 7: Estimates of the posterior mean of θ3 by iteration using each kernel. The three horizontal lines correspond to the estimate obtained using the rejection sampler with two estimated standard deviations added and subtracted.
and historical uses in statistics can be traced through Feller (1940); Doob (1945); Kendall (1949, 1950), and the method was rediscovered in Gillespie (1977) in the context of stochastic kinetic models. These articles develop a straightforward way to simulate the full process X1:2(t), t ∈ [0, 10], as the inter-jump times are exponential random variables, although more sophisticated approaches are possible (see, e.g., Wilkinson, 2011, Chapter 8).
Importance sampling GABC

In uniform ABC, importance sampling simply reduces to the rejection algorithm with a fixed budget for the number of simulator runs.

But for GABC it opens new algorithms:

GABC - Importance sampling
1 θi ∼ π(θ) and Xi ∼ π(x|θi).
2 Give (θi, xi) weight wi = πε(D|xi).

Which is more efficient - IS-GABC or Rej-GABC?

Proposition 2
IS-GABC has a larger effective sample size than Rej-GABC, or equivalently

VarRej(w) ≥ VarIS(w)

This can be seen as a Rao-Blackwell type result.
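The IS-GABC scheme above can be sketched in a few lines. This is a minimal toy example, assuming a scalar Gaussian simulator and a Gaussian acceptance kernel πε; the simulator and all numerical values are illustrative, not from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 2.0        # observed data (toy scalar)
eps = 0.5      # acceptance-kernel bandwidth
N = 10_000

def simulator(theta):            # toy simulator: X | theta ~ N(theta, 1)
    return rng.normal(theta, 1.0)

def accept_kernel(d, x):         # pi_eps(D | x): Gaussian acceptance kernel
    return np.exp(-0.5 * ((d - x) / eps) ** 2)

theta = rng.normal(0.0, 2.0, N)  # step 1: draws from the prior pi(theta) = N(0, 2^2)
x = simulator(theta)             #         and simulator output for each draw
w = accept_kernel(D, x)          # step 2: importance weights w_i = pi_eps(D | x_i)
w /= w.sum()                     # normalise the weights

post_mean = np.sum(w * theta)    # weighted estimate of E(theta | D)
ess = 1.0 / np.sum(w ** 2)       # effective sample size of the weighted sample
```

Unlike rejection, every simulator run contributes (with some weight), which is the Rao-Blackwellisation behind Proposition 2.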
Rejection Control (RC)

A difficulty with IS algorithms is that they can require the storage of a large number of particles with small weights.

Thin particles with small weights using rejection control:

Rejection Control in IS-GABC
1 θi ∼ π(θ) and Xi ∼ π(X|θi)
2 Accept (θi, Xi) with probability

r(Xi) = min(1, πε(D|Xi)/C)

for any threshold constant C ≥ 0.
3 Give accepted particles weights

wi = max(πε(D|Xi), C)

IS is more efficient than RC, unless we have memory constraints (relative to processor time).
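The thinning step can be sketched as follows; a minimal sketch with illustrative weights, not the tutorial's own code. The key identity is E[max(w, C) · min(1, w/C)] = w, so the reweighting keeps the estimator unbiased.

```python
import numpy as np

rng = np.random.default_rng(1)
C = 0.1   # rejection-control threshold (illustrative choice)

def rejection_control(particles, weights, C, rng):
    """Keep particle i with probability min(1, w_i / C),
    and give survivors weight max(w_i, C)."""
    p = np.minimum(1.0, weights / C)
    keep = rng.random(len(weights)) < p
    return particles[keep], np.maximum(weights[keep], C)

theta = rng.normal(size=1000)
w = np.exp(-theta ** 2)          # illustrative unnormalised weights
theta_rc, w_rc = rejection_control(theta, w, C, rng)
```

After thinning, no stored particle has weight below C, at the cost of a little extra Monte Carlo noise.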
Sequential ABC algorithms

The most popular efficient ABC algorithms are those based on sequential methods (Sisson et al. 2007, Toni et al. 2008, Beaumont et al. 2009, ...).

We aim to sample N particles successively from a sequence of distributions

π1(θ), . . . , πT(θ) = target

For ABC we decide upon a sequence of tolerances ε1 > ε2 > . . . > εT and let πt be the distribution found by the ABC algorithm when we use tolerance εt.

Specifically, define a sequence of target distributions

πt(θ, x) = πt(D|x)π(x|θ)π(θ)/Ct = γt(θ, x)/Ct

with πt(D|X) = πεt(D|X).
ABC SMC (Toni et al., 2009)

(a) As in ABC rejection, we define a prior distribution P(θ) and we would like to approximate a posterior distribution P(θ|D0). In ABC SMC we do this sequentially by constructing intermediate distributions, which converge to the posterior distribution. We define a tolerance schedule ε1 > ε2 > . . . > εT ≥ 0.

(b) We sample particles from a prior distribution until N particles have been accepted (have reached a distance smaller than ε1). For all accepted particles we calculate weights (see [4] for formulas and derivation). We call the sample of all accepted particles "Population 1".

(c) We then sample a particle θ∗ from population 1 and perturb it to obtain a perturbed particle θ∗∗ ∼ K(θ|θ∗), where K is a perturbation kernel (for example a Gaussian random walk). We then simulate a dataset D∗ ∼ f(D|θ∗∗) and accept the particle θ∗∗ if d(D0, D∗) ≤ ε2. We repeat this until we have accepted N particles in population 2. We calculate weights for all accepted particles.

(d) We repeat the same procedure for the following populations, until we have accepted N particles of the last population T and calculated their weights. Population T is a sample of particles that approximates the posterior distribution.

ABC SMC is computationally much more efficient than ABC rejection (see [4] for comparison).
[Schematic, panels (a)–(d): intermediate distributions moving from prior to posterior under tolerances ε1 > ε2 > . . . > εT−1 > εT, through Population 1, Population 2, . . . , Population T.]
Figure 2: Schematic representation of ABC SMC.
Picture from Toni and Stumpf 2010 tutorial
At each stage t, we aim to construct a weighted sample of particles that approximates πt(θ, x):

{(z_t^(i), W_t^(i))}_{i=1}^N  such that  πt(z) ≈ ∑ W_t^(i) δ_{z_t^(i)}(dz)

where z_t^(i) = (θ_t^(i), x_t^(i)).
Toni et al. (2008)

Assume we have a cloud of weighted particles {(θi, wi)}_{i=1}^N that were accepted at step t − 1.

1 Sample θ from the previous population according to the weights.
2 Perturb the particle according to perturbation kernel qt, i.e., θ ∼ qt(θ, ·).
3 Reject the particle immediately if θ has zero prior density, i.e., if π(θ) = 0.
4 Otherwise simulate X ∼ f(θ) from the simulator. If ρ(S(X), S(D)) ≤ εt accept the particle, otherwise reject.
5 Give the accepted particle weight

wi = π(θ) / ∑_j wj qt(θj, θ)

6 Repeat steps 1-5 until we have N accepted particles at step t.
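One population step of the algorithm above can be sketched as follows. This is a toy sketch, assuming a scalar Gaussian simulator, an N(0, 2²) prior, and a Gaussian perturbation kernel; all numerical choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
D = 3.0

def simulator(theta):                      # toy simulator: X | theta ~ N(theta, 1)
    return rng.normal(theta, 1.0)

def prior_pdf(theta):                      # prior N(0, 2^2)
    return np.exp(-0.5 * theta ** 2 / 4) / np.sqrt(2 * np.pi * 4)

def smc_step(theta_prev, w_prev, eps_t, sigma_q=0.5, N=500):
    """One population of ABC-SMC in the style of Toni et al. (2008)."""
    theta_new, w_new = [], []
    while len(theta_new) < N:
        j = rng.choice(len(theta_prev), p=w_prev)   # 1. resample by weight
        theta = rng.normal(theta_prev[j], sigma_q)  # 2. perturb
        if prior_pdf(theta) == 0:                   # 3. reject zero-prior draws
            continue
        x = simulator(theta)                        # 4. simulate and compare
        if abs(x - D) <= eps_t:
            # 5. weight = prior / mixture of perturbation kernels
            #    (unnormalised Gaussian; the constant cancels on normalisation)
            denom = np.sum(w_prev * np.exp(-0.5 * ((theta - theta_prev) / sigma_q) ** 2))
            theta_new.append(theta)
            w_new.append(prior_pdf(theta) / denom)
    theta_new, w_new = np.array(theta_new), np.array(w_new)
    return theta_new, w_new / w_new.sum()

# initial population from rejection ABC with a loose tolerance
theta0 = rng.normal(0, 2, 5000)
keep = np.abs(simulator(theta0) - D) <= 2.0
theta_prev = theta0[keep]
w_prev = np.full(keep.sum(), 1.0 / keep.sum())
theta1, w1 = smc_step(theta_prev, w_prev, eps_t=1.0)
```

In practice one would iterate `smc_step` over the whole tolerance schedule ε1 > ε2 > . . . > εT.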
Sequential Monte Carlo (SMC)

All the SMC-ABC algorithms can be understood as special cases of Del Moral et al. 2006.

If at stage t we use proposal distribution ηt(z) for the particles, then we create the weighted sample as follows:

Generic Sequential Monte Carlo - stage t
(i) For i = 1, . . . , N, sample

Z_t^(i) ∼ ηt(z)

and correct between ηt and πt:

wt(Z_t^(i)) = γt(Z_t^(i)) / ηt(Z_t^(i))

(ii) Normalize to find weights {W_t^(i)}.
(iii) If the effective sample size (ESS) is less than some threshold T, resample the particles and set W_t^(i) = 1/N. Set t = t + 1.
Del Moral et al. SMC algorithm

We can build the proposal distribution ηt(z) from the particles available at time t − 1.

One way to do this is to propose new particles by passing the old particles through a Markov kernel qt(z, z′): for i = 1, . . . , N,

z_t^(i) ∼ qt(z_{t−1}^(i), ·)

This makes ηt(z) = ∫ ηt−1(z′) qt(z′, z) dz′ - which is unknown in general.
Del Moral et al. 2006 showed how to avoid this problem by introducing a sequence of backward kernels, Lt−1.

Del Moral et al. 2006 SMC algorithm - step t
(i) Propagate: extend the particle paths using Markov kernel Qt. For i = 1, . . . , N,

Z_t^(i) ∼ Qt(z_{t−1}^(i), ·)

(ii) Weight: correct between ηt(z_{0:t}) and πt(z_{0:t}). For i = 1, . . . , N,

wt(z_{0:t}^(i)) = γt(z_{0:t}^(i)) / ηt(z_{0:t}^(i))     (1)
              = Wt−1(z_{0:t−1}^(i)) w̃t(z_{t−1}^(i), z_t^(i))     (2)

where

w̃t(z_{t−1}^(i), z_t^(i)) = γt(z_t^(i)) Lt−1(z_t^(i), z_{t−1}^(i)) / [γt−1(z_{t−1}^(i)) Qt(z_{t−1}^(i), z_t^(i))]     (3)

is the incremental weight.

(iii) Normalise the weights to obtain {W_t^(i)}.
(iv) Resample if ESS < T and set W_t^(i) = 1/N for all i. Set t = t + 1.
SMC with partial rejection control (PRC)

We can add in the rejection control idea of Liu:

Del Moral SMC algorithm with Partial Rejection Control - step t
(i) For i = 1, . . . , N
  (a) Sample z∗ from {z_{t−1}^(i)} according to weights W_{t−1}^(i).
  (b) Perturb: z∗∗ ∼ Qt(z∗, ·)
  (c) Weight:

  w∗ = γt(z∗∗) Lt−1(z∗∗, z∗) / [γt−1(z∗) Qt(z∗, z∗∗)]

  (d) PRC: Accept z∗∗ with probability min(1, w∗/ct). If accepted, set z_t^(i) = z∗∗ and w_t^(i) = max(w∗, ct). Otherwise return to (a).
(ii) Normalise the weights to get W_t^(i).
GABC versions of SMC

We need to choose
- Sequence of targets πt
- Forward perturbation kernels Qt
- Backward kernels Lt
- Thresholds ct

By making particular choices for these quantities we can recover many of the published SMC-ABC samplers.
Uniform SMC-ABC

For example,
- let πt be the uniform ABC target using εt,

  πt(D|X) = 1 if ρ(D, X) ≤ εt, and 0 otherwise

- let Qt(z, z′) = qt(θ, θ′)π(x′|θ′)
- let c1 = 1 and ct = 0 for t ≥ 2
- let

  Lt−1(zt, zt−1) = πt−1(zt−1) Qt(zt−1, zt) / πt−1Qt(zt)

  and approximate πt−1Qt(z) = ∫ πt−1(z′) Qt(z′, z) dz′ by

  πt−1Qt(z) ≈ ∑_j W_{t−1}^(j) Qt(z_{t−1}^(j), z)

then the algorithm reduces to Beaumont et al. 2009. We recover the Sisson et al. 2007 algorithm if we add in a further resampling step. Toni et al. 2009 is recovered by including a compulsory resampling step.
Other sequential GABC algorithms

We can combine SMC with MCMC type moves, by using

Lt−1(zt, zt−1) = πt−1(zt−1) Qt(zt−1, zt) / πt−1Qt(zt)

If we then use a πt-invariant Metropolis-Hastings kernel Qt and let

Lt−1(zt, zt−1) = πt(zt−1) Qt(zt−1, zt) / πt(zt)

then we get an ABC resample-move algorithm.
Approximate Resample-Move (with PRC)

RM-GABC
(i) While ESS < N
  (a) Sample z∗ = (θ∗, X∗) from {z_{t−1}^(i)} according to weights W_{t−1}^(i).
  (b) Weight:

  w∗ = wt(X∗) = πt(D|X∗) / πt−1(D|X∗)

  (c) PRC: With probability min(1, w∗/ct), sample

  z_t^(i) ∼ Qt(z∗, ·)

  where Qt is an MCMC kernel with invariant distribution πt. Set i = i + 1. Otherwise, return to (i)(a).
(ii) Normalise the weights to get W_t^(i). Set t = t + 1.

Note that because the incremental weights are independent of zt we are able to swap the perturbation and PRC steps.
Conclusions

- The tolerance ε controls the accuracy of ABC algorithms, and so we desire to take ε as small as possible in many problems (although not always).
- By using efficient sampling algorithms, we can hope to better use the available computational resource to spend more time simulating in regions of parameter space likely to lead to accepted values.
- MCMC and SMC versions of ABC have been developed, along with ABC versions of most other algorithms.
Links to other approaches

History-matching
e.g. Craig et al. 2001, Vernon et al. 2010

ABC can be seen as a probabilistic version of history matching. History matching is used in the analysis of computer experiments to rule out regions of parameter space as implausible.

1 Relate the simulator to the system

ζ = f(θ) + ε

where ε is our simulator discrepancy.
2 Relate the system to the data (e represents measurement error)

D = ζ + e

3 Declare θ implausible if, e.g.,

‖D − E f(θ)‖ > 3σ

where σ² is the combined variance implied by the emulator, discrepancy and measurement error.
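The implausibility cut above can be sketched directly; a toy example with a hypothetical quadratic "simulator" and illustrative variance components (here the emulator variance is taken as zero, i.e. the simulator is run exactly).

```python
import numpy as np

def implausible(D, f_theta, var_obs, var_disc, var_emul=0.0, cutoff=3.0):
    """History-matching rule: theta is implausible if
    |D - E f(theta)| / sqrt(var_obs + var_disc + var_emul) > cutoff."""
    total_sd = np.sqrt(var_obs + var_disc + var_emul)
    return np.abs(D - f_theta) / total_sd > cutoff

theta = np.linspace(-5, 5, 101)
f = theta ** 2                       # toy simulator output at each theta
mask = implausible(D=4.0, f_theta=f, var_obs=0.5, var_disc=0.5)
not_ruled_out = theta[~mask]         # region we cannot rule out at this wave
```

Each wave of history matching shrinks `not_ruled_out` further, typically with a refocused emulator.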
History-matching

If θ is not implausible we don't discard it. The result is a region of space that we can't rule out at this stage of the history-match.

It is usual to go through several stages of history matching.

Notes:
- History matching can be seen as a principled version of ABC - lots of thought goes into the link between simulator and reality.
- The result of history-matching may be that there is no not-implausible region of parameter space.
  - Go away and think harder - something is misspecified.
  - This can also happen in rejection ABC.
  - In contrast, MCMC will always give an answer, even if the model is terrible.
Noisy-ABC

Fearnhead and Prangle (2012) proposed the noisy-ABC algorithm:

Noisy-ABC
Initialise: let D′ = D + e where e ∼ K(e) for some kernel K(·).
1 θi ∼ π(θ) and Xi ∼ π(x|θi).
2 Give (θi, xi) weight wi = K(Xi − D′).

In our notation, replace the observed data D with D′ drawn from the acceptance kernel - D′ ∼ π(D′|D).

The main argument in favour of noisy-ABC is that it is calibrated, unlike standard ABC.

PABC is calibrated if

P(θ ∈ A | Eq(A)) = q

where Eq(A) is the event that the ABC posterior assigns probability q to event A, i.e., given that PABC(A) = q, we are calibrated if A occurs with probability q according to base measure P (defined by prior, simulator likelihood and K).
Noisy ABC

- Noisy ABC is well calibrated. However, this is a frequency property, and so it only becomes relevant if we repeat the analysis with different D′ many times.
- It is highly relevant to filtering problems.

Note that noisy ABC and GABC are trying to do different things:
- Noisy ABC moves the data so that it comes from the model we are assuming when we do inference.
  - It assumes the model π(D|θ) is true and tries to find the true posterior given the noisy data.
- GABC accepts the model is incorrect, and tries to account for this in the inference.
Other algorithms

- Wood 2010 is an ABC algorithm, but using the sample mean µθ and covariance Σθ of the summary of f(θ) run n times at θ, and assuming

π(D|S) = N(D; µθ, Σθ)

- (Generalized Likelihood Uncertainty Estimation) GLUE approach of Keith Beven in hydrology - see Nott and Marshall 2012
- Kalman filtering, see Nott et al. 2012.
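Wood's Gaussian assumption on the summaries (the "synthetic likelihood") can be sketched as follows; a toy example in which the summaries are the sample mean and standard deviation of a hypothetical Gaussian simulator, with a small diagonal jitter added for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(3)

def synthetic_loglik(theta, s_obs, simulate_summaries, n=100):
    """Synthetic likelihood in the spirit of Wood (2010): fit a Gaussian to
    n simulated summary vectors at theta and evaluate s_obs under it."""
    S = np.array([simulate_summaries(theta) for _ in range(n)])   # n x d
    mu = S.mean(axis=0)
    Sigma = np.cov(S, rowvar=False) + 1e-8 * np.eye(S.shape[1])   # jitter
    d = s_obs - mu
    sign, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d @ np.linalg.solve(Sigma, d) + logdet
                   + len(d) * np.log(2 * np.pi))

# toy: summaries are (mean, sd) of 50 draws from N(theta, 1)
def sim_summ(theta):
    x = rng.normal(theta, 1.0, 50)
    return np.array([x.mean(), x.std()])

s_obs = np.array([1.0, 1.0])
ll = [synthetic_loglik(t, s_obs, sim_summ) for t in (0.0, 1.0, 2.0)]
```

This log-likelihood can then be plugged into a standard MCMC sampler over θ.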
The dangers of ABC - H. L. Mencken

'For every complex problem, there is an answer that is short, simple and wrong.'

Why use ABC? J. Galsworthy

'Idealism increases in direct proportion to one's distance from the problem.'
Recap I

Uniform Rejection ABC
- Draw θ from π(θ)
- Simulate X ∼ f(θ)
- Accept θ if ρ(D, X) ≤ ε

We've looked at a variety of more efficient sampling algorithms, e.g. ABC-MCMC, ABC-IS, ABC-SMC. The higher the efficiency, the smaller the tolerance we can use for a given computational expense.
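The three-line rejection algorithm above translates almost directly into code; a toy sketch assuming a scalar Gaussian simulator, with illustrative prior, data and tolerance.

```python
import numpy as np

rng = np.random.default_rng(4)
D = 2.0       # observed data (toy scalar)
eps = 0.3     # tolerance

theta = rng.normal(0.0, 2.0, 50_000)    # draw theta from the prior N(0, 2^2)
x = rng.normal(theta, 1.0)              # simulate X ~ f(theta)
accepted = theta[np.abs(x - D) <= eps]  # accept theta if rho(D, X) <= eps
```

The accepted draws form an (approximate) posterior sample; shrinking `eps` improves accuracy at the cost of the acceptance rate.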
Recap II

Alternative approaches focus on avoiding the curse of dimensionality: if the data are too high dimensional we never observe simulations that are 'close' to the field data.

Approaches include
- Using summary statistics S(D) to reduce the dimension

  Uniform rejection ABC with summaries
  - Draw θ from π(θ)
  - Simulate X ∼ f(θ)
  - Accept θ if ρ(S(D), S(X)) < ε

  If S is sufficient this is equivalent to the previous algorithm.
- Regression adjustment - model and account for the discrepancy between S = S(X) and Sobs = S(D).
Regression Adjustment
References:
Beaumont et al. 2003
Blum and Francois 2010
Blum 2010
Leuenberger and Wegmann 2010
Regression Adjustment
An alternative to rejection-ABC, proposed by Beaumont et al. 2002, usespost-hoc adjustment of the parameter values to try to weaken the effectof the discrepancy between s and sobs .
Two key ideas
use non-parametric kernel density estimation to emphasise the bestsimulations
learn a non-linear model for the conditional expectation E(θ|s) as afunction of s and use this to learn the posterior at sobs .
Idea 1 - kernel regression

Suppose we want to estimate

E(θ|sobs) = ∫ θ π(θ, sobs)/π(sobs) dθ

using pairs {θi, si} from π(θ, s).

Approximating the two densities using kernel density estimates

π̂(θ, s) = (1/n) ∑_i K(s − si)K(θ − θi),    π̂(s) = (1/n) ∑_i K(s − si)

and substituting gives the Nadaraya-Watson estimator:

E(θ|sobs) ≈ ∑_i K(sobs − si)θi / ∑_i K(sobs − si)

as ∫ yK(y − a)dy = a.
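The Nadaraya-Watson estimator above is a one-liner once we have the (θi, si) pairs; a toy sketch with a Gaussian kernel and a hypothetical Gaussian link between summary and parameter.

```python
import numpy as np

rng = np.random.default_rng(5)

def nadaraya_watson(s_obs, theta, s, bandwidth):
    """Kernel-weighted average:
    E(theta | s_obs) ~ sum_i K(s_obs - s_i) theta_i / sum_i K(s_obs - s_i),
    here with a Gaussian kernel K."""
    K = np.exp(-0.5 * ((s_obs - s) / bandwidth) ** 2)
    return np.sum(K * theta) / np.sum(K)

theta = rng.uniform(0, 4, 5000)     # draws from a toy prior
s = rng.normal(theta, 0.5)          # summary simulated at each theta
est = nadaraya_watson(2.0, theta, s, bandwidth=0.2)
```

The same kernel weights K(sobs − si), normalised, give the weighted-particle view of the ABC posterior discussed below.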
- Beaumont et al. 2002 suggested using the Epanechnikov kernel

Kε(x) = (c/ε)[1 − (x/ε)²] I(|x| ≤ ε)

as it has finite support - we discard the majority of simulations. They recommend ε be set by deciding on the proportion of simulations to keep, e.g. the best 5%.

- This expression also arises if we view {θi, Wi}, with Wi = Kε(sobs − si) ≡ πε(sobs|si), as a weighted particle approximation to the posterior

π̂(θ|sobs) = ∑ wi δθi(θ)

where wi = Wi/∑ Wj are normalised weights.

- The Nadaraya-Watson estimator suffers from the curse of dimensionality - its rate of convergence drops rapidly as the dimension of s increases.
Idea 2 - regression adjustments

Consider the relationship between the conditional expectation of θ and s:

E(θ|s) = m(s)

Think of this as a model for the conditional density π(θ|s): for fixed s,

θi = m(s) + ei

where θi ∼ π(θ|s) and the ei are zero-mean and uncorrelated.

Suppose we've estimated m(s) by m̂(s) from samples {θi, si}. Estimate the posterior mean by

E(θ|sobs) ≈ m̂(sobs)

and, assuming constant variance (wrt s), we can form the empirical residuals

êi = θi − m̂(si)

and approximate the posterior π(θ|sobs) by adjusting the parameters:

θ∗i = m̂(sobs) + êi = θi + (m̂(sobs) − m̂(si))
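The adjustment θ∗i = θi + (m̂(sobs) − m̂(si)) can be sketched with an ordinary least-squares fit for m̂; a toy example with a global linear model and a crude rejection window, all values illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

theta = rng.uniform(0, 4, 2000)
s = rng.normal(theta, 0.5)          # one summary per parameter draw
s_obs = 2.0

# fit a global linear model m(s) = a + b*s by least squares
b, a = np.polyfit(s, theta, 1)
m = lambda s: a + b * s

# crude rejection step, then shift each kept parameter by the fitted-mean gap
window = np.abs(s - s_obs) <= 1.0
theta_adj = theta[window] + (m(s_obs) - m(s[window]))
```

The adjusted sample is centred on m̂(sobs) and has a smaller spread than the raw accepted sample, since the component of variation explained by s has been removed.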
[Scatter plot 'ABC and regression adjustment': θ plotted against S, with the acceptance window sobs − ε to sobs + ε marked and the points falling inside it highlighted in red.]
In rejection ABC, the red points are used to approximate the histogram.
[The same scatter plot of θ against S, now with the fitted regression line m̂(s) added and the accepted points within sobs − ε to sobs + ε adjusted along it.]
In rejection ABC, the red points are used to approximate the histogram. Using regression-adjustment, we use the estimate of the posterior mean at sobs and the residuals from the fitted line to form the posterior.
Models

Beaumont et al. 2003 used a local linear model for m(s) in the vicinity of sobs,

m(si) = α + βT si

fit by minimising

∑ (θi − m(si))² Kε(si − sobs)

so that observations nearest to sobs are given more weight in the fit.

The empirical residuals are then weighted so that the approximation to the posterior is a weighted particle set {θ∗i, Wi = Kε(si − sobs)}:

π̂(θ|sobs) = ∑ wi δθ∗i(θ)
Normal-normal conjugate model, linear regression

[Density plot of the posteriors for θ: ABC, true posterior, and regression-adjusted ABC.]

200 data points in both approximations. The regression-adjusted ABC gives a more confident posterior, as the θi have been adjusted to account for the discrepancy between si and sobs.
Extensions: Non-linear models

Blum and Francois 2010 proposed a nonlinear heteroscedastic model

θi = m(si) + σ(si)ei

where m(s) = E(θ|s) and σ²(s) = Var(θ|s). They used feed-forward neural networks for both the conditional mean and variance.

Blum and François (2009) suggest the use of non-linear conditional heteroscedastic regression models:

θ∗i = m̂(sobs) + (θi − m̂(si)) σ̂(sobs)/σ̂(si)

Picture from Michael Blum, www.ceremade.dauphine.fr/ xian/ABCOF.pdf
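The heteroscedastic adjustment above can be sketched with crude polynomial fits standing in for the neural networks; a toy example in which the simulator noise grows with θ, with the conditional variance estimated from squared residuals. All modelling choices here are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(7)

theta = rng.uniform(0, 4, 3000)
s = rng.normal(theta, 0.2 + 0.2 * theta)   # noise level grows with theta
s_obs = 2.0

# crude estimate of m(s) = E(theta | s): quadratic least-squares fit
m = np.poly1d(np.polyfit(s, theta, 2))

# crude estimate of sigma^2(s) = Var(theta | s): fit squared residuals on s
r2 = (theta - m(s)) ** 2
var_fit = np.poly1d(np.polyfit(s, r2, 2))
sigma = lambda s: np.sqrt(np.maximum(var_fit(s), 1e-6))   # floor for safety

# heteroscedastic adjustment of the kept parameters
keep = np.abs(s - s_obs) <= 1.0
theta_adj = m(s_obs) + (theta[keep] - m(s[keep])) * sigma(s_obs) / sigma(s[keep])
```

Rescaling the residuals by σ̂(sobs)/σ̂(si) corrects for the varying spread of θ given s, which the constant-variance adjustment ignores.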
Discussion

- These methods allow us to use larger tolerance values and can substantially improve posterior accuracy with less computation. However, sequential algorithms cannot easily be adapted, and so these methods tend to be used with simple rejection sampling.
- Many people choose not to use these methods, as they can give poor results if the model is badly chosen.
- Modelling the variance is hard, so transformations to make the relationship θ = m(s) + e as homoscedastic as possible (such as Box-Cox transformations) are usually applied.
- Blum 2010 contains estimates of the bias and variance of these estimators. They show the properties of the ABC estimators may seriously deteriorate as dim(s) increases ...
Summary Statistics
References:
Blum, Nunes, Prangle and Sisson 2012
Joyce and Marjoram 2008
Nunes and Balding 2010
Fearnhead and Prangle 2012
Robert et al. 2011
Error trade-off
Blum, Nunes, Prangle, Fearnhead 2012

The error in the ABC approximation can be broken into two parts:
1 Choice of summary:

π(θ|D) ≈? π(θ|S(D))

2 Use of ABC acceptance kernel:

π(θ|sobs) ≈? πABC(θ|sobs) = ∫ π(θ, s|sobs) ds ∝ ∫ πε(sobs|S(x)) π(x|θ) π(θ) dx

The first approximation allows the matching between S(D) and S(X) to be done in a lower dimension. There is a trade-off:
- dim(S) small: π(θ|sobs) ≈ πABC(θ|sobs), but π(θ|sobs) ≉ π(θ|D)
- dim(S) large: π(θ|sobs) ≈ π(θ|D), but π(θ|sobs) ≉ πABC(θ|sobs), as the curse of dimensionality forces us to use larger ε
Choosing summary statistics

If S(D) = sobs is sufficient for θ, i.e., sobs contains all the information contained in D about θ,

π(θ|sobs) = π(θ|D),

then using summaries has no detrimental effect.

However, low-dimensional sufficient statistics are rarely available. How do we choose good low dimensional summaries?
- The choice is one of the most important parts of ABC algorithms
- Insights from ML methods?
Automated summary selection
Blum, Nunes, Prangle and Fearnhead 2012

Suppose we are given a candidate set S = (s1, . . . , sp) of summaries from which to choose. Methods break down into groups:
- Best subset selection
  - Joyce and Marjoram 2008
  - Nunes and Balding 2010
- Projection
  - Blum and Francois 2010
  - Fearnhead and Prangle 2012
- Regularisation techniques
  - Blum, Nunes, Prangle and Fearnhead 2012
Best subset selection

Introduce a criterion, e.g.,
- τ-sufficiency (Joyce and Marjoram 2008): s1:k−1 are τ-sufficient relative to sk if

δk = sup_θ log π(sk|s1:k−1, θ) − inf_θ log π(sk|s1:k−1, θ)
   = range_θ (log π(s1:k|θ) − log π(s1:k−1|θ)) ≤ τ

i.e. adding sk changes the posterior sufficiently.
- Entropy (Nunes and Balding 2010)

Implement within a search algorithm such as forward selection.

Problems:
- assumes every change to the posterior is beneficial (see below)
- considerable computational effort required to compute δk
Projection

Several statistics from S may be required to get the same information content as a single informative summary.
- Project S onto a lower dimensional highly informative summary vector.

Most authors aim to find summaries so that

πABC(θ|s) ≈ π(θ|D)

Fearnhead and Prangle 2012 weaken this requirement and instead aim to find summaries that lead to good parameter estimates. They seek to minimise the expected posterior loss

E((θtrue − θ̂)²|D)  ⟹  θ̂ = E(θ|D)

They show that the optimal summary statistic is

s = E(θ|D)
Fearnhead and Prangle 2012

However, E(θ|D) will not usually be known. Instead, we can estimate it using the model

θi = E(θ|D) + ei = βT f(si) + ei

where f(s) is a vector of functions of s and the (θi, si) are output from a pilot ABC simulation. They choose the set of regressors using, e.g., BIC.

They then use the single summary statistic

ŝ = β̂T f(s)

for θ.

Advantages
- Scales well with large p and gives good point estimates.

Disadvantages
- Summaries usually lack interpretability and the method gives no guarantees about the approximation of the posterior.
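The construction above can be sketched end-to-end; a toy example with a Gaussian simulator and a hypothetical candidate set f(s) = (mean, sd, median), fitted by plain least squares rather than BIC-based selection.

```python
import numpy as np

rng = np.random.default_rng(8)

# pilot run: simulate (theta_i, data_i) pairs and build candidate summaries f(s)
theta = rng.uniform(0, 4, 3000)
x = rng.normal(theta[:, None], 1.0, size=(3000, 20))         # 20 obs per dataset
F = np.column_stack([x.mean(1), x.std(1), np.median(x, 1)])  # candidate f(s_i)

# linear regression theta_i ~ beta^T f(s_i): the fitted value is the summary
A = np.column_stack([np.ones(len(F)), F])
beta, *_ = np.linalg.lstsq(A, theta, rcond=None)

def summary(data):
    """Single scalar summary approximating E(theta | data)."""
    return beta @ np.concatenate(([1.0], [data.mean(), data.std(), np.median(data)]))

x_obs = rng.normal(2.0, 1.0, 20)   # toy observed dataset with true theta = 2
s_obs = summary(x_obs)
```

The scalar `summary` then replaces the full candidate set in a second, standard ABC run.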
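A sketch of the Fearnhead-Prangle construction, assuming a toy normal simulator and the linear basis f(s) = (1, s). Both choices are illustrative; richer bases chosen by, e.g., BIC are also possible:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy simulator (an illustrative assumption): 5 iid N(theta, 1) draws,
# with the order statistics as the raw summary vector S.
def simulate(theta, rng):
    return np.sort(rng.normal(theta, 1.0, size=5))

# Pilot ABC run: sample (theta_i, s_i) pairs from the prior predictive.
n_pilot = 2000
thetas = rng.uniform(-5, 5, size=n_pilot)          # prior draws
S = np.array([simulate(t, rng) for t in thetas])

# Fit theta_i = beta^T f(s_i) + e_i with f(s) = (1, s) by least squares.
F = np.column_stack([np.ones(n_pilot), S])
beta, *_ = np.linalg.lstsq(F, thetas, rcond=None)

# The fitted value beta^T f(s) is the semi-automatic summary statistic,
# an estimate of E(theta | s).
def summary(s):
    return np.concatenate([[1.0], s]) @ beta

s_new = summary(simulate(2.0, rng))  # summary of a fresh dataset, true theta = 2
```

The single scalar summary(s) then replaces the 5-dimensional raw summary in a standard rejection-ABC run.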
Summary warning:
Automated methods are a poor replacement for expert knowledge.
Instead of automation, ask: what aspects of the data do we expect our model to be able to reproduce? S(D) may be highly informative about θ, but if the model was not built to reproduce S(D), then why should we calibrate to it?
For example, many dynamical systems models are designed to model periods and amplitudes. Summaries that are not phase invariant may be informative about θ, but this information is misleading.
In cases where the model and/or prior are mis-specified, this problem can be particularly acute.
The rejection algorithm is usually used in summary selection algorithms, as otherwise we would need to rerun the MCMC or SMC sampler for each new summary, which is very expensive.
Model selection (Wilkinson 2007, Grelaud et al. 2009)
Ratmann et al. 2009 proposed methodology for testing the fit of a model without reference to other models.
But often we want to compare models → Bayes factors
B12 = π(D|M1) / π(D|M2)
where π(D|Mi) = ∫ πε(D|x) π(x|θ, Mi) π(θ) dx dθ.
For rejection ABC,
π(D) ≈ (1/N) Σ πε(D|xi),
which reduces to the acceptance rate for uniform ABC (Wilkinson 2007).
Or add an initial step into the rejection algorithm where we first pick a model; comparing the ratio of acceptance rates then directly targets the Bayes factor.
See Toni et al. 2009 for an SMC-ABC approach.
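The acceptance-rate idea can be sketched as follows. The two parameter-free Gaussian "models", tolerance and uniform kernel are assumptions made to keep the example short; with equal prior model probabilities the ratio of acceptance rates estimates B12:

```python
import numpy as np

rng = np.random.default_rng(0)
d_obs, eps, N = 0.0, 0.1, 200_000

# Two toy models with no free parameters (an assumption for brevity):
# M1 simulates N(0, 1) data, M2 simulates N(0, 3^2) data.
x1 = rng.normal(0.0, 1.0, size=N)
x2 = rng.normal(0.0, 3.0, size=N)

# With a uniform kernel, each acceptance rate estimates pi_eps(D | M_i)
# up to the same constant, so their ratio estimates the Bayes factor B12
# (here the true value is the density ratio at d_obs, i.e. 3).
acc1 = np.mean(np.abs(x1 - d_obs) < eps)
acc2 = np.mean(np.abs(x2 - d_obs) < eps)
B12 = acc1 / acc2
```

In practice each model would also have parameters drawn from its prior before simulating, exactly as in the rejection algorithm for a single model.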
Summary statistics for model selection (Didelot et al. 2011, Robert et al. 2011)
Care needs to be taken with regard to summary statistics for model selection. Everything is fine if we target
BS = π(S(D)|M1) / π(S(D)|M2).
Then the ABC estimator B^ε_S → BS as ε → 0, N → ∞ (Didelot et al. 2011).
However,
π(S(D)|M1) / π(S(D)|M2) ≠ π(D|M1) / π(D|M2) = BD,
even if S is a sufficient statistic! S being sufficient for f1(D|θ1) and f2(D|θ2) does not imply sufficiency for the joint model {m, fm(D|θm)}. Hence B^ε_S ↛ BD.
Note: there is no problem if we view inference as conditional on a carefully chosen S.
See Prangle et al. 2013 for automatic selection of summaries for model selection.
Choice of metric ρ
Consider the following system:
Xt+1 = f(Xt) + N(0, σ²)   (4)
Yt = g(Xt) + N(0, τ²)   (5)
where we want to estimate the measurement error τ and the model error σ.
A default choice of metric is (something similar to)
ρ(Y, yobs) = Σt (yobs,t − Yt)²,
or the CRPS (a proper scoring rule)
ρ(yobs, F(·)) = Σt crps(yobs,t, Ft(·)) = Σt ∫ (Ft(u) − I{yt ≤ u})² du,
where Ft(·) is the distribution function of Yt | y1:t−1.
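A minimal empirical version of the CRPS metric, using the standard identity crps(y, F) = E|X − y| − (1/2)E|X − X′| for an ensemble approximation of Ft. Representing each Ft by a simulated ensemble is an assumption of this sketch:

```python
import numpy as np

def crps(y, ensemble):
    # Empirical CRPS of an ensemble {X_j} approximating F_t, via the
    # identity crps(y, F) = E|X - y| - 0.5 E|X - X'|.
    x = np.asarray(ensemble, dtype=float)
    return np.mean(np.abs(x - y)) - 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))

def rho(y_obs, ensembles):
    # Sum the per-time-step CRPS, mirroring the scoring-rule metric above.
    return sum(crps(y, ens) for y, ens in zip(y_obs, ensembles))
```

Unlike the squared-error metric, this distance rewards simulators whose whole predictive distribution Ft, not just the point forecast, matches the observation.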
[Two figures: posterior samples plotted as τ² against σ², both axes from 0 to 10, one panel for each choice of metric.]
GP-accelerated ABC
Problems with Monte Carlo methods
Monte Carlo methods are generally guaranteed to succeed if we run themfor long enough.
This guarantee comes at a cost.
Most methods sample naively: they don't learn from previous simulations.
They don't exploit known properties of the likelihood function, such as continuity.
They sample randomly, rather than using space-filling designs.
This naivety can make a full analysis infeasible without access to a largeamount of computational resource.
Likelihood estimation
The GABC framework assumes
π(D|θ) = ∫ π(D|X) π(X|θ) dX ≈ (1/N) Σ π(D|Xi),
where Xi ∼ π(X|θ).
For many problems, we believe the likelihood is continuous and smooth, so that π(D|θ) is similar to π(D|θ′) when θ − θ′ is small.
We can model L(θ) = π(D|θ) and use the model to find the posterior in place of running the simulator.
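The Monte Carlo estimator above can be sketched directly. The simulator, acceptance kernel π(D|X) and tolerance below are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
d_obs = 1.0

# Toy choices (assumptions for illustration): simulator X | theta ~ N(theta, 1)
# and a Gaussian acceptance kernel pi(D | X) = N(D; X, eps^2).
def pi_D_given_X(x, eps=0.2):
    return np.exp(-0.5 * ((d_obs - x) / eps) ** 2) / (eps * np.sqrt(2 * np.pi))

def L_hat(theta, N=10_000):
    x = rng.normal(theta, 1.0, size=N)   # X_i ~ pi(X | theta)
    return np.mean(pi_D_given_X(x))      # (1/N) sum_i pi(D | X_i)

# Smoothness in theta: nearby parameter values give similar likelihood
# estimates, which is what modelling L(theta) with a GP exploits.
l1, l2 = L_hat(1.0), L_hat(1.05)
```

Each L_hat evaluation costs N simulator runs, which is exactly why modelling L(θ) from a few evaluations is attractive.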
History matching waves
The likelihood is too difficult to model directly, so we model the log-likelihood instead:
G(θ) = log L(θ), where L̂(θi) = (1/N) Σ π(D|Xi), Xi ∼ π(X|θi).
However, the log-likelihood for a typical problem varies across too wide a range of values.
Consequently, any Gaussian process model will struggle to model the log-likelihood across the entire input range.
Introduce waves of history matching, similar to those used in Michael Goldstein's work.
In each wave, build a GP model that can rule out regions of space as implausible.
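A minimal one-dimensional sketch of a single history-matching wave, assuming a toy log-likelihood surface and a hand-rolled squared-exponential GP. The rule "mean + 3 sd above a cutoff" is one common implausibility criterion, used here for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Wave-1 design: noisy log-likelihood evaluations at a handful of theta
# values (the quadratic surface is an assumed stand-in for log L-hat).
theta_d = np.linspace(-3, 3, 10)
logL = -2.0 * theta_d**2 + 0.05 * rng.normal(size=theta_d.size)

# Hand-rolled GP regression with a squared-exponential kernel.
def k(a, b, ell=1.0, s2=4.0):
    return s2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

K = k(theta_d, theta_d) + 1e-4 * np.eye(theta_d.size)   # small nugget
grid = np.linspace(-3, 3, 61)
Ks = k(grid, theta_d)
mean = Ks @ np.linalg.solve(K, logL)
var = np.diag(k(grid, grid) - Ks @ np.linalg.solve(K, Ks.T))

# Rule out theta where even an optimistic prediction (mean + 3 sd) falls
# below a log-likelihood cutoff; the survivors go forward to the next wave.
cutoff = logL.max() - 10.0
plausible = mean + 3 * np.sqrt(np.clip(var, 0.0, None)) > cutoff
```

The next wave would place a fresh design only in the plausible region and refit the GP there, where the log-likelihood varies over a much narrower range.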
Results - Design 1 - 128 pts
Diagnostics for GP 1 - threshold = 5.6
Results - Design 2 - 314 pts - 38% of space implausible
Diagnostics for GP 2 - threshold = -21.8
Design 3 - 149 pts - 62% of space implausible
Diagnostics for GP 3 - threshold = -20.7
Design 4 - 400 pts - 95% of space implausible
Diagnostics for GP 4 - threshold = -16.4
MCMC Results
[Figure: marginal posterior densities for the parameters r, sig.e and phi; black = Wood's MCMC posterior, green = GP posterior.]
Computational details
The Wood MCMC method used 10^5 × 500 simulator runs.
The GP code used (128 + 314 + 149 + 400) = 991 × 500 simulator runs, roughly 1/100th of the number used by Wood's method.
By the final iteration, the Gaussian processes had ruled out over 98% of the original input space as implausible, so the MCMC sampler did not need to waste time exploring those regions.
Conclusions
ABC allows inference in models for which it would otherwise be impossible.
It is not a silver bullet: if likelihood-based methods are possible, use them instead.
Algorithms and post-hoc regression can greatly improve computational efficiency, but computation is still usually the limiting factor.
The challenge is to develop more efficient methods to allow inference in more expensive models.
Areas for improvement (particularly those relevant to ML)?
Automatic summary selection and dimension reduction
Improved modelling in regression adjustments
Learning of model error πε(D|X )
Accelerated inference via likelihood modelling
Use of variational methods
. . .
Thank you for listening!
[email protected], www.maths.nottingham.ac.uk/personal/pmzrdw/
References - basics
Included in order of appearance in the tutorial, rather than importance! Far from exhaustive - apologies to those I've missed.
Murray, Ghahramani, MacKay, NIPS, 2012
Tanaka, Francis, Luciani and Sisson, Genetics 2006.
Wilkinson, Tavare, Theoretical Population Biology, 2009.
Neal and Huang, arXiv, 2013.
Beaumont, Zhang, Balding, Genetics 2002
Tavare, Balding, Griffiths, Genetics 1997
Diggle, Gratton, JRSS Ser. B, 1984
Rubin, Annals of Statistics, 1984
Wilkinson, SAGMB 2013.
Fearnhead and Prangle, JRSS Ser. B, 2012
Kennedy and O’Hagan, JRSS Ser. B, 2001
References - algorithms
Marjoram, Molitor, Plagnol, Tavare, PNAS, 2003
Sisson, Fan, Tanaka, PNAS, 2007
Beaumont, Cornuet, Marin, Robert, Biometrika, 2008
Toni, Welch, Strelkowa, Ipsen, Stumpf, Interface, 2009.
Del Moral, Doucet, Stat. Comput. 2011
Drovandi, Pettitt, Biometrics, 2011.
Lee, Proc 2012 Winter Simulation Conference, 2012.
Lee, Latuszynski, arXiv, 2013.
Del Moral, Doucet, Jasra, JRSS Ser. B, 2006.
Sisson and Fan, Handbook of MCMC, 2011.
References - links to other algorithms
Craig, Goldstein, Rougier, Seheult, JASA, 2001
Fearnhead and Prangle, JRSS Ser. B, 2011.
Wood Nature, 2010
Nott and Marshall, Water resources research, 2012
Nott, Fan, Marshall and Sisson, arXiv, 2012.
GP-ABC:
Wilkinson, arXiv, 2013
Meeds and Welling, arXiv, 2013.
References - regression adjustment
Beaumont, Zhang, Balding, Genetics, 2002
Blum, Francois, Stat. Comput. 2010
Blum, JASA, 2010
Leuenberger, Wegmann, Genetics, 2010
References - summary statistics
Blum, Nunes, Prangle, Sisson, Stat. Sci., 2012
Joyce and Marjoram, Stat. Appl. Genet. Mol. Biol., 2008
Nunes and Balding, Stat. Appl. Genet. Mol. Biol., 2010
Fearnhead and Prangle, JRSS Ser. B, 2011
Wilkinson, PhD thesis, University of Cambridge, 2007
Grelaud, Robert, Marin, Comptes Rendus Mathematique, 2009
Robert, Cornuet, Marin, Pillai, PNAS, 2011
Didelot, Everitt, Johansen, Lawson, Bayesian analysis, 2011.