Approximate Bayesian computation (ABC)
NIPS Tutorial
Richard Wilkinson ([email protected])
School of Mathematical Sciences, University of Nottingham
December 5, 2013
Computer experiments
Rohrlich (1991): Computer simulation is
‘a key milestone somewhat comparable to the milestone that started the empirical approach (Galileo) and the deterministic mathematical approach to dynamics (Newton and Laplace)’
Challenges for statistics: how do we make inferences about the world from a simulation of it?
how do we relate simulators to reality? (model error)
how do we estimate tunable parameters? (calibration)
how do we deal with computational constraints? (stat. comp.)
how do we make uncertainty statements about the world thatcombine models, data and their corresponding errors? (UQ)
Calibration
For most simulators we specify parameters θ and initial conditions, and the simulator, f(θ), generates output X.
We are interested in the inverse problem: we observe data D and want to estimate parameter values θ which explain this data.
For Bayesians, this is a question of finding the posterior distribution
π(θ|D) ∝ π(θ)π(D|θ)
posterior ∝prior× likelihood
Intractability
π(θ|D) = π(D|θ)π(θ) / π(D)
The usual intractability in Bayesian inference is not knowing π(D).
A problem is doubly intractable if π(D|θ) = c_θ p(D|θ) with c_θ unknown (cf. Murray, Ghahramani and MacKay 2006).
A problem is completely intractable if π(D|θ) is unknown and can't be evaluated ('unknown' is subjective), i.e., if the analytic distribution of the simulator, f(θ), run at θ is unknown.
Completely intractable models are where we need to resort to ABC methods.
Common example
Tanaka et al. 2006, Wilkinson et al. 2009, Neal and Huang 2013, etc.
Many models have unobserved branching processes that lead to the data, making calculation difficult. For example, the density of the cumulative process is unknown in general.
Approximate Bayesian Computation (ABC)
Given a complex simulator for which we can't calculate the likelihood function, how do we do inference?
If it's cheap to simulate, then ABC (approximate Bayesian computation) is one of the few approaches we can use.
ABC algorithms are a collection of Monte Carlo methods used for calibrating simulators
they do not require explicit knowledge of the likelihood function
inference is done using simulation from the model (they are 'likelihood-free').
Approximate Bayesian computation (ABC)
ABC methods are primarily popular in biological disciplines, particularly genetics and epidemiology, and their use looks set to continue growing.
Simple to implement
Intuitive
Embarrassingly parallelizable
Can usually be applied
ABC methods can be crude but they have an important role to play.
First ABC paper candidates
Beaumont et al. 2002
Tavare et al. 1997 or Pritchard et al. 1999
Or Diggle and Gratton 1984 or Rubin 1984
. . .
Tutorial Plan
Part I
i. Basics
ii. Efficient algorithms
iii. Links to other approaches
Part II
iv. Regression adjustments/ post-hoc corrections
v. Summary statistics
vi. Accelerating ABC using Gaussian processes
Basics
‘Likelihood-Free’ Inference
Rejection Algorithm
Draw θ from prior π(·)
Accept θ with probability π(D | θ)
Accepted θ are independent draws from the posterior distribution, π(θ | D).
If the likelihood, π(D|θ), is unknown:
‘Mechanical’ Rejection Algorithm
Draw θ from π(·)
Simulate X ∼ f(θ) from the computer model
Accept θ if D = X , i.e., if computer output equals observation
The acceptance rate is ∫ P(D|θ)π(θ) dθ = P(D).
The number of runs needed to get n observations is negative binomial, with mean n/P(D) ⇒ Bayes factors!
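For discrete data, exact matches X = D occur with positive probability, so the acceptance rate of the mechanical rejection algorithm is itself a Monte Carlo estimate of P(D) (and ratios of such rates across models give Bayes factors). A minimal sketch, with an assumed Binomial(10, θ) simulator and U[0, 1] prior chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
D = 3                      # observed count (discrete, so exact matches occur)
n_runs = 20000

def simulator(theta):
    # Illustrative discrete simulator: X ~ Binomial(10, theta)
    return rng.binomial(10, theta)

accepted = []
for _ in range(n_runs):
    theta = rng.uniform()              # draw theta from the U[0, 1] prior
    if simulator(theta) == D:          # accept iff X = D exactly
        accepted.append(theta)

p_D_hat = len(accepted) / n_runs       # acceptance rate estimates P(D)
```

Here the true P(D) is ∫ C(10,3) θ³(1−θ)⁷ dθ = 1/11 ≈ 0.091, so the acceptance rate should settle near that value.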
Rejection ABC
If P(D) is small (or D continuous), we will rarely accept any θ. Instead, there is an approximate version:
Uniform Rejection Algorithm
Draw θ from π(θ)
Simulate X ∼ f (θ)
Accept θ if ρ(D,X ) ≤ ε
This generates observations from π(θ | ρ(D, X) ≤ ε):
As ε → ∞, we get observations from the prior, π(θ).
If ε = 0, we generate observations from π(θ | D).
ε reflects the tension between computability and accuracy.
For reasons that will become clear later, we call this uniform-ABC.
ε = 10
[Figure: scatter of θ against simulator output D showing the acceptance band D ± ε, with the resulting ABC posterior density for θ plotted against the true posterior.]
θ ∼ U[−10, 10], X ∼ N(2(θ + 2)θ(θ − 2), 0.1 + θ2)
ρ(D,X ) = |D − X |, D = 2
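The toy model above is simple enough to run uniform rejection ABC directly. A minimal sketch (assuming the second argument of N(·, ·) denotes a variance):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulator(theta):
    # One draw of X ~ N(2(theta+2)theta(theta-2), 0.1 + theta^2)
    mean = 2 * (theta + 2) * theta * (theta - 2)
    var = 0.1 + theta ** 2
    return rng.normal(mean, np.sqrt(var))

def uniform_rejection_abc(D, eps, n_samples):
    accepted = []
    while len(accepted) < n_samples:
        theta = rng.uniform(-10, 10)   # draw theta from the U[-10, 10] prior
        X = simulator(theta)           # simulate X ~ f(theta)
        if abs(D - X) <= eps:          # accept if rho(D, X) = |D - X| <= eps
            accepted.append(theta)
    return np.array(accepted)

samples = uniform_rejection_abc(D=2.0, eps=1.0, n_samples=500)
```

Shrinking eps towards 0 sharpens the approximation at the cost of a lower acceptance rate, which is exactly the computability/accuracy tension shown in the figures.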
ε = 7.5
[Figure: θ vs D scatter with acceptance band D ± ε, and ABC posterior density against the true posterior, for ε = 7.5.]
ε = 5
[Figure: θ vs D scatter with acceptance band D ± ε, and ABC posterior density against the true posterior, for ε = 5.]
ε = 2.5
[Figure: θ vs D scatter with acceptance band D ± ε, and ABC posterior density against the true posterior, for ε = 2.5.]
ε = 1
[Figure: θ vs D scatter with acceptance band D ± ε, and ABC posterior density against the true posterior, for ε = 1.]
Rejection ABC
If the data are too high dimensional we never observe simulations that are 'close' to the field data (curse of dimensionality).
Reduce the dimension using summary statistics, S(D).
Approximate Rejection Algorithm With Summaries
Draw θ from π(θ)
Simulate X ∼ f (θ)
Accept θ if ρ(S(D), S(X )) < ε
If S is sufficient this is equivalent to the previous algorithm.
Simple → Popular with non-statisticians
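A sketch of the summaries version. The simulator (100 iid N(θ, 1) draws), the U[−5, 5] prior, the choice S(x) = (mean, sd), and the tolerance are all illustrative assumptions, not part of the tutorial:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulator(theta, n=100):
    # Illustrative simulator: n iid N(theta, 1) draws
    return rng.normal(theta, 1.0, size=n)

def summary(x):
    # S(x): reduce a dataset to (mean, standard deviation)
    return np.array([x.mean(), x.std()])

D = simulator(0.5)          # stand-in for field data
sD = summary(D)

accepted = []
while len(accepted) < 200:
    theta = rng.uniform(-5, 5)              # prior draw
    sX = summary(simulator(theta))          # summarise simulator output
    if np.linalg.norm(sD - sX) < 0.3:       # accept if rho(S(D), S(X)) < eps
        accepted.append(theta)
accepted = np.array(accepted)
```

Comparing two-dimensional summaries rather than 100-dimensional datasets keeps the acceptance rate workable; the cost is the approximation incurred when S is not sufficient.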
Two ways of thinking
We think about linear regression in two ways:
Algorithmic: find the straight line that minimizes the sum of squared errors
Probabilistic: a linear model with Gaussian errors, fit using MAP estimates.
Kalman filter:
Algorithmic: linear quadratic estimation: find the best guess at the trajectory using linear dynamics and a quadratic penalty function
Probabilistic: the (Bayesian) solution to the linear Gaussian filtering problem.
The same dichotomy exists for ABC.
Algorithmic: find a good metric, tolerance, summary, etc.
Probabilistic: what model does ABC correspond to, and how should this inform our choices?
Modelling interpretation - Calibration framework
Wilkinson 2008/2013
We can show that ABC is "exact", but for a different model to that intended. πABC(D|θ) is not just the simulator likelihood function:

πABC(D|θ) = ∫ πε(D|x) π(x|θ) dx

πε(D|x) is a pdf relating the simulator output to reality; call it the acceptance kernel.
π(x|θ) is the likelihood function of the simulator (i.e., not relating to reality).
Common way of thinking (Kennedy and O'Hagan 2001):
Relate the best simulator run (X = f(θ)) to reality ζ.
Relate reality ζ to the observations D.

θ → f(θ) → ζ → D
(sim error) (meas error)
Calibration framework
The posterior is

πABC(θ|D) = (1/Z) ∫ πε(D|x) π(x|θ) dx · π(θ)

where Z = ∫∫ πε(D|x) π(x|θ) π(θ) dx dθ.

To simplify matters, we can work in the joint (θ, x) space:

πABC(θ, x|D) = πε(D|x) π(x|θ) π(θ) / Z

NB: we can allow πε(D|x) to depend on θ.
How does ABC relate to calibration?
Consider how this relates to ABC:

πABC(θ, x) := π(θ, x|D) = πε(D|x) π(x|θ) π(θ) / Z

Let's sample from this using the rejection algorithm with instrumental distribution

g(θ, x) = π(x|θ) π(θ)

Note: supp(πABC) ⊆ supp(g), and there exists a constant M = max_x πε(D|x) / Z such that

πABC(θ, x) ≤ M g(θ, x) ∀ (θ, x)
Generalized ABC (GABC)
Wilkinson 2008, Fearnhead and Prangle 2012
The rejection algorithm then becomes

Generalized rejection ABC (Rej-GABC)
1. θ ∼ π(θ) and X ∼ π(x|θ) (i.e., (θ, X) ∼ g(·))
2. Accept (θ, X) if

U ∼ U[0, 1] ≤ πABC(θ, x) / (M g(θ, x)) = πε(D|X) / max_x πε(D|x)

In uniform ABC we take

πε(D|X) = 1 if ρ(D, X) ≤ ε, and 0 otherwise

which reduces the algorithm to
2′. Accept θ if ρ(D, X) ≤ ε
i.e., we recover the uniform ABC algorithm.
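A sketch of Rej-GABC with a Gaussian acceptance kernel πε(D|x) ∝ exp(−(D − x)²/2ε²), for which the ratio to its maximum has a closed form. The simulator (X ∼ N(θ, 1)) and the N(0, 3²) prior are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
D, eps = 2.0, 0.5

def simulator(theta):
    # Illustrative simulator: X ~ N(theta, 1)
    return rng.normal(theta, 1.0)

def accept_prob(x):
    # pi_eps(D|x) / max_x pi_eps(D|x) for a Gaussian acceptance kernel
    return np.exp(-(D - x) ** 2 / (2 * eps ** 2))

accepted = []
while len(accepted) < 300:
    theta = rng.normal(0.0, 3.0)            # prior draw, N(0, 3^2)
    x = simulator(theta)
    if rng.uniform() <= accept_prob(x):     # Rej-GABC step 2
        accepted.append(theta)
accepted = np.array(accepted)
```

Replacing accept_prob with the indicator of |D − x| ≤ eps recovers the uniform rejection algorithm, exactly as in step 2′.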
Uniform ABC algorithm
This allows us to interpret uniform ABC. Suppose X ,D ∈ R
Proposition
Accepted θ from the uniform ABC algorithm (with ρ(D, X) = |D − X|) are samples from the posterior distribution of θ given D, where we assume D = f(θ) + e and that
e ∼ U[−ε, ε]
In general, uniform ABC assumes that
D|x ∼ U{d : ρ(d , x) ≤ ε}
i.e., D is generated by adding noise uniformly chosen from a ball of radius ε around the best simulator output f(θ).
ABC gives ‘exact’ inference under a different model!
Acceptance Kernel - π(D|x)
Kennedy and O'Hagan 2001, Goldstein and Rougier 2009
How do we relate the simulator to reality?
1. Measurement error: D = ζ + e; let πε(D|X) be the distribution of e.
2. Model error: ζ = f(θ) + δ; let πε(D|X) be the distribution of δ.
Or both: πε(D|x) is a convolution of the two distributions.
3. Sampling a hidden space: often the data D are noisy observations of some latent feature (call it X), which is generated by a stochastic process. By removing the stochastic sampling from the simulator we can let π(D|x) do the sampling for us (Rao-Blackwellisation).
Kernel Smoothing
Blum 2010, Fearnhead and Prangle 2012
Viewing ABC as an extension of modelling isn't commonly done, but it
allows us to do the inference we want (and to interpret it)
makes explicit the relationship between simulator and observations
allows for the possibility of more efficient ABC algorithms
A different but equivalent view of ABC is as kernel smoothing:

πABC(θ|D) ∝ ∫ Kε(D − x) π(x|θ) π(θ) dx

where Kε(x) = (1/ε) K(x/ε), K is a standard kernel, and ε is the bandwidth.
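The kernel-smoothing view suggests a weighted (importance-sampling) variant: rather than accepting or rejecting, weight each prior draw by Kε(D − x). A sketch with a Gaussian K; the simulator and prior are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
D, eps = 2.0, 0.5

def simulator(theta):
    # Illustrative simulator: X ~ N(theta, 1)
    return rng.normal(theta, 1.0)

thetas = rng.normal(0.0, 3.0, size=5000)         # prior draws, N(0, 3^2)
xs = np.array([simulator(t) for t in thetas])    # simulator outputs
# K_eps(D - x) = (1/eps) K((D - x)/eps) with Gaussian K
weights = np.exp(-0.5 * ((D - xs) / eps) ** 2) / eps
weights /= weights.sum()                         # self-normalise
post_mean = float(np.sum(weights * thetas))      # weighted estimate of E[theta|D]
```

Every simulation contributes, so nothing is thrown away; the price is weight degeneracy when eps is very small, the smooth analogue of a vanishing acceptance rate.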
Efficient Algorithms
References:
Marjoram et al. 2003
Sisson et al. 2007
Beaumont et al. 2008
Toni et al. 2009
Del Moral et al. 2011
Drovandi et al. 2011
ABCifying Monte Carlo methods
Rejection ABC is the basic ABC algorithm.
Inefficient as it repeatedly samples from prior
A large number of papers have been published turning other Monte Carlo algorithms into ABC-type algorithms for when we don't know the likelihood: IS, MCMC, SMC, EM, EP, etc.
Focus on MCMC and SMC
presented for GABC with acceptance kernels, but most of the algorithms were written down for uniform ABC, i.e.,

πε(D|X) = I{ρ(D, X) ≤ ε}

and we can make this choice in most cases if desired.
MCMC-ABC
Marjoram et al. 2003
We are targeting the joint distribution

πABC(θ, x|D) ∝ πε(D|x) π(x|θ) π(θ)

To explore the (θ, x) space, proposals of the form

Q((θ, x), (θ′, x′)) = q(θ, θ′) π(x′|θ′)

seem to be inevitable (q arbitrary).
The Metropolis-Hastings (MH) acceptance probability is then

r = πABC(θ′, x′|D) Q((θ′, x′), (θ, x)) / [πABC(θ, x|D) Q((θ, x), (θ′, x′))]
  = πε(D|x′) π(x′|θ′) π(θ′) q(θ′, θ) π(x|θ) / [πε(D|x) π(x|θ) π(θ) q(θ, θ′) π(x′|θ′)]
  = πε(D|x′) q(θ′, θ) π(θ′) / [πε(D|x) q(θ, θ′) π(θ)]
This gives the following MCMC kernel

MH-ABC - PMarj(θ0, ·)
1. Propose a move from zt = (θ, x) to (θ′, x′) using the proposal Q above.
2. Accept the move with probability

r((θ, x), (θ′, x′)) = min(1, πε(D|x′) q(θ′, θ) π(θ′) / [πε(D|x) q(θ, θ′) π(θ)]),

otherwise set zt+1 = zt.

In practice, we find this algorithm often gets stuck at a given θ, as the probability of generating x′ near D can be tiny if ε is small.
Note that this is a special case of a "pseudo-marginal" Metropolis-Hastings algorithm, and can be modified to use multiple simulations at each θ, i.e.

r = min(1, [Σ_{i=1}^N πε(D|x′_i)] q(θ′, θ) π(θ′) / ([Σ_{i=1}^N πε(D|x_i)] q(θ, θ′) π(θ)))

to better approximate the likelihood.
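A sketch of MH-ABC with the uniform acceptance kernel, where πε(D|x′) is the indicator of ρ(D, x′) ≤ ε, so the MH ratio reduces to the prior (and proposal) ratio whenever the fresh simulation lands in the ball. The simulator, prior, and tuning constants are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
D, eps = 2.0, 0.5

def simulator(theta):
    # Illustrative simulator: X ~ N(theta, 1)
    return rng.normal(theta, 1.0)

def log_prior(theta):
    # N(0, 3^2) prior, up to an additive constant
    return -0.5 * theta ** 2 / 9.0

theta = 2.0                    # initial state
chain = [theta]
for _ in range(5000):
    theta_prop = theta + rng.normal(0.0, 1.0)   # symmetric proposal q
    x_prop = simulator(theta_prop)              # x' ~ pi(x|theta')
    # Uniform kernel: the move can only be accepted if the simulation hits
    if abs(D - x_prop) <= eps:
        log_r = log_prior(theta_prop) - log_prior(theta)
        if np.log(rng.uniform()) < log_r:
            theta = theta_prop
    chain.append(theta)
chain = np.array(chain)
```

The sticking behaviour described on the slide is visible here: as eps shrinks, the hit probability |D − x′| ≤ eps collapses and the chain rarely moves.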
Recent developments - Lee 2012
1-hit MCMC kernel - P1hit(θ0, ·)
1. Propose θ′ ∼ q(θt, ·)
2. With probability 1 − min(1, q(θ′, θt) π(θ′) / (q(θt, θ′) π(θt))) set θt+1 = θt and stop
3. Otherwise, sample x′ ∼ π(·|θ′) and x ∼ π(·|θt) until ρ(x′, D) ≤ ε or ρ(x, D) ≤ ε
4. If ρ(x′, D) ≤ ε set θt+1 = θ′, otherwise set θt+1 = θt
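A sketch of the 1-hit kernel with a symmetric proposal (so q cancels in step 2) and an illustrative N(θ, 1) simulator and N(0, 1) prior; none of these choices come from the slide:

```python
import numpy as np

rng = np.random.default_rng(5)
D, eps = 1.0, 0.5

def simulator(theta):
    # Illustrative simulator: X ~ N(theta, 1)
    return rng.normal(theta, 1.0)

def prior_ratio(theta_new, theta_old):
    # N(0, 1) prior; the symmetric proposal q cancels in the ratio
    return np.exp(-0.5 * (theta_new ** 2 - theta_old ** 2))

theta = 1.0
chain = [theta]
for _ in range(1000):
    theta_prop = theta + rng.normal(0.0, 1.0)            # step 1: propose
    if rng.uniform() < min(1.0, prior_ratio(theta_prop, theta)):
        # steps 3-4: race simulations at theta' and theta until one hits
        while True:
            if abs(D - simulator(theta_prop)) <= eps:    # theta' hits first
                theta = theta_prop
                break
            if abs(D - simulator(theta)) <= eps:         # current theta hits
                break
    chain.append(theta)
chain = np.array(chain)
```

Each iteration may consume many simulations in the race, which is the extra per-iteration cost noted below; the pay-off is that the chain cannot get permanently stuck the way PMarj can.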
Recent developments
Lee et al. 2013 showed PMarj is neither
variance bounding: let Ê[h(θ)] = (1/m) Σ h(θi); a Markov kernel P is variance bounding if VarP(Ê[h(θ)]) is "reasonably small"
nor geometrically ergodic (GE), i.e., ||P^m(θ0, ·) − πABC(·)||_TV ≤ C ρ^m with ρ < 1; Markov kernels that are not GE may converge extremely slowly,
whereas P1hit is (subject to conditions).
[Figure 6 from Lee et al. 2013: density estimates of the marginal posteriors for the Lotka-Volterra model. Figure 7: estimates of the posterior mean of θ3 by iteration for kernels P1,1, P1,15, P2,15 and P3, with horizontal lines marking the rejection-sampler estimate plus and minus two estimated standard deviations.]
Note that P1hit requires significantly more computation per iteration (but this may be worth it).
Recent developmentsLee et al. 2013 showed PMarj is neither
variance boundingI Let Eh(θ) = 1
m
∑h(θi ) - Markov kernel P is variance bounding if
VarP(Eh(θ)) is ”reasonably small”
nor geometrically ergodic (GE) i.e ||Pm(θ0, ·)− πABC (·)||TV ≤ Cρm
where ρ < 1. Markov kernels that are not GE may convergenceextremely slowly.
whereas P1hit is (subject to conditions).
(a) ✓1 (b) ✓2 (c) ✓3
Figure 6: Density estimates of the marginal posteriors for the Lotka-Volterra model.
[Trace plots by iteration; panels (a) P1,1, (b) P1,15, (c) P2,15, (d) P3; vertical axis range 0.79–0.83.]
Figure 7: Estimates of the posterior mean of θ3 by iteration using each kernel. The three horizontal lines correspond to the estimate obtained using the rejection sampler with two estimated standard deviations added and subtracted.
and historical uses in statistics can be traced through Feller (1940); Doob (1945); Kendall (1949, 1950), and the method was rediscovered in Gillespie (1977) in the context of stochastic kinetic models. These articles develop a straightforward way to simulate the full process X1:2(t), t ∈ [0, 10], as the inter-jump times are exponential random variables, although more sophisticated approaches are possible (see, e.g., Wilkinson, 2011, Chapter 8).
Importance sampling GABC

In uniform ABC, importance sampling simply reduces to the rejection algorithm with a fixed budget for the number of simulator runs.

But for GABC it opens new algorithms:

GABC - Importance sampling
1 θi ∼ π(θ) and Xi ∼ π(x|θi).
2 Give (θi, xi) weight wi = πε(D|xi).

Which is more efficient - IS-GABC or Rej-GABC?

Proposition 2
IS-GABC has a larger effective sample size than Rej-GABC, or equivalently

VarRej(w) ≥ VarIS(w)

This can be seen as a Rao-Blackwell type result.
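The IS-GABC scheme above can be sketched in a few lines. This is a minimal toy example, assuming a scalar Gaussian simulator and a Gaussian acceptance kernel πε; the simulator and all numerical values are illustrative, not from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 2.0        # observed data (toy scalar)
eps = 0.5      # acceptance-kernel bandwidth
N = 10_000

def simulator(theta):            # toy simulator: X | theta ~ N(theta, 1)
    return rng.normal(theta, 1.0)

def accept_kernel(d, x):         # pi_eps(D | x): Gaussian acceptance kernel
    return np.exp(-0.5 * ((d - x) / eps) ** 2)

theta = rng.normal(0.0, 2.0, N)  # step 1: draws from the prior pi(theta) = N(0, 2^2)
x = simulator(theta)             #         and simulator output for each draw
w = accept_kernel(D, x)          # step 2: importance weights w_i = pi_eps(D | x_i)
w /= w.sum()                     # normalise the weights

post_mean = np.sum(w * theta)    # weighted estimate of E(theta | D)
ess = 1.0 / np.sum(w ** 2)       # effective sample size of the weighted sample
```

Unlike rejection, every simulator run contributes (with some weight), which is the Rao-Blackwellisation behind Proposition 2.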
Rejection Control (RC)

A difficulty with IS algorithms is that they can require the storage of a large number of particles with small weights.

Thin particles with small weights using rejection control:

Rejection Control in IS-GABC
1 θi ∼ π(θ) and Xi ∼ π(X|θi)
2 Accept (θi, Xi) with probability

r(Xi) = min(1, πε(D|Xi)/C)

for any threshold constant C ≥ 0.
3 Give accepted particles weights

wi = max(πε(D|Xi), C)

IS is more efficient than RC, unless we have memory constraints (relative to processor time).
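The thinning step can be sketched as follows; a minimal sketch with illustrative weights, not the tutorial's own code. The key identity is E[max(w, C) · min(1, w/C)] = w, so the reweighting keeps the estimator unbiased.

```python
import numpy as np

rng = np.random.default_rng(1)
C = 0.1   # rejection-control threshold (illustrative choice)

def rejection_control(particles, weights, C, rng):
    """Keep particle i with probability min(1, w_i / C),
    and give survivors weight max(w_i, C)."""
    p = np.minimum(1.0, weights / C)
    keep = rng.random(len(weights)) < p
    return particles[keep], np.maximum(weights[keep], C)

theta = rng.normal(size=1000)
w = np.exp(-theta ** 2)          # illustrative unnormalised weights
theta_rc, w_rc = rejection_control(theta, w, C, rng)
```

After thinning, no stored particle has weight below C, at the cost of a little extra Monte Carlo noise.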
Sequential ABC algorithms

The most popular efficient ABC algorithms are those based on sequential methods (Sisson et al. 2007, Toni et al. 2008, Beaumont et al. 2009, ...).

We aim to sample N particles successively from a sequence of distributions

π1(θ), . . . , πT(θ) = target

For ABC we decide upon a sequence of tolerances ε1 > ε2 > . . . > εT and let πt be the distribution found by the ABC algorithm when we use tolerance εt.

Specifically, define a sequence of target distributions

πt(θ, x) = πt(D|x)π(x|θ)π(θ)/Ct = γt(θ, x)/Ct

with πt(D|X) = πεt(D|X).
ABC SMC (Toni et al., 2009)

(a) As in ABC rejection, we define a prior distribution P(θ) and we would like to approximate a posterior distribution P(θ|D0). In ABC SMC we do this sequentially by constructing intermediate distributions, which converge to the posterior distribution. We define a tolerance schedule ε1 > ε2 > . . . > εT ≥ 0.

(b) We sample particles from a prior distribution until N particles have been accepted (have reached a distance smaller than ε1). For all accepted particles we calculate weights (see [4] for formulas and derivation). We call the sample of all accepted particles "Population 1".

(c) We then sample a particle θ∗ from population 1 and perturb it to obtain a perturbed particle θ∗∗ ∼ K(θ|θ∗), where K is a perturbation kernel (for example a Gaussian random walk). We then simulate a dataset D∗ ∼ f(D|θ∗∗) and accept the particle θ∗∗ if d(D0, D∗) ≤ ε2. We repeat this until we have accepted N particles in population 2. We calculate weights for all accepted particles.

(d) We repeat the same procedure for the following populations, until we have accepted N particles of the last population T and calculated their weights. Population T is a sample of particles that approximates the posterior distribution.

ABC SMC is computationally much more efficient than ABC rejection (see [4] for comparison).
[Schematic, panels (a)–(d): intermediate distributions moving from prior to posterior under tolerances ε1 > ε2 > . . . > εT−1 > εT, through Population 1, Population 2, . . . , Population T.]
Figure 2: Schematic representation of ABC SMC.
Picture from Toni and Stumpf 2010 tutorial
At each stage t, we aim to construct a weighted sample of particles that approximates πt(θ, x):

{(z_t^(i), W_t^(i))}_{i=1}^N  such that  πt(z) ≈ ∑ W_t^(i) δ_{z_t^(i)}(dz)

where z_t^(i) = (θ_t^(i), x_t^(i)).
Toni et al. (2008)

Assume we have a cloud of weighted particles {(θi, wi)}_{i=1}^N that were accepted at step t − 1.

1 Sample θ from the previous population according to the weights.
2 Perturb the particle according to perturbation kernel qt, i.e., θ ∼ qt(θ, ·).
3 Reject the particle immediately if θ has zero prior density, i.e., if π(θ) = 0.
4 Otherwise simulate X ∼ f(θ) from the simulator. If ρ(S(X), S(D)) ≤ εt accept the particle, otherwise reject.
5 Give the accepted particle weight

wi = π(θ) / ∑_j wj qt(θj, θ)

6 Repeat steps 1-5 until we have N accepted particles at step t.
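One population step of the algorithm above can be sketched as follows. This is a toy sketch, assuming a scalar Gaussian simulator, an N(0, 2²) prior, and a Gaussian perturbation kernel; all numerical choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
D = 3.0

def simulator(theta):                      # toy simulator: X | theta ~ N(theta, 1)
    return rng.normal(theta, 1.0)

def prior_pdf(theta):                      # prior N(0, 2^2)
    return np.exp(-0.5 * theta ** 2 / 4) / np.sqrt(2 * np.pi * 4)

def smc_step(theta_prev, w_prev, eps_t, sigma_q=0.5, N=500):
    """One population of ABC-SMC in the style of Toni et al. (2008)."""
    theta_new, w_new = [], []
    while len(theta_new) < N:
        j = rng.choice(len(theta_prev), p=w_prev)   # 1. resample by weight
        theta = rng.normal(theta_prev[j], sigma_q)  # 2. perturb
        if prior_pdf(theta) == 0:                   # 3. reject zero-prior draws
            continue
        x = simulator(theta)                        # 4. simulate and compare
        if abs(x - D) <= eps_t:
            # 5. weight = prior / mixture of perturbation kernels
            #    (unnormalised Gaussian; the constant cancels on normalisation)
            denom = np.sum(w_prev * np.exp(-0.5 * ((theta - theta_prev) / sigma_q) ** 2))
            theta_new.append(theta)
            w_new.append(prior_pdf(theta) / denom)
    theta_new, w_new = np.array(theta_new), np.array(w_new)
    return theta_new, w_new / w_new.sum()

# initial population from rejection ABC with a loose tolerance
theta0 = rng.normal(0, 2, 5000)
keep = np.abs(simulator(theta0) - D) <= 2.0
theta_prev = theta0[keep]
w_prev = np.full(keep.sum(), 1.0 / keep.sum())
theta1, w1 = smc_step(theta_prev, w_prev, eps_t=1.0)
```

In practice one would iterate `smc_step` over the whole tolerance schedule ε1 > ε2 > . . . > εT.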
Sequential Monte Carlo (SMC)

All the SMC-ABC algorithms can be understood as special cases of Del Moral et al. 2006.

If at stage t we use proposal distribution ηt(z) for the particles, then we create the weighted sample as follows:

Generic Sequential Monte Carlo - stage t
(i) For i = 1, . . . , N, sample

Z_t^(i) ∼ ηt(z)

and correct between ηt and πt:

wt(Z_t^(i)) = γt(Z_t^(i)) / ηt(Z_t^(i))

(ii) Normalize to find weights {W_t^(i)}.
(iii) If the effective sample size (ESS) is less than some threshold T, resample the particles and set W_t^(i) = 1/N. Set t = t + 1.
Del Moral et al. SMC algorithm

We can build the proposal distribution ηt(z) from the particles available at time t − 1.

One way to do this is to propose new particles by passing the old particles through a Markov kernel qt(z, z′): for i = 1, . . . , N,

z_t^(i) ∼ qt(z_{t−1}^(i), ·)

This makes ηt(z) = ∫ ηt−1(z′) qt(z′, z) dz′ - which is unknown in general.
Del Moral et al. 2006 showed how to avoid this problem by introducing a sequence of backward kernels, Lt−1.

Del Moral et al. 2006 SMC algorithm - step t
(i) Propagate: extend the particle paths using Markov kernel Qt. For i = 1, . . . , N,

Z_t^(i) ∼ Qt(z_{t−1}^(i), ·)

(ii) Weight: correct between ηt(z_{0:t}) and πt(z_{0:t}). For i = 1, . . . , N,

wt(z_{0:t}^(i)) = γt(z_{0:t}^(i)) / ηt(z_{0:t}^(i))     (1)
              = Wt−1(z_{0:t−1}^(i)) w̃t(z_{t−1}^(i), z_t^(i))     (2)

where

w̃t(z_{t−1}^(i), z_t^(i)) = γt(z_t^(i)) Lt−1(z_t^(i), z_{t−1}^(i)) / [γt−1(z_{t−1}^(i)) Qt(z_{t−1}^(i), z_t^(i))]     (3)

is the incremental weight.

(iii) Normalise the weights to obtain {W_t^(i)}.
(iv) Resample if ESS < T and set W_t^(i) = 1/N for all i. Set t = t + 1.
SMC with partial rejection control (PRC)

We can add in the rejection control idea of Liu:

Del Moral SMC algorithm with Partial Rejection Control - step t
(i) For i = 1, . . . , N
  (a) Sample z∗ from {z_{t−1}^(i)} according to weights W_{t−1}^(i).
  (b) Perturb: z∗∗ ∼ Qt(z∗, ·)
  (c) Weight:

  w∗ = γt(z∗∗) Lt−1(z∗∗, z∗) / [γt−1(z∗) Qt(z∗, z∗∗)]

  (d) PRC: Accept z∗∗ with probability min(1, w∗/ct). If accepted, set z_t^(i) = z∗∗ and w_t^(i) = max(w∗, ct). Otherwise return to (a).
(ii) Normalise the weights to get W_t^(i).
GABC versions of SMC

We need to choose
- Sequence of targets πt
- Forward perturbation kernels Qt
- Backward kernels Lt
- Thresholds ct

By making particular choices for these quantities we can recover many of the published SMC-ABC samplers.
Uniform SMC-ABC

For example,
- let πt be the uniform ABC target using εt,

  πt(D|X) = 1 if ρ(D, X) ≤ εt, and 0 otherwise

- let Qt(z, z′) = qt(θ, θ′)π(x′|θ′)
- let c1 = 1 and ct = 0 for t ≥ 2
- let

  Lt−1(zt, zt−1) = πt−1(zt−1) Qt(zt−1, zt) / πt−1Qt(zt)

  and approximate πt−1Qt(z) = ∫ πt−1(z′) Qt(z′, z) dz′ by

  πt−1Qt(z) ≈ ∑_j W_{t−1}^(j) Qt(z_{t−1}^(j), z)

then the algorithm reduces to Beaumont et al. 2009. We recover the Sisson et al. 2007 algorithm if we add in a further resampling step. Toni et al. 2009 is recovered by including a compulsory resampling step.
Other sequential GABC algorithms

We can combine SMC with MCMC type moves, by using

Lt−1(zt, zt−1) = πt−1(zt−1) Qt(zt−1, zt) / πt−1Qt(zt)

If we then use a πt-invariant Metropolis-Hastings kernel Qt and let

Lt−1(zt, zt−1) = πt(zt−1) Qt(zt−1, zt) / πt(zt)

then we get an ABC resample-move algorithm.
Approximate Resample-Move (with PRC)

RM-GABC
(i) While ESS < N
  (a) Sample z∗ = (θ∗, X∗) from {z_{t−1}^(i)} according to weights W_{t−1}^(i).
  (b) Weight:

  w∗ = wt(X∗) = πt(D|X∗) / πt−1(D|X∗)

  (c) PRC: With probability min(1, w∗/ct), sample

  z_t^(i) ∼ Qt(z∗, ·)

  where Qt is an MCMC kernel with invariant distribution πt. Set i = i + 1. Otherwise, return to (i)(a).
(ii) Normalise the weights to get W_t^(i). Set t = t + 1.

Note that because the incremental weights are independent of zt we are able to swap the perturbation and PRC steps.
Conclusions

- The tolerance ε controls the accuracy of ABC algorithms, and so we desire to take ε as small as possible in many problems (although not always).
- By using efficient sampling algorithms, we can hope to better use the available computational resource to spend more time simulating in regions of parameter space likely to lead to accepted values.
- MCMC and SMC versions of ABC have been developed, along with ABC versions of most other algorithms.
Links to other approaches

History-matching
e.g. Craig et al. 2001, Vernon et al. 2010

ABC can be seen as a probabilistic version of history matching. History matching is used in the analysis of computer experiments to rule out regions of parameter space as implausible.

1 Relate the simulator to the system

ζ = f(θ) + ε

where ε is our simulator discrepancy.
2 Relate the system to the data (e represents measurement error)

D = ζ + e

3 Declare θ implausible if, e.g.,

‖D − E f(θ)‖ > 3σ

where σ² is the combined variance implied by the emulator, discrepancy and measurement error.
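The implausibility cut above can be sketched directly; a toy example with a hypothetical quadratic "simulator" and illustrative variance components (here the emulator variance is taken as zero, i.e. the simulator is run exactly).

```python
import numpy as np

def implausible(D, f_theta, var_obs, var_disc, var_emul=0.0, cutoff=3.0):
    """History-matching rule: theta is implausible if
    |D - E f(theta)| / sqrt(var_obs + var_disc + var_emul) > cutoff."""
    total_sd = np.sqrt(var_obs + var_disc + var_emul)
    return np.abs(D - f_theta) / total_sd > cutoff

theta = np.linspace(-5, 5, 101)
f = theta ** 2                       # toy simulator output at each theta
mask = implausible(D=4.0, f_theta=f, var_obs=0.5, var_disc=0.5)
not_ruled_out = theta[~mask]         # region we cannot rule out at this wave
```

Each wave of history matching shrinks `not_ruled_out` further, typically with a refocused emulator.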
History-matching

If θ is not implausible we don't discard it. The result is a region of space that we can't rule out at this stage of the history-match.

It is usual to go through several stages of history matching.

Notes:
- History matching can be seen as a principled version of ABC - lots of thought goes into the link between simulator and reality.
- The result of history-matching may be that there is no not-implausible region of parameter space.
  - Go away and think harder - something is misspecified.
  - This can also happen in rejection ABC.
  - In contrast, MCMC will always give an answer, even if the model is terrible.
Noisy-ABC

Fearnhead and Prangle (2012) proposed the noisy-ABC algorithm:

Noisy-ABC
Initialise: let D′ = D + e where e ∼ K(e) for some kernel K(·).
1 θi ∼ π(θ) and Xi ∼ π(x|θi).
2 Give (θi, xi) weight wi = K(Xi − D′).

In our notation, replace the observed data D with D′ drawn from the acceptance kernel - D′ ∼ π(D′|D).

The main argument in favour of noisy-ABC is that it is calibrated, unlike standard ABC.

PABC is calibrated if

P(θ ∈ A | Eq(A)) = q

where Eq(A) is the event that the ABC posterior assigns probability q to event A, i.e., given that PABC(A) = q, we are calibrated if A occurs with probability q according to base measure P (defined by prior, simulator likelihood and K).
Noisy ABC

- Noisy ABC is well calibrated. However, this is a frequency property, and so it only becomes relevant if we repeat the analysis with different D′ many times.
- It is highly relevant to filtering problems.

Note that noisy ABC and GABC are trying to do different things:
- Noisy ABC moves the data so that it comes from the model we are assuming when we do inference.
  - It assumes the model π(D|θ) is true and tries to find the true posterior given the noisy data.
- GABC accepts the model is incorrect, and tries to account for this in the inference.
Other algorithms

- Wood 2010 is an ABC algorithm, but using the sample mean µθ and covariance Σθ of the summary of f(θ) run n times at θ, and assuming

π(D|S) = N(D; µθ, Σθ)

- (Generalized Likelihood Uncertainty Estimation) GLUE approach of Keith Beven in hydrology - see Nott and Marshall 2012
- Kalman filtering, see Nott et al. 2012.
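Wood's Gaussian assumption on the summaries (the "synthetic likelihood") can be sketched as follows; a toy example in which the summaries are the sample mean and standard deviation of a hypothetical Gaussian simulator, with a small diagonal jitter added for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(3)

def synthetic_loglik(theta, s_obs, simulate_summaries, n=100):
    """Synthetic likelihood in the spirit of Wood (2010): fit a Gaussian to
    n simulated summary vectors at theta and evaluate s_obs under it."""
    S = np.array([simulate_summaries(theta) for _ in range(n)])   # n x d
    mu = S.mean(axis=0)
    Sigma = np.cov(S, rowvar=False) + 1e-8 * np.eye(S.shape[1])   # jitter
    d = s_obs - mu
    sign, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d @ np.linalg.solve(Sigma, d) + logdet
                   + len(d) * np.log(2 * np.pi))

# toy: summaries are (mean, sd) of 50 draws from N(theta, 1)
def sim_summ(theta):
    x = rng.normal(theta, 1.0, 50)
    return np.array([x.mean(), x.std()])

s_obs = np.array([1.0, 1.0])
ll = [synthetic_loglik(t, s_obs, sim_summ) for t in (0.0, 1.0, 2.0)]
```

This log-likelihood can then be plugged into a standard MCMC sampler over θ.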
The dangers of ABC - H. L. Mencken

'For every complex problem, there is an answer that is short, simple and wrong.'

Why use ABC? J. Galsworthy

'Idealism increases in direct proportion to one's distance from the problem.'
Recap I

Uniform Rejection ABC
- Draw θ from π(θ)
- Simulate X ∼ f(θ)
- Accept θ if ρ(D, X) ≤ ε

We've looked at a variety of more efficient sampling algorithms, e.g. ABC-MCMC, ABC-IS, ABC-SMC. The higher the efficiency, the smaller the tolerance we can use for a given computational expense.
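The three-line rejection algorithm above translates almost directly into code; a toy sketch assuming a scalar Gaussian simulator, with illustrative prior, data and tolerance.

```python
import numpy as np

rng = np.random.default_rng(4)
D = 2.0       # observed data (toy scalar)
eps = 0.3     # tolerance

theta = rng.normal(0.0, 2.0, 50_000)    # draw theta from the prior N(0, 2^2)
x = rng.normal(theta, 1.0)              # simulate X ~ f(theta)
accepted = theta[np.abs(x - D) <= eps]  # accept theta if rho(D, X) <= eps
```

The accepted draws form an (approximate) posterior sample; shrinking `eps` improves accuracy at the cost of the acceptance rate.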
Recap II

Alternative approaches focus on avoiding the curse of dimensionality: if the data are too high dimensional we never observe simulations that are 'close' to the field data.

Approaches include
- Using summary statistics S(D) to reduce the dimension

  Uniform rejection ABC with summaries
  - Draw θ from π(θ)
  - Simulate X ∼ f(θ)
  - Accept θ if ρ(S(D), S(X)) < ε

  If S is sufficient this is equivalent to the previous algorithm.
- Regression adjustment - model and account for the discrepancy between S = S(X) and Sobs = S(D).
Regression Adjustment
References:
Beaumont et al. 2003
Blum and Francois 2010
Blum 2010
Leuenberger and Wegmann 2010
Regression Adjustment
An alternative to rejection-ABC, proposed by Beaumont et al. 2002, usespost-hoc adjustment of the parameter values to try to weaken the effectof the discrepancy between s and sobs .
Two key ideas
use non-parametric kernel density estimation to emphasise the bestsimulations
learn a non-linear model for the conditional expectation E(θ|s) as afunction of s and use this to learn the posterior at sobs .
Idea 1 - kernel regression

Suppose we want to estimate

E(θ|sobs) = ∫ θ π(θ, sobs)/π(sobs) dθ

using pairs {θi, si} from π(θ, s).

Approximating the two densities using kernel density estimates

π̂(θ, s) = (1/n) ∑_i K(s − si)K(θ − θi),    π̂(s) = (1/n) ∑_i K(s − si)

and substituting gives the Nadaraya-Watson estimator:

E(θ|sobs) ≈ ∑_i K(sobs − si)θi / ∑_i K(sobs − si)

as ∫ yK(y − a)dy = a.
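The Nadaraya-Watson estimator above is a one-liner once we have the (θi, si) pairs; a toy sketch with a Gaussian kernel and a hypothetical Gaussian link between summary and parameter.

```python
import numpy as np

rng = np.random.default_rng(5)

def nadaraya_watson(s_obs, theta, s, bandwidth):
    """Kernel-weighted average:
    E(theta | s_obs) ~ sum_i K(s_obs - s_i) theta_i / sum_i K(s_obs - s_i),
    here with a Gaussian kernel K."""
    K = np.exp(-0.5 * ((s_obs - s) / bandwidth) ** 2)
    return np.sum(K * theta) / np.sum(K)

theta = rng.uniform(0, 4, 5000)     # draws from a toy prior
s = rng.normal(theta, 0.5)          # summary simulated at each theta
est = nadaraya_watson(2.0, theta, s, bandwidth=0.2)
```

The same kernel weights K(sobs − si), normalised, give the weighted-particle view of the ABC posterior discussed below.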
- Beaumont et al. 2002 suggested using the Epanechnikov kernel

Kε(x) = (c/ε)[1 − (x/ε)²] I(|x| ≤ ε)

as it has finite support - we discard the majority of simulations. They recommend ε be set by deciding on the proportion of simulations to keep, e.g. the best 5%.

- This expression also arises if we view {θi, Wi}, with Wi = Kε(sobs − si) ≡ πε(sobs|si), as a weighted particle approximation to the posterior

π̂(θ|sobs) = ∑ wi δθi(θ)

where wi = Wi/∑ Wj are normalised weights.

- The Nadaraya-Watson estimator suffers from the curse of dimensionality - its rate of convergence drops rapidly as the dimension of s increases.
Idea 2 - regression adjustments

Consider the relationship between the conditional expectation of θ and s:

E(θ|s) = m(s)

Think of this as a model for the conditional density π(θ|s): for fixed s,

θi = m(s) + ei

where θi ∼ π(θ|s) and the ei are zero-mean and uncorrelated.

Suppose we've estimated m(s) by m̂(s) from samples {θi, si}. Estimate the posterior mean by

E(θ|sobs) ≈ m̂(sobs)

and, assuming constant variance (wrt s), we can form the empirical residuals

êi = θi − m̂(si)

and approximate the posterior π(θ|sobs) by adjusting the parameters:

θ∗i = m̂(sobs) + êi = θi + (m̂(sobs) − m̂(si))
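The adjustment θ∗i = θi + (m̂(sobs) − m̂(si)) can be sketched with an ordinary least-squares fit for m̂; a toy example with a global linear model and a crude rejection window, all values illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

theta = rng.uniform(0, 4, 2000)
s = rng.normal(theta, 0.5)          # one summary per parameter draw
s_obs = 2.0

# fit a global linear model m(s) = a + b*s by least squares
b, a = np.polyfit(s, theta, 1)
m = lambda s: a + b * s

# crude rejection step, then shift each kept parameter by the fitted-mean gap
window = np.abs(s - s_obs) <= 1.0
theta_adj = theta[window] + (m(s_obs) - m(s[window]))
```

The adjusted sample is centred on m̂(sobs) and has a smaller spread than the raw accepted sample, since the component of variation explained by s has been removed.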
[Scatter plot 'ABC and regression adjustment': θ plotted against S, with the acceptance window sobs − ε to sobs + ε marked and the points falling inside it highlighted in red.]
In rejection ABC, the red points are used to approximate the histogram.
[The same scatter plot of θ against S, now with the fitted regression line m̂(s) added and the accepted points within sobs − ε to sobs + ε adjusted along it.]
In rejection ABC, the red points are used to approximate the histogram. Using regression-adjustment, we use the estimate of the posterior mean at sobs and the residuals from the fitted line to form the posterior.
Models

Beaumont et al. 2003 used a local linear model for m(s) in the vicinity of sobs,

m(si) = α + βT si

fit by minimising

∑ (θi − m(si))² Kε(si − sobs)

so that observations nearest to sobs are given more weight in the fit.

The empirical residuals are then weighted so that the approximation to the posterior is a weighted particle set {θ∗i, Wi = Kε(si − sobs)}:

π̂(θ|sobs) = ∑ wi δθ∗i(θ)
Normal-normal conjugate model, linear regression

[Density plot of the posteriors for θ: ABC, true posterior, and regression-adjusted ABC.]

200 data points in both approximations. The regression-adjusted ABC gives a more confident posterior, as the θi have been adjusted to account for the discrepancy between si and sobs.
Extensions: Non-linear models

Blum and Francois 2010 proposed a nonlinear heteroscedastic model

θi = m(si) + σ(si)ei

where m(s) = E(θ|s) and σ²(s) = Var(θ|s). They used feed-forward neural networks for both the conditional mean and variance.

Blum and François (2009) suggest the use of non-linear conditional heteroscedastic regression models:

θ∗i = m̂(sobs) + (θi − m̂(si)) σ̂(sobs)/σ̂(si)

Picture from Michael Blum, www.ceremade.dauphine.fr/ xian/ABCOF.pdf
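The heteroscedastic adjustment above can be sketched with crude polynomial fits standing in for the neural networks; a toy example in which the simulator noise grows with θ, with the conditional variance estimated from squared residuals. All modelling choices here are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(7)

theta = rng.uniform(0, 4, 3000)
s = rng.normal(theta, 0.2 + 0.2 * theta)   # noise level grows with theta
s_obs = 2.0

# crude estimate of m(s) = E(theta | s): quadratic least-squares fit
m = np.poly1d(np.polyfit(s, theta, 2))

# crude estimate of sigma^2(s) = Var(theta | s): fit squared residuals on s
r2 = (theta - m(s)) ** 2
var_fit = np.poly1d(np.polyfit(s, r2, 2))
sigma = lambda s: np.sqrt(np.maximum(var_fit(s), 1e-6))   # floor for safety

# heteroscedastic adjustment of the kept parameters
keep = np.abs(s - s_obs) <= 1.0
theta_adj = m(s_obs) + (theta[keep] - m(s[keep])) * sigma(s_obs) / sigma(s[keep])
```

Rescaling the residuals by σ̂(sobs)/σ̂(si) corrects for the varying spread of θ given s, which the constant-variance adjustment ignores.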
Discussion

- These methods allow us to use larger tolerance values and can substantially improve posterior accuracy with less computation. However, sequential algorithms cannot easily be adapted, and so these methods tend to be used with simple rejection sampling.
- Many people choose not to use these methods, as they can give poor results if the model is badly chosen.
- Modelling the variance is hard, so transformations to make the relationship θ = m(s) + e as homoscedastic as possible (such as Box-Cox transformations) are usually applied.
- Blum 2010 contains estimates of the bias and variance of these estimators. They show the properties of the ABC estimators may seriously deteriorate as dim(s) increases ...
Summary Statistics
References:
Blum, Nunes, Prangle and Sisson 2012
Joyce and Marjoram 2008
Nunes and Balding 2010
Fearnhead and Prangle 2012
Robert et al. 2011
Error trade-off
Blum, Nunes, Prangle, Fearnhead 2012

The error in the ABC approximation can be broken into two parts:
1 Choice of summary:

π(θ|D) ≈? π(θ|S(D))

2 Use of ABC acceptance kernel:

π(θ|sobs) ≈? πABC(θ|sobs) = ∫ π(θ, s|sobs) ds ∝ ∫ πε(sobs|S(x)) π(x|θ) π(θ) dx

The first approximation allows the matching between S(D) and S(X) to be done in a lower dimension. There is a trade-off:
- dim(S) small: π(θ|sobs) ≈ πABC(θ|sobs), but π(θ|sobs) ≉ π(θ|D)
- dim(S) large: π(θ|sobs) ≈ π(θ|D), but π(θ|sobs) ≉ πABC(θ|sobs), as the curse of dimensionality forces us to use larger ε
Choosing summary statistics

If S(D) = sobs is sufficient for θ, i.e., sobs contains all the information contained in D about θ,

π(θ|sobs) = π(θ|D),

then using summaries has no detrimental effect.

However, low-dimensional sufficient statistics are rarely available. How do we choose good low dimensional summaries?
- The choice is one of the most important parts of ABC algorithms
- Insights from ML methods?
Automated summary selection
Blum, Nunes, Prangle and Fearnhead 2012

Suppose we are given a candidate set S = (s1, . . . , sp) of summaries from which to choose. Methods break down into groups:
- Best subset selection
  - Joyce and Marjoram 2008
  - Nunes and Balding 2010
- Projection
  - Blum and Francois 2010
  - Fearnhead and Prangle 2012
- Regularisation techniques
  - Blum, Nunes, Prangle and Fearnhead 2012
Best subset selection

Introduce a criterion, e.g.,
- τ-sufficiency (Joyce and Marjoram 2008): s1:k−1 are τ-sufficient relative to sk if

δk = sup_θ log π(sk|s1:k−1, θ) − inf_θ log π(sk|s1:k−1, θ)
   = range_θ (log π(s1:k|θ) − log π(s1:k−1|θ)) ≤ τ

i.e. adding sk changes the posterior sufficiently.
- Entropy (Nunes and Balding 2010)

Implement within a search algorithm such as forward selection.

Problems:
- assumes every change to the posterior is beneficial (see below)
- considerable computational effort required to compute δk
Projection

Several statistics from S may be required to get the same information content as a single informative summary.
- Project S onto a lower dimensional highly informative summary vector.

Most authors aim to find summaries so that

πABC(θ|s) ≈ π(θ|D)

Fearnhead and Prangle 2012 weaken this requirement and instead aim to find summaries that lead to good parameter estimates. They seek to minimise the expected posterior loss

E((θtrue − θ̂)²|D)  ⟹  θ̂ = E(θ|D)

They show that the optimal summary statistic is

s = E(θ|D)
Fearnhead and Prangle 2012

However, E(θ|D) will not usually be known. Instead, we can estimate it using the model

θi = E(θ|D) + ei = βT f(si) + ei

where f(s) is a vector of functions of s and the (θi, si) are output from a pilot ABC simulation. They choose the set of regressors using, e.g., BIC.

They then use the single summary statistic

ŝ = β̂T f(s)

for θ.

Advantages
- Scales well with large p and gives good point estimates.

Disadvantages
- Summaries usually lack interpretability and the method gives no guarantees about the approximation of the posterior.
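The construction above can be sketched end-to-end; a toy example with a Gaussian simulator and a hypothetical candidate set f(s) = (mean, sd, median), fitted by plain least squares rather than BIC-based selection.

```python
import numpy as np

rng = np.random.default_rng(8)

# pilot run: simulate (theta_i, data_i) pairs and build candidate summaries f(s)
theta = rng.uniform(0, 4, 3000)
x = rng.normal(theta[:, None], 1.0, size=(3000, 20))         # 20 obs per dataset
F = np.column_stack([x.mean(1), x.std(1), np.median(x, 1)])  # candidate f(s_i)

# linear regression theta_i ~ beta^T f(s_i): the fitted value is the summary
A = np.column_stack([np.ones(len(F)), F])
beta, *_ = np.linalg.lstsq(A, theta, rcond=None)

def summary(data):
    """Single scalar summary approximating E(theta | data)."""
    return beta @ np.concatenate(([1.0], [data.mean(), data.std(), np.median(data)]))

x_obs = rng.normal(2.0, 1.0, 20)   # toy observed dataset with true theta = 2
s_obs = summary(x_obs)
```

The scalar `summary` then replaces the full candidate set in a second, standard ABC run.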
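A sketch of the Fearnhead-Prangle construction, assuming a toy normal simulator and the linear basis f(s) = (1, s). Both choices are illustrative; richer bases chosen by, e.g., BIC are also possible:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy simulator (an illustrative assumption): 5 iid N(theta, 1) draws,
# with the order statistics as the raw summary vector S.
def simulate(theta, rng):
    return np.sort(rng.normal(theta, 1.0, size=5))

# Pilot ABC run: sample (theta_i, s_i) pairs from the prior predictive.
n_pilot = 2000
thetas = rng.uniform(-5, 5, size=n_pilot)          # prior draws
S = np.array([simulate(t, rng) for t in thetas])

# Fit theta_i = beta^T f(s_i) + e_i with f(s) = (1, s) by least squares.
F = np.column_stack([np.ones(n_pilot), S])
beta, *_ = np.linalg.lstsq(F, thetas, rcond=None)

# The fitted value beta^T f(s) is the semi-automatic summary statistic,
# an estimate of E(theta | s).
def summary(s):
    return np.concatenate([[1.0], s]) @ beta

s_new = summary(simulate(2.0, rng))  # summary of a fresh dataset, true theta = 2
```

The single scalar summary(s) then replaces the 5-dimensional raw summary in a standard rejection-ABC run.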
Summary warning:
Automated methods are a poor replacement for expert knowledge.
Instead of automation, ask: what aspects of the data do we expect our model to be able to reproduce? S(D) may be highly informative about θ, but if the model was not built to reproduce S(D), then why should we calibrate to it?
For example, many dynamical systems models are designed to model periods and amplitudes. Summaries that are not phase invariant may be informative about θ, but this information is misleading.
In cases where the model and/or prior are mis-specified, this problem can be particularly acute.
The rejection algorithm is usually used in summary selection algorithms, as otherwise we would need to rerun the MCMC or SMC sampler for each new summary, which is very expensive.
Model selection (Wilkinson 2007, Grelaud et al. 2009)
Ratmann et al. 2009 proposed methodology for testing the fit of a model without reference to other models.
But often we want to compare models → Bayes factors
B12 = π(D|M1) / π(D|M2)
where π(D|Mi) = ∫ πε(D|x) π(x|θ, Mi) π(θ) dx dθ.
For rejection ABC,
π(D) ≈ (1/N) Σ πε(D|xi),
which reduces to the acceptance rate for uniform ABC (Wilkinson 2007).
Or add an initial step into the rejection algorithm where we first pick a model; comparing the ratio of acceptance rates then directly targets the Bayes factor.
See Toni et al. 2009 for an SMC-ABC approach.
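The acceptance-rate idea can be sketched as follows. The two parameter-free Gaussian "models", tolerance and uniform kernel are assumptions made to keep the example short; with equal prior model probabilities the ratio of acceptance rates estimates B12:

```python
import numpy as np

rng = np.random.default_rng(0)
d_obs, eps, N = 0.0, 0.1, 200_000

# Two toy models with no free parameters (an assumption for brevity):
# M1 simulates N(0, 1) data, M2 simulates N(0, 3^2) data.
x1 = rng.normal(0.0, 1.0, size=N)
x2 = rng.normal(0.0, 3.0, size=N)

# With a uniform kernel, each acceptance rate estimates pi_eps(D | M_i)
# up to the same constant, so their ratio estimates the Bayes factor B12
# (here the true value is the density ratio at d_obs, i.e. 3).
acc1 = np.mean(np.abs(x1 - d_obs) < eps)
acc2 = np.mean(np.abs(x2 - d_obs) < eps)
B12 = acc1 / acc2
```

In practice each model would also have parameters drawn from its prior before simulating, exactly as in the rejection algorithm for a single model.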
Summary statistics for model selection (Didelot et al. 2011, Robert et al. 2011)
Care needs to be taken with regard to summary statistics for model selection. Everything is fine if we target
BS = π(S(D)|M1) / π(S(D)|M2).
Then the ABC estimator B^ε_S → BS as ε → 0, N → ∞ (Didelot et al. 2011).
However,
π(S(D)|M1) / π(S(D)|M2) ≠ π(D|M1) / π(D|M2) = BD,
even if S is a sufficient statistic! S being sufficient for f1(D|θ1) and f2(D|θ2) does not imply sufficiency for the joint model {m, fm(D|θm)}. Hence B^ε_S ↛ BD.
Note: there is no problem if we view inference as conditional on a carefully chosen S.
See Prangle et al. 2013 for automatic selection of summaries for model selection.
Choice of metric ρ
Consider the following system:
Xt+1 = f(Xt) + N(0, σ²)   (4)
Yt = g(Xt) + N(0, τ²)   (5)
where we want to estimate the measurement error τ and the model error σ.
A default choice of metric is (something similar to)
ρ(Y, yobs) = Σt (yobs,t − Yt)²,
or the CRPS (a proper scoring rule)
ρ(yobs, F(·)) = Σt crps(yobs,t, Ft(·)) = Σt ∫ (Ft(u) − I{yt ≤ u})² du,
where Ft(·) is the distribution function of Yt | y1:t−1.
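A minimal empirical version of the CRPS metric, using the standard identity crps(y, F) = E|X − y| − (1/2)E|X − X′| for an ensemble approximation of Ft. Representing each Ft by a simulated ensemble is an assumption of this sketch:

```python
import numpy as np

def crps(y, ensemble):
    # Empirical CRPS of an ensemble {X_j} approximating F_t, via the
    # identity crps(y, F) = E|X - y| - 0.5 E|X - X'|.
    x = np.asarray(ensemble, dtype=float)
    return np.mean(np.abs(x - y)) - 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))

def rho(y_obs, ensembles):
    # Sum the per-time-step CRPS, mirroring the scoring-rule metric above.
    return sum(crps(y, ens) for y, ens in zip(y_obs, ensembles))
```

Unlike the squared-error metric, this distance rewards simulators whose whole predictive distribution Ft, not just the point forecast, matches the observation.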
[Two figures: posterior samples plotted as τ² against σ², both axes from 0 to 10, one panel for each choice of metric.]
GP-accelerated ABC
Problems with Monte Carlo methods
Monte Carlo methods are generally guaranteed to succeed if we run themfor long enough.
This guarantee comes at a cost.
Most methods sample naively: they don't learn from previous simulations.
They don't exploit known properties of the likelihood function, such as continuity.
They sample randomly, rather than using space-filling designs.
This naivety can make a full analysis infeasible without access to a largeamount of computational resource.
Likelihood estimation
The GABC framework assumes
π(D|θ) = ∫ π(D|X) π(X|θ) dX ≈ (1/N) Σ π(D|Xi),
where Xi ∼ π(X|θ).
For many problems, we believe the likelihood is continuous and smooth, so that π(D|θ) is similar to π(D|θ′) when θ − θ′ is small.
We can model L(θ) = π(D|θ) and use the model to find the posterior in place of running the simulator.
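The Monte Carlo estimator above can be sketched directly. The simulator, acceptance kernel π(D|X) and tolerance below are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
d_obs = 1.0

# Toy choices (assumptions for illustration): simulator X | theta ~ N(theta, 1)
# and a Gaussian acceptance kernel pi(D | X) = N(D; X, eps^2).
def pi_D_given_X(x, eps=0.2):
    return np.exp(-0.5 * ((d_obs - x) / eps) ** 2) / (eps * np.sqrt(2 * np.pi))

def L_hat(theta, N=10_000):
    x = rng.normal(theta, 1.0, size=N)   # X_i ~ pi(X | theta)
    return np.mean(pi_D_given_X(x))      # (1/N) sum_i pi(D | X_i)

# Smoothness in theta: nearby parameter values give similar likelihood
# estimates, which is what modelling L(theta) with a GP exploits.
l1, l2 = L_hat(1.0), L_hat(1.05)
```

Each L_hat evaluation costs N simulator runs, which is exactly why modelling L(θ) from a few evaluations is attractive.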
History matching waves
The likelihood is too difficult to model directly, so we model the log-likelihood instead:
G(θ) = log L(θ), where L̂(θi) = (1/N) Σ π(D|Xi), Xi ∼ π(X|θi).
However, the log-likelihood for a typical problem varies across too wide a range of values.
Consequently, any Gaussian process model will struggle to model the log-likelihood across the entire input range.
Introduce waves of history matching, similar to those used in Michael Goldstein's work.
In each wave, build a GP model that can rule out regions of space as implausible.
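A minimal one-dimensional sketch of a single history-matching wave, assuming a toy log-likelihood surface and a hand-rolled squared-exponential GP. The rule "mean + 3 sd above a cutoff" is one common implausibility criterion, used here for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Wave-1 design: noisy log-likelihood evaluations at a handful of theta
# values (the quadratic surface is an assumed stand-in for log L-hat).
theta_d = np.linspace(-3, 3, 10)
logL = -2.0 * theta_d**2 + 0.05 * rng.normal(size=theta_d.size)

# Hand-rolled GP regression with a squared-exponential kernel.
def k(a, b, ell=1.0, s2=4.0):
    return s2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

K = k(theta_d, theta_d) + 1e-4 * np.eye(theta_d.size)   # small nugget
grid = np.linspace(-3, 3, 61)
Ks = k(grid, theta_d)
mean = Ks @ np.linalg.solve(K, logL)
var = np.diag(k(grid, grid) - Ks @ np.linalg.solve(K, Ks.T))

# Rule out theta where even an optimistic prediction (mean + 3 sd) falls
# below a log-likelihood cutoff; the survivors go forward to the next wave.
cutoff = logL.max() - 10.0
plausible = mean + 3 * np.sqrt(np.clip(var, 0.0, None)) > cutoff
```

The next wave would place a fresh design only in the plausible region and refit the GP there, where the log-likelihood varies over a much narrower range.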
Results - Design 1 - 128 pts
Diagnostics for GP 1 - threshold = 5.6
Results - Design 2 - 314 pts - 38% of space implausible
Diagnostics for GP 2 - threshold = -21.8
Design 3 - 149 pts - 62% of space implausible
Diagnostics for GP 3 - threshold = -20.7
Design 4 - 400 pts - 95% of space implausible
Diagnostics for GP 4 - threshold = -16.4
MCMC Results
[Figure: marginal posterior densities for the parameters r, sig.e and phi; black = Wood's MCMC posterior, green = GP posterior.]
Computational details
The Wood MCMC method used 10^5 × 500 simulator runs.
The GP code used (128 + 314 + 149 + 400) = 991 × 500 simulator runs, roughly 1/100th of the number used by Wood's method.
By the final iteration, the Gaussian processes had ruled out over 98% of the original input space as implausible, so the MCMC sampler did not need to waste time exploring those regions.
Conclusions
ABC allows inference in models for which it would otherwise be impossible.
It is not a silver bullet: if likelihood-based methods are possible, use them instead.
Algorithms and post-hoc regression can greatly improve computational efficiency, but computation is still usually the limiting factor.
The challenge is to develop more efficient methods to allow inference in more expensive models.
Areas for improvement (particularly those relevant to ML)?
Automatic summary selection and dimension reduction
Improved modelling in regression adjustments
Learning of model error πε(D|X )
Accelerated inference via likelihood modelling
Use of variational methods
. . .
Thank you for listening!
[email protected], www.maths.nottingham.ac.uk/personal/pmzrdw/
References - basics
Included in order of appearance in the tutorial, rather than importance! Far from exhaustive - apologies to those I've missed.
Murray, Ghahramani, MacKay, NIPS, 2012
Tanaka, Francis, Luciani and Sisson, Genetics 2006.
Wilkinson, Tavare, Theoretical Population Biology, 2009.
Neal and Huang, arXiv, 2013.
Beaumont, Zhang, Balding, Genetics 2002
Tavare, Balding, Griffiths, Genetics 1997
Diggle, Gratton, JRSS Ser. B, 1984
Rubin, Annals of Statistics, 1984
Wilkinson, SAGMB 2013.
Fearnhead and Prangle, JRSS Ser. B, 2012
Kennedy and O’Hagan, JRSS Ser. B, 2001
References - algorithms
Marjoram, Molitor, Plagnol, Tavare, PNAS, 2003
Sisson, Fan, Tanaka, PNAS, 2007
Beaumont, Cornuet, Marin, Robert, Biometrika, 2008
Toni, Welch, Strelkowa, Ipsen, Stumpf, Interface, 2009.
Del Moral, Doucet, Stat. Comput. 2011
Drovandi, Pettitt, Biometrics, 2011.
Lee, Proc 2012 Winter Simulation Conference, 2012.
Lee, Latuszynski, arXiv, 2013.
Del Moral, Doucet, Jasra, JRSS Ser. B, 2006.
Sisson and Fan, Handbook of MCMC, 2011.
References - links to other algorithms
Craig, Goldstein, Rougier, Seheult, JASA, 2001
Fearnhead and Prangle, JRSS Ser. B, 2011.
Wood Nature, 2010
Nott and Marshall, Water resources research, 2012
Nott, Fan, Marshall and Sisson, arXiv, 2012.
GP-ABC:
Wilkinson, arXiv, 2013
Meeds and Welling, arXiv, 2013.
References - regression adjustment
Beaumont, Zhang, Balding, Genetics, 2002
Blum, Francois, Stat. Comput. 2010
Blum, JASA, 2010
Leuenberger, Wegmann, Genetics, 2010
References - summary statistics
Blum, Nunes, Prangle, Sisson, Stat. Sci., 2012
Joyce and Marjoram, Stat. Appl. Genet. Mol. Biol., 2008
Nunes and Balding, Stat. Appl. Genet. Mol. Biol., 2010
Fearnhead and Prangle, JRSS Ser. B, 2011
Wilkinson, PhD thesis, University of Cambridge, 2007
Grelaud, Robert, Marin, Comptes Rendus Mathematique, 2009
Robert, Cornuet, Marin, Pillai, PNAS, 2011
Didelot, Everitt, Johansen, Lawson, Bayesian analysis, 2011.