An intro to particle methods for parameter inference in state-space models. PhD course FMS020F–NAMS002 "Statistical inference for partially observed stochastic processes", Lund University. http://goo.gl/sX8vU9
Umberto Picchini, Centre for Mathematical Sciences, Lund University. www.maths.lth.se/matstat/staff/umberto/
This lecture will explore possibilities offered by particle filters, also known as Sequential Monte Carlo (SMC) methods, when applied to parameter estimation problems.
We will not give a thorough introduction to SMC methods.
Instead, we will introduce the most basic (popular) SMC method with the main goal of performing inference for θ, a vector of unknown parameters entering a state-space / hidden Markov model.
Results anticipation
Thanks to recent and astonishing results, we show how it is possible to construct a practical method for exact Bayesian inference on the parameters of a state-space model.
We consider hereafter the (important) case where data are modelled according to a Hidden Markov Model (HMM), often denoted state-space model (SSM).
(In recent literature the terms SSM and HMM have been used interchangeably. We do the same here, though elsewhere it is sometimes assumed that HMM → discrete space and SSM → continuous space.)
A state-space model refers to a class of probabilistic graphical models that describes the probabilistic dependence between latent state variables and the observed measurements of a dynamical system.
Notation: here and in the following, N(µ, σ²) is the Gaussian distribution with mean µ and variance σ². N(x; µ, σ²) is the evaluation at x of the pdf of a N(µ, σ²) distribution.
Always remember X0 is NOT the state at the first observational time; that one is X1. Instead, X0 is the (typically unknown) system's initial state for the process {Xt}.
X0 can be set to be a deterministic constant (as in the example we are going to discuss soon).
In general {Xt} and {Yt} can be either continuous- or discrete-valued stochastic processes. However, in the following we assume {Xt} and {Yt} to be defined on continuous spaces.
In all previous slides we haven't made explicit reference to the presence of unknown constants that we wish to estimate.
For example, in the Gaussian random walk example the unknowns would be the variances (q², r²) (and perhaps the initial state x0, in some situations).
In that case we would like to estimate θ = (q², r²) using observations y1:T. More in general...
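To make the example concrete, here is a minimal simulation sketch, assuming the standard form of the Gaussian random walk model, Xt = Xt−1 + N(0, q²), Yt = Xt + N(0, r²); the function name and the parameter values are illustrative choices, not part of the slides:

```python
import numpy as np

def simulate_ssm(T, q, r, x0=0.0, rng=None):
    """Simulate the Gaussian random walk SSM:
       X_t = X_{t-1} + N(0, q^2),  Y_t = X_t + N(0, r^2)."""
    rng = rng or np.random.default_rng(0)
    x = np.empty(T)
    y = np.empty(T)
    x_prev = x0
    for t in range(T):
        x_prev = x_prev + q * rng.standard_normal()  # latent state transition
        x[t] = x_prev
        y[t] = x_prev + r * rng.standard_normal()    # noisy observation
    return x, y

x, y = simulate_ssm(T=100, q=0.5, r=1.0)
```

Only y1:T would be available to the statistician; x1:T is latent.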
Main goal
We introduce general methods for SSMs producing inference for the vector of parameters θ. We will be particularly interested in Bayesian inference.
Of course θ can contain all sorts of unknowns, not just variances.
A quick look into the final goal
p(y1:T|θ) is the likelihood of the measurements conditionally on θ. π(θ) is the prior density of θ (we always assume continuous-valued parameters). It encloses knowledge about θ before we "see" our current data y1:T. Bayes' theorem gives the posterior distribution:

π(θ|y1:T) = p(y1:T|θ)π(θ) / p(y1:T) ∝ p(y1:T|θ)π(θ)

Inference based on π(θ|y1:T) is called Bayesian inference. p(y1:T) is the marginal likelihood (evidence), independent of θ. Goal: calculate (sample draws from) π(θ|y1:T). Remark: θ is a random quantity in the Bayesian framework.
As you know... (remember we set some background requirements for taking this course) Bayesian inference can rarely be performed without using some Monte Carlo sampling.
Except for simple cases (e.g. data y1, ..., yT independently distributed from some member of the exponential family) it is usually impossible to write the likelihood p(y1:T|θ) in closed form.
Since the early 90's, MCMC (Markov chain Monte Carlo) has opened the possibility to perform Bayesian inference in practice.
Are we obsessed with Bayesian inference? NO! However, it sometimes offers the easiest way to deal with complex, non-trivial models (some good reads are listed at the end of these slides).
The Bayesian approach actually opens the possibility for a surprising result (see later on...).
First of all, notice that in the Bayesian framework, since θ is random, we do not simply write the likelihood function as pθ(y1:T) nor p(y1:T; θ), but we must condition on θ and write p(y1:T|θ).
In a SSM data are not independent; they are only conditionally independent → complication!

p(y1:T|θ) = p(y1|θ) ∏_{t=2}^T p(yt|y1:t−1, θ) = ?

We don't have a closed-form expression for the product above because we do not know how to calculate p(yt|y1:t−1, θ).
Despite the analytic difficulties, finding approximations for the likelihood function is possible (we'll consider some approaches soon).
In some simple cases, closed-form solutions do exist: for example, when the SSM is linear and Gaussian (see the Gaussian random walk example) the classic Kalman filter gives the exact likelihood.
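For the scalar Gaussian random walk this exact likelihood takes only a few lines, via the Kalman prediction-error decomposition. A sketch, assuming the model Xt = Xt−1 + N(0, q²), Yt = Xt + N(0, r²); the prior moments m0, P0 for X0 are hypothetical choices:

```python
import numpy as np

def kalman_loglik(y, q2, r2, m0=0.0, P0=1.0):
    """Exact log-likelihood log p(y_{1:T} | q^2, r^2) for the scalar
    Gaussian random walk SSM, via Kalman prediction errors."""
    m, P = m0, P0
    ll = 0.0
    for yt in y:
        # predict: X_t | y_{1:t-1} ~ N(m, P + q2)
        P = P + q2
        # prediction error decomposition: Y_t | y_{1:t-1} ~ N(m, P + r2)
        S = P + r2
        ll += -0.5 * (np.log(2 * np.pi * S) + (yt - m) ** 2 / S)
        # update with the Kalman gain K
        K = P / S
        m = m + K * (yt - m)
        P = (1 - K) * P
    return ll
```

Each factor p(yt|y1:t−1, θ) in the likelihood product is here a Gaussian with known moments, which is exactly what is unavailable in the general non-linear case.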
In the SSM literature, important (Gaussian) approximations are given by the extended and unscented Kalman filters. However, approximations offered by particle filters (a.k.a. sequential Monte Carlo) are presently the state of the art for general non-linear, non-Gaussian SSMs.
By repeating steps 1–2 as much as wanted we are guaranteed that, by discarding a "long enough" number of iterations (burn-in), the remaining draws form a Markov chain (hence dependent values) having π(θ|y1:T) as their stationary distribution.
Therefore if you have produced R = R1 + R2 iterations of Metropolis-Hastings, where R1 is a sufficiently long burn-in, for scalar θ you can then plot the histogram of the last R2 draws θ_{R1+1}, ..., θ_R. Such a histogram gives the density π(θ|y1:T), up to a Monte Carlo error induced by using a finite R2.
For a vector-valued θ ∈ R^p, create p separate histograms of the draws pertaining to each component of θ. Such histograms represent the posterior marginals π(θj|y), j = 1, ..., p.
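The recipe above can be sketched generically. In this illustration a standard normal log-density stands in for the log-posterior log π(θ|y1:T), and the step size, iteration count and burn-in length are arbitrary choices:

```python
import numpy as np

def metropolis_hastings(logpost, theta0, n_iter, step, rng=None):
    """Random-walk Metropolis: the Gaussian proposal is symmetric,
    so the acceptance ratio reduces to pi(theta*) / pi(theta)."""
    rng = rng or np.random.default_rng(1)
    theta = theta0
    lp = logpost(theta)
    draws = np.empty(n_iter)
    for i in range(n_iter):
        prop = theta + step * rng.standard_normal()  # propose theta*
        lp_prop = logpost(prop)
        if np.log(rng.uniform()) < lp_prop - lp:     # accept with prob min(1, A)
            theta, lp = prop, lp_prop
        draws[i] = theta                             # on rejection, repeat theta
    return draws

# stand-in target: a N(0,1) log-density plays the role of log pi(theta | y)
draws = metropolis_hastings(lambda th: -0.5 * th**2, 0.0, 20000, 1.0)
burnin = 5000
posterior_sample = draws[burnin:]  # histogram these to approximate pi(theta|y)
```

Note the chain keeps repeated values on rejection: those repeats are part of the sample, not errors.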
Notice Metropolis-Hastings is not an optimization algorithm.
Unlike in maximum likelihood estimation, here we are not trying to converge towards some mode.
What we want is to explore thoroughly (i.e. sample from) π(θ|data), including its tails.
This way we can directly assess the uncertainty about θ by looking at π(θ|data), instead of having to resort to asymptotic arguments regarding the sampling distribution of θ̂n when n → ∞ as in ML theory.
When h(·) is chosen in an intelligent way, an important property is the one that allows a sequential update of the weights. After some derivation on the board, we have (see p. 121 in Särkkä¹ and p. 5 in Cappé et al.²)

w_t^i ∝ [p(yt|x_t^i) p(x_t^i|x_{t−1}^i) / h(x_t^i|x_{0:t−1}^i, y1:t)] · w_{t−1}^i

However, the dependence of w_t on w_{t−1} is a curse (with remedy) as described in the next slide.
¹Särkkä, "Bayesian Filtering and Smoothing". ²Cappé, Godsill and Moulines.
Particle degeneracy occurs when at time t all but one of the importance weights w_t^i are close to zero. This implies a poor approximation to p(yt|y1:t−1).
Notice that when a particle gets a zero weight (or a "small" positive weight that your computer sets to zero → numerical underflow) that particle is doomed! Since

w_t^i ∝ [p(yt|x_t^i) p(x_t^i|x_{t−1}^i) / h(x_t^i|x_{0:t−1}^i, y1:t)] · w_{t−1}^i

if for a given i we have w_{t−1}^i = 0, particle i will have zero weight at all subsequent times.
A life-saving solution is to use resampling with replacement (Gordon et al.³).
1. Interpret w̃_t^i as the probability to sample x_t^i from the weighted set {x_t^i, w̃_t^i, i = 1, ..., N}, with w̃_t^i := w_t^i / ∑_i w_t^i.
2. Draw N times with replacement from the weighted set. Replace the old particles with the new ones {x̃_t^1, ..., x̃_t^N}.
3. Reset the weights w_t^i = 1/N (the resampling has destroyed the information on "how" we reached time t).
Since resampling is done with replacement, a particle with a large weight is likely to be drawn multiple times. Particles with very small weights are not likely to be drawn at all. Nice!
³Gordon, Salmond and Smith. IEEE Proceedings F. 140(2), 1993.
Sequential Importance Sampling Re-sampling (SISR)
We are now in the position to introduce a method that samples from h(·), so that these samples can be used as if they were from p(x0:t|y1:t−1), provided they are appropriately re-weighted.
1. Sample x_t^1, ..., x_t^N ∼ h(·) (same as before)
2. Compute weights w_t^i (same as before)
3. Normalize weights w̃_t^i = w_t^i / ∑_{i=1}^N w_t^i
4. We have a discrete probability distribution {x_t^i, w̃_t^i}, i = 1, ..., N
5. Resample N times with replacement from the set {x_t^1, ..., x_t^N} according to the weights w̃_t^i
So the bootstrap filter by Gordon et al. (1993)⁴ easily provides what we need!
p̂(yt|y1:t−1) = (1/N) ∑_{i=1}^N w_t^i

Finally, a likelihood approximation:

p̂(y1:T) = p̂(y1) ∏_{t=2}^T p̂(yt|y1:t−1)
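Putting the pieces together (propagate, weight, resample, accumulate p̂(yt|y1:t−1)), a minimal bootstrap filter for the Gaussian random walk model might look like this; the model form and Gaussian observation density are assumptions carried over from that example:

```python
import numpy as np

def bootstrap_filter(y, q, r, N=500, x0=0.0, rng=None):
    """Bootstrap particle filter for the Gaussian random walk SSM.
    Returns the log-likelihood estimate
    log p_hat(y_{1:T}) = sum_t log( (1/N) sum_i w_t^i )."""
    rng = rng or np.random.default_rng(2)
    x = np.full(N, x0)
    loglik = 0.0
    for yt in y:
        # propagate: h(.) = transition density (the bootstrap proposal)
        x = x + q * rng.standard_normal(N)
        # weight: w_t^i = p(y_t | x_t^i), Gaussian observation density
        logw = -0.5 * (np.log(2 * np.pi * r**2) + (yt - x) ** 2 / r**2)
        w = np.exp(logw - logw.max())              # stabilised on log-scale
        loglik += logw.max() + np.log(w.mean())    # log( (1/N) sum_i w_t^i )
        # multinomial resampling with replacement
        idx = rng.choice(N, size=N, p=w / w.sum())
        x = x[idx]
    return loglik
```

Because the bootstrap proposal is the transition density and weights are reset to 1/N by resampling, the incremental weight reduces to p(yt|x_t^i), as on the previous slides.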
Put θ back in the notation so as to obtain:
approximate maximum likelihood: θ̂mle = argmax_θ p̂(y1:T; θ)
or exact Bayesian inference by using p̂(y1:T|θ) inside Metropolis-Hastings. Why exact?? Let's check it after the example...
⁴Gordon, Salmond and Smith. IEEE Proceedings F. 140(2), 1993.
Back to the nonlinear SSM example
We can now comment on how we obtained the previously shown results (re-proposed here). We used the bootstrap filter with N = 500 particles and R = 10,000 MCMC iterations.
When coding your algorithms, try to consider the following before normalizing the weights:
Code unnormalised weights on log-scale: e.g. when w_t^i = N(yt|x_t^i) the exp() in the Gaussian pdf will likely produce an underflow (w_t^i = 0) for x_t^i far from yt. Solution: reason in terms of log w instead of w.
However, afterwards we necessarily have to go back to w := exp(logw), then normalize, and the above might still not be enough. Solution: subtract the maximum (log)weight from each (log)weight, e.g. set logw := logw − max(logw). This is totally ok: the importance of particles is unaffected, as all weights are only scaled by the same constant exp(c), with c = max(logw).
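The two tricks above amount to the usual log-sum-exp stabilisation; a small sketch (the function name is mine):

```python
import numpy as np

def normalize_logweights(logw):
    """Normalize particle weights stored on log-scale.
    Subtracting max(logw) before exponentiating avoids underflow; the
    normalized weights are unchanged because every weight is scaled by
    the same constant exp(max(logw))."""
    logw = np.asarray(logw, dtype=float)
    w = np.exp(logw - logw.max())   # largest weight becomes exp(0) = 1
    return w / w.sum()

# weights this small underflow to 0 if exponentiated directly ...
logw = np.array([-1000.0, -1001.0, -1002.0])
wtilde = normalize_logweights(logw)  # ... but normalize fine on log-scale
```

Without the subtraction, all three weights would be exp(−1000) ≈ 0 in double precision and the normalization would divide by zero.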
Quite astonishingly, Andrieu and Roberts⁵ proved that using an unbiased estimate of the likelihood function inside the MCMC routine is sufficient to obtain exact Bayesian inference!
That is, using the acceptance ratio

A = [p̂(y1:T|θ*) / p̂(y1:T|θ)] × [π(θ*) / π(θ)] × [q(θ|θ*) / q(θ*|θ)]

will return a Markov chain with stationary distribution π(θ|y1:T), regardless of the finite number N of particles used to approximate the likelihood!
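A minimal sketch of the resulting particle marginal Metropolis-Hastings loop; note that the likelihood estimate for the current θ is recycled across iterations, not recomputed. Here a deliberately noisy but unbiased toy likelihood estimate stands in for the SMC estimator, and the flat prior and tuning values are illustrative choices:

```python
import numpy as np

def pmmh(loglik_hat, logprior, theta0, n_iter, step, rng=None):
    """Particle marginal Metropolis-Hastings: plug the estimated
    likelihood p_hat(y|theta) into the usual MH acceptance ratio,
    keeping the current estimate fixed until a proposal is accepted."""
    rng = rng or np.random.default_rng(3)
    theta = theta0
    ll = loglik_hat(theta, rng)      # log p_hat(y | theta), kept until acceptance
    draws = np.empty(n_iter)
    for i in range(n_iter):
        prop = theta + step * rng.standard_normal()  # symmetric RW proposal
        ll_prop = loglik_hat(prop, rng)
        log_A = (ll_prop + logprior(prop)) - (ll + logprior(theta))
        if np.log(rng.uniform()) < log_A:
            theta, ll = prop, ll_prop
        draws[i] = theta
    return draws

# toy stand-in for the SMC estimator: exp(eps - s^2/2) with eps ~ N(0, s^2)
# has expectation 1, so p_hat is an unbiased estimate of the N(0,1) target
def noisy_loglik(theta, rng, s=0.3):
    return -0.5 * theta**2 + s * rng.standard_normal() - 0.5 * s**2

draws = pmmh(noisy_loglik, lambda th: 0.0, 0.0, 20000, 1.0)
```

Despite the noise in the likelihood estimate, the chain's marginal stationary distribution is still the exact target, which is the whole point of the pseudo-marginal argument.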
The good news is that E(p̂(y1:T|θ)) = p(y1:T|θ), with p̂(y1:T|θ) obtained via SMC.
⁵Andrieu and Roberts (2009). Annals of Statistics, 37(2), 697–725.
The previous result will be considered, in my opinion, one of the most important statistical results of the XXI century.
In fact, it offers an "exact-approximate" approach: because of computing limitations we can only produce N < ∞ particles, while still being reassured to obtain exact (Bayesian) inference under minor assumptions.
But let's give a rapid (technically informal) look at why it works.
Key result: unbiasedness (Del Moral 2004⁶)

We have that

E(p̂(y1:T|θ)) = ∫ p̂(y1:T|θ, ξ) p(ξ) dξ = p(y1:T|θ)

with ξ ∼ p(ξ) the vector of all random variates generated during SMC (both to propagate forward the state and to perform particle resampling).
⁶Easier to look at: Pitt, Silva, Giordani and Kohn. J. Econometrics 171, 2012.
To prove the exactness of the approach we look at the (easier and less general) argument in sec. 2.2 of Pitt, Silva, Giordani and Kohn, J. Econometrics 171, 2012.
To simplify the notation take y := y1:T .
Let π̂(θ, ξ|y) denote the approximate joint posterior of (θ, ξ) obtained via SMC:

π̂(θ, ξ|y) = p̂(y|θ, ξ) p(ξ) π(θ) / p(y)

(notice ξ and θ are assumed a-priori independent).

Notice we put p(y), not p̂(y), at the denominator: this follows from the unbiasedness assumption, as we obtain

∫∫ p̂(y|θ, ξ) p(ξ) π(θ) dξ dθ = ∫ p(y|θ) π(θ) dθ = p(y),

hence π̂(θ, ξ|y) integrates to one.
Now, we know that applying an MCMC targeting π̂(θ, ξ|y), then discarding the output pertaining to ξ, corresponds to integrating out ξ from the posterior:

∫ π̂(θ, ξ|y) dξ = [π(θ|y) / p(y|θ)] ∫ p̂(y|θ, ξ) p(ξ) dξ = π(θ|y),

since E(p̂(y|θ)) = ∫ p̂(y|θ, ξ) p(ξ) dξ = p(y|θ).
We are thus performing a pseudo-marginal approach: "marginal" because we disregard ξ, and "pseudo" because we use p̂(·), not p(·).
Therefore we have proved that using MCMC on an (artificially) augmented posterior, and then discarding from the output all the random variates created during SMC, returns exact Bayesian inference.
Notice that discarding the ξ is something we naturally do in Metropolis-Hastings, hence nothing strange is happening here. The ξ are just instrumental, uninteresting variates, independent of θ and independent of {Xt}.
The pomp pmcmc function runs a pseudo-marginal MCMC. Assume we are only interested in (r, φ) and keep σ = 0.3 constant ("known"). Here we use 500 particles with 5,000 MCMC iterations. We start at (r, φ) = (7.4, 5). We assumed flat (improper) priors.
We used a non-adaptive Gaussian random walk (variances are kept fixed). You might get better results with an adaptive version.
A topic set as an (optional) exercise is to think about why the method fails at estimating parameters of a nearly deterministic (smaller σ) stochastic Ricker model.
Furthermore, for Nt deterministic (σ = 0), where particle filters are not applicable (nor needed), exact likelihood calculation is also challenging.
A great coverage of this issue is in Fasiolo et al. (2015), arXiv:1411.4564, comparing particle marginal methods, approximate Bayesian computation, iterated filtering and more. A very much recommended read.
Doucet, Pitt and Kohn. Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator. arXiv:1210.1871 (2012).
Pitt, dos Santos Silva, Giordani and Kohn. On some properties of Markov chain Monte Carlo simulation methods based on the particle filter. Journal of Econometrics 171, no. 2 (2012): 134–151.
Sherlock, Thiery, Roberts and Rosenthal. On the efficiency of pseudo-marginal random walk Metropolis algorithms. arXiv:1309.7209 (2013).
This is the simplest to explain (though not the most efficient) resampling scheme.
At time t we wish to sample from a population of weighted particles {x_t^i, w̃_t^i, i = 1, ..., N}. What we actually do is sample N particle indices with replacement from the population {(i, w̃_t^i), i = 1, ..., N}. This will be a sample of size N from a multinomial distribution.
Pick a particle from the "urn": the larger its probability w̃_t^i, the more likely it is to be picked. Record its index i and put it back in the urn. Repeat for a total of N times.
To code the sampling procedure we just need to recall the inverse transform method. For a generic random variable X, let FX be an invertible cdf. We can sample an x from FX using x := FX⁻¹(u), with u ∼ U(0, 1).
For example, let's start from a simple special case of the multinomial distribution, the Bernoulli distribution.
X ∈ {0, 1} with p = P(X = 1), 1 − p = P(X = 0). Then

FX(x) = 0 for x < 0;  1 − p for 0 ≤ x < 1;  1 for x ≥ 1.   (1)

Draw the "stair" represented by the plot of FX. Generate a u ∼ U(0, 1) and "hit the stair's steps": if 0 < u ≤ 1 − p then set x := 0, and if u > 1 − p set x := 1.
For the multinomial case it is a simple generalization. Drop time t and set w̃^i = p_i. FX is a stair with N steps. Shoot a u ∼ U(0, 1) and return the index i of the step that is hit, i.e. the smallest i such that w̃^1 + ... + w̃^i ≥ u.
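This stair-hitting procedure takes a few lines to code: the cumulative sum of the normalized weights is the stair FX, and a binary search finds the step each uniform draw hits (the function name is mine):

```python
import numpy as np

def multinomial_resample(wtilde, rng=None):
    """Draw N particle indices with replacement, index i with
    probability wtilde[i], via the inverse-cdf ("stair") construction:
    for each u ~ U(0,1) return the first index whose cumulative
    weight reaches u."""
    rng = rng or np.random.default_rng(4)
    N = len(wtilde)
    cdf = np.cumsum(wtilde)         # the "stair" F_X
    u = rng.uniform(size=N)
    return np.searchsorted(cdf, u)  # first i with cdf[i] >= u

wtilde = np.array([0.1, 0.6, 0.3])
idx = multinomial_resample(wtilde)  # indices of the surviving particles
```

More efficient variants (systematic, stratified, residual resampling) follow the same stair idea but spread the uniforms more evenly.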
Cool reads on Bayesian methods (titles are linked)
You have an engineering/signal processing background: check S. Särkkä, "Bayesian Filtering and Smoothing" (free PDF from the author!).
You are a data scientist: check K. Murphy, "Machine Learning: a Probabilistic Perspective".
You are a theoretical statistician: check C. Robert, "The Bayesian Choice".
You are interested in bioinformatics/systems biology: check D. Wilkinson, "Stochastic Modelling for Systems Biology, 2nd ed.".
You are interested in inference for SDEs with applications to life sciences: check the book by Wilkinson above and C. Fuchs, "Inference for Diffusion Processes".
Cool reads on Bayesian methods (titles are linked)
You are a computational statistician: check the "Handbook of MCMC". Older (but excellent) titles are: J. Liu, "Monte Carlo Strategies in Scientific Computing" and Casella–Robert, "Monte Carlo Statistical Methods".
You want a practical, hands-on and (almost) maths-free introduction: check "The BUGS Book" and "Doing Bayesian Data Analysis".